[ https://issues.apache.org/jira/browse/OOZIE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Satish Subhashrao Saley reassigned OOZIE-2326:
----------------------------------------------

    Assignee: Satish Subhashrao Saley

> oozie/yarn/spark: active container remains after failed job
> ------------------------------------------------------------
>
>                 Key: OOZIE-2326
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2326
>             Project: Oozie
>          Issue Type: Bug
>          Components: workflow
>    Affects Versions: 4.1.0
>         Environment: pseudo-distributed (single VM), CentOS 6.6, CDH 5.4.3
>            Reporter: Diana Carroll
>            Assignee: Satish Subhashrao Saley
>         Attachments: container-logs.txt, ooziejob-logs.txt, yarnbug1.png, yarnbug2.png
>
>
> The issue occurs when I launch a Spark job (local mode) that fails (my example failed because it tried to read a non-existent file). When this occurs, the job fails and YARN ends up in a strange state: the ResourceManager shows the launched job has completed, but a container for the job is still live on the slave node. Because I'm running in pseudo-distributed mode, this completely hangs my cluster: no other jobs can run because there are only resources for a single container, and that container is running the dead Oozie launcher.
> If I wait long enough, YARN will eventually time out, release the container, and start accepting new jobs. But until then I'm dead in the water.
> Attaching screenshots that show the state right after running the failed job:
> the RM shows no jobs running
> the node shows one container running
> Also attaching log files for the Oozie job and the container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
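
Editor's note: a minimal sketch of the kind of failing Spark job described in the report, assuming a PySpark job submitted in local mode; the input path is hypothetical, and any missing file should trigger the same kind of failure.

    # repro.py - minimal PySpark job that fails because its input path does not exist
    # (hypothetical path; any missing file reproduces the failure described in this issue)
    from pyspark import SparkContext

    sc = SparkContext(appName="OOZIE-2326-repro")
    # textFile() is lazy, so the missing path is only detected when count() runs,
    # at which point the job fails with an input-path-does-not-exist error
    sc.textFile("hdfs:///tmp/does-not-exist.txt").count()
    sc.stop()

Submitting a job like this through an Oozie Spark action in local mode should fail the action as described above; the question in this issue is whether the Oozie launcher's container is then released promptly.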