[ https://issues.apache.org/jira/browse/MAPREDUCE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135232#comment-13135232 ]
Todd Lipcon commented on MAPREDUCE-3260:
----------------------------------------

There seem to be some reducers stuck in KILLING state on some of the nodes. The only non-daemon thread is:
{code}
"main" prio=10 tid=0x0000000046f7b800 nid=0x3774 waiting on condition [0x000000004033e000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at java.lang.Thread.sleep(Thread.java:298)
	at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:328)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:117)
{code}

Logs in the NM show the following which looks like a race:
{code}
2011-10-25 00:44:42,842 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1319528200416_0004_01_002409 of type KILL_CONTAINER
2011-10-25 00:44:42,842 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1319528200416_0004_01_002409 transitioned from LOCALIZED to KILLING
2011-10-25 00:44:42,842 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Processing container_1319528200416_0004_01_002409 of type CONTAINER_LAUNCHED
2011-10-25 00:44:42,860 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Can't handle this event at current state: Current: [KILLING], eventType: [CONTAINER_LAUNCHED]
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_LAUNCHED at KILLING
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:803)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:70)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:373)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:366)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:116)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
	at java.lang.Thread.run(Thread.java:619)
2011-10-25 00:44:42,879 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1319528200416_0004_01_002409 transitioned from KILLING to null
{code}

> Yarn app stuck in KILL_WAIT state
> ---------------------------------
>
>                 Key: MAPREDUCE-3260
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3260
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> Last night I killed an MR2 app using "hadoop job -kill". This morning I
> noticed it's still running, but in "KILL_WAIT" state with no tasks running.
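
As a rough illustration of the event ordering in the NM log above (KILL_CONTAINER handled while the container is still LOCALIZED, then CONTAINER_LAUNCHED arriving once it is already KILLING), here is a minimal standalone sketch. The class ContainerRaceSketch, its enums, and its transition table are hypothetical and exist only for illustration. They are not the actual ContainerImpl state machine, and the sketch does not use any YARN classes.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified model of the ordering seen in the log.
// This is NOT the real ContainerImpl transition table.
final class ContainerRaceSketch {
    enum State { LOCALIZED, RUNNING, KILLING }
    enum Event { CONTAINER_LAUNCHED, KILL_CONTAINER }

    // Assumed transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);
    static {
        Map<Event, State> fromLocalized = new EnumMap<>(Event.class);
        fromLocalized.put(Event.CONTAINER_LAUNCHED, State.RUNNING);
        fromLocalized.put(Event.KILL_CONTAINER, State.KILLING);
        TABLE.put(State.LOCALIZED, fromLocalized);

        Map<Event, State> fromRunning = new EnumMap<>(Event.class);
        fromRunning.put(Event.KILL_CONTAINER, State.KILLING);
        TABLE.put(State.RUNNING, fromRunning);
        // KILLING has no row here, so a late CONTAINER_LAUNCHED is rejected.
    }

    // Returns the next state, or throws if no transition is registered.
    static State next(State current, Event event) {
        Map<Event, State> row = TABLE.get(current);
        State target = (row == null) ? null : row.get(event);
        if (target == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        return target;
    }

    public static void main(String[] args) {
        State s = State.LOCALIZED;
        // The kill is processed first...
        s = next(s, Event.KILL_CONTAINER);
        System.out.println("after KILL_CONTAINER: " + s);
        // ...and the launch notification arrives afterward.
        try {
            next(s, Event.CONTAINER_LAUNCHED);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this toy model the late CONTAINER_LAUNCHED is simply rejected. Whether the real container should instead accept it in KILLING, so that the launched process still gets cleaned up, is a question for the actual fix and is not answered by this sketch.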