[ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482716#comment-13482716 ]
Robert Joseph Evans commented on YARN-167:
------------------------------------------

I am rather nervous about back porting MAPREDUCE-3353. It is a major feature that has a significant footprint and was not all that stable when it first went in. I know that it has since stabilized but I am still nervous about such a large change. It seems like it would be simpler to handle the KILL events in the states that missed it.

> AM stuck in KILL_WAIT for days
> ------------------------------
>
>                 Key: YARN-167
>                 URL: https://issues.apache.org/jira/browse/YARN-167
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 0.23.3
>            Reporter: Ravi Prakash
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: TaskAttemptStateGraph.jpg
>
>
> We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them
> as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a
> few maps running. All these maps were scheduled on nodes which are now in the
> RM's Lost nodes list. The running maps are in the FAIL_CONTAINER_CLEANUP state