[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15226956#comment-15226956 ]
Nathan Roberts commented on YARN-4924:
--------------------------------------

Observed the following race with NM recovery.

1) ContainerManager handles a FINISH_APPS event, causing storeFinishedApplication() to record the finished application in the state store (e.g. if the RM kills the application).
2) Before the containers associated with this application are cleaned up, the NM dies.
3) When the NM restarts, it attempts to recover the Application, Containers, and FinishedApplication events associated with this application, in that order.
4) This leads to a NEW to DONE transition for the containers, which will not try to clean up the actual container, since this is supposed to be a pre-LAUNCHED transition.

IIUC, this happens because when the application transitions from NEW to INITING during Application recovery, the containerInitEvents aren't actually dispatched yet. They are delayed until the AppInitDoneTransition. However, the AppInitDoneTransition may not occur until after the recovery code has handled the FinishedApplicationEvent and queued up KILL_CONTAINER events. So, in effect, the containerKillEvents pass the containerInitEvents, leading to the NEW to DONE transition.
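To make the ordering problem concrete, here is a minimal single-threaded sketch of the race. It is not NodeManager code: the class RecoveryRaceSketch, its event names, and the collapsed RUNNING and KILLING container states are all hypothetical simplifications. It assumes only what is described above, namely that container init events are held back until app init completes, and that a KILL_CONTAINER arriving while the container is still NEW moves it straight to DONE with no cleanup.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of the recovery race; not the real ApplicationImpl/ContainerImpl.
final class RecoveryRaceSketch {
    enum AppState { NEW, INITING, RUNNING, FINISHING_CONTAINERS_WAIT }
    enum ContainerState { NEW, RUNNING, KILLING, DONE }
    enum Event { APP_INITED, FINISH_APP, INIT_CONTAINER, KILL_CONTAINER }

    AppState app = AppState.NEW;
    ContainerState container = ContainerState.NEW;
    boolean cleanupRequested = false;
    final Deque<Event> queue = new ArrayDeque<>();
    final List<String> log = new ArrayList<>();

    private void moveApp(AppState next) {
        log.add("app " + app + " -> " + next);
        app = next;
    }

    private void moveContainer(ContainerState next) {
        log.add("container " + container + " -> " + next);
        container = next;
    }

    private void handle(Event e) {
        switch (e) {
            case APP_INITED:
                // Container init is only dispatched once app init is done.
                if (app == AppState.INITING) {
                    moveApp(AppState.RUNNING);
                    queue.add(Event.INIT_CONTAINER);
                }
                break;
            case FINISH_APP:
                if (app == AppState.INITING || app == AppState.RUNNING) {
                    moveApp(AppState.FINISHING_CONTAINERS_WAIT);
                    queue.add(Event.KILL_CONTAINER);
                }
                break;
            case INIT_CONTAINER:
                if (container == ContainerState.NEW) {
                    moveContainer(ContainerState.RUNNING);
                }
                break;
            case KILL_CONTAINER:
                if (container == ContainerState.NEW) {
                    // Treated as pre-launch: nothing is cleaned up.
                    moveContainer(ContainerState.DONE);
                } else if (container == ContainerState.RUNNING) {
                    cleanupRequested = true;
                    moveContainer(ContainerState.KILLING);
                }
                break;
        }
    }

    // Replays recovery with the finish event queued before or after app init completes.
    static RecoveryRaceSketch run(boolean finishBeforeInitDone) {
        RecoveryRaceSketch s = new RecoveryRaceSketch();
        s.moveApp(AppState.INITING); // recovery thread: NEW -> INITING
        if (finishBeforeInitDone) {
            s.queue.add(Event.FINISH_APP);
            s.queue.add(Event.APP_INITED);
        } else {
            s.queue.add(Event.APP_INITED);
            s.queue.add(Event.FINISH_APP);
        }
        while (!s.queue.isEmpty()) {
            s.handle(s.queue.poll());
        }
        return s;
    }

    public static void main(String[] args) {
        for (boolean racy : new boolean[] {true, false}) {
            RecoveryRaceSketch s = run(racy);
            System.out.println("finishBeforeInitDone=" + racy + " " + s.log
                + " cleanupRequested=" + s.cleanupRequested);
        }
    }
}
```

In this model, queueing the finish event ahead of app-init completion leaves the container in DONE without any cleanup request, which is the same shape as the transitions in the log below. With the opposite ordering, the kill finds the container already initialized and requests cleanup.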
{noformat}
2016-04-04 18:20:45,513 [main] INFO application.ApplicationImpl: Application application_1458666253602_2367938 transitioned from NEW to INITING
2016-04-04 18:20:56,437 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Adding container_e08_1458666253602_2367938_01_000004 to application application_1458666253602_2367938
2016-04-04 18:20:57,062 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Application application_1458666253602_2367938 transitioned from INITING to FINISHING_CONTAINERS_WAIT
2016-04-04 18:20:57,095 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e08_1458666253602_2367938_01_000004 transitioned from NEW to DONE
2016-04-04 18:20:57,120 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Removing container_e08_1458666253602_2367938_01_000004 from application application_1458666253602_2367938
2016-04-04 18:20:57,120 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Application application_1458666253602_2367938 transitioned from FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
{noformat}

> NM recovery race can lead to container not cleaned up
> -----------------------------------------------------
>
>                 Key: YARN-4924
>                 URL: https://issues.apache.org/jira/browse/YARN-4924
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>   Affects Versions: 3.0.0, 2.7.2
>            Reporter: Nathan Roberts
>
> It's probably a small window but we observed a case where the NM crashed and
> then a container was not properly cleaned up during recovery.
> I will add details in first comment.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)