[ 
https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15226956#comment-15226956
 ] 

Nathan Roberts commented on YARN-4924:
--------------------------------------

Observed the following race with NM recovery.

1) ContainerManager handles a FINISH_APPS event, causing 
storeFinishedApplication() to record the finished application in the state 
store (e.g. if the RM kills the application)
2) The NM dies before cleaning up the containers associated with this 
application
3) When the NM restarts, it attempts to recover the Application, Containers, 
and FinishedApplication events associated with this application, in that order
4) This leads to a NEW to DONE transition for the containers, which does not 
try to clean up the actual container since this is supposed to be a 
pre-LAUNCHED transition

IIUC, this happens because when the application transitions from NEW to INITING 
during Application recovery, the containerInitEvents aren't actually dispatched 
yet; they are delayed until the AppInitDoneTransition. However, the 
AppInitDoneTransition may not occur until after the recovery code has handled 
the FinishedApplicationEvent and queued up KILL_CONTAINER events. So, in 
effect, the containerKillEvents overtook the containerInitEvents, leading to 
the NEW to DONE transition.

{noformat}
2016-04-04 18:20:45,513 [main] INFO application.ApplicationImpl: Application application_1458666253602_2367938 transitioned from NEW to INITING
2016-04-04 18:20:56,437 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Adding container_e08_1458666253602_2367938_01_000004 to application application_1458666253602_2367938
2016-04-04 18:20:57,062 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Application application_1458666253602_2367938 transitioned from INITING to FINISHING_CONTAINERS_WAIT
2016-04-04 18:20:57,095 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e08_1458666253602_2367938_01_000004 transitioned from NEW to DONE
2016-04-04 18:20:57,120 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Removing container_e08_1458666253602_2367938_01_000004 from application application_1458666253602_2367938
2016-04-04 18:20:57,120 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Application application_1458666253602_2367938 transitioned from FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
{noformat}
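
To make the ordering easier to see, here is a minimal sketch of the event 
sequence described above. It is a toy single-queue model, not the NodeManager 
code: RecoveryRaceSketch, cleanupRuns(), the Event enum and the ACTIVE state 
are all hypothetical names made up for illustration, and only the NEW/DONE 
states and the FINISH_APPS/KILL_CONTAINER names come from the description. It 
assumes that container init is only queued once app init completes, and that a 
kill delivered to a NEW container goes straight to DONE without cleanup.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the race. None of these types are the
// real NodeManager classes.
class RecoveryRaceSketch {
    enum Event { APP_INIT_DONE, FINISH_APPS, INIT_CONTAINER, KILL_CONTAINER }

    // ACTIVE is a made-up stand-in for "any state past NEW".
    enum ContainerState { NEW, ACTIVE, DONE }

    // Replays recovery through one FIFO queue and reports whether the
    // container cleanup path would have run.
    static boolean cleanupRuns(boolean appInitDoneFirst) {
        Queue<Event> queue = new ArrayDeque<>();
        if (appInitDoneFirst) {
            queue.add(Event.APP_INIT_DONE);
            queue.add(Event.FINISH_APPS);
        } else {
            // The ordering suspected above: the finished-application event
            // is handled before app init has completed.
            queue.add(Event.FINISH_APPS);
            queue.add(Event.APP_INIT_DONE);
        }

        ContainerState container = ContainerState.NEW;
        boolean cleanedUp = false;
        while (!queue.isEmpty()) {
            switch (queue.remove()) {
                case APP_INIT_DONE:
                    // Container init is delayed until app init is done.
                    queue.add(Event.INIT_CONTAINER);
                    break;
                case FINISH_APPS:
                    queue.add(Event.KILL_CONTAINER);
                    break;
                case INIT_CONTAINER:
                    if (container == ContainerState.NEW) {
                        container = ContainerState.ACTIVE;
                    }
                    break;
                case KILL_CONTAINER:
                    // Assumed: a kill in NEW skips cleanup entirely.
                    if (container == ContainerState.ACTIVE) {
                        cleanedUp = true;
                    }
                    container = ContainerState.DONE;
                    break;
            }
        }
        return cleanedUp;
    }

    public static void main(String[] args) {
        System.out.println("kill overtakes init, cleanup runs: " + cleanupRuns(false));
        System.out.println("init before kill, cleanup runs: " + cleanupRuns(true));
    }
}
```

In this model the kill is queued ahead of the init whenever FINISH_APPS is 
handled before app init completes, so the container is still NEW when the kill 
arrives and the cleanup branch is never taken.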



> NM recovery race can lead to container not cleaned up
> -----------------------------------------------------
>
>                 Key: YARN-4924
>                 URL: https://issues.apache.org/jira/browse/YARN-4924
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.0.0, 2.7.2
>            Reporter: Nathan Roberts
>
> It's probably a small window but we observed a case where the NM crashed and 
> then a container was not properly cleaned up during recovery.
> I will add details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
