[jira] [Commented] (YARN-4273) Containers can be leaked due to race between application being killed and NM registering back after recovery

Jason Lowe (JIRA) Fri, 16 Oct 2015 07:35:44 -0700

    [ https://issues.apache.org/jira/browse/YARN-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960791#comment-14960791 ]


Jason Lowe commented on YARN-4273:
----------------------------------

Wait, isn't the RM supposed to remember to tell the NM when the application 
finishes?  When the NM registers, it reports not only which containers are 
still running but also which applications are still active.  When the 
application finishes (because it was killed or for any other reason), the RM 
tells the NM that the app is finished, either directly on the reconnect event 
because the app is unknown, or from RMAppImpl when it gets the 
APP_RUNNING_ON_NODE event.  So I don't think these containers will be leaked 
forever.  Eventually the app will finish and the NM will be told about it.  
When the NM hears that the app is finished, it kills all containers belonging 
to that app.
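
To make that flow concrete, here is a rough toy model of it. This is only a 
sketch: ToyNodeManager, ToyResourceManager and their methods are hypothetical 
names invented for illustration and are not the real YARN classes or APIs. It 
assumes that a registering node reports its active applications, that the RM 
cleans up unknown or already finished apps right away, and that it otherwise 
remembers the node so it can be notified when the app finishes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: these classes are hypothetical and are not the real YARN types.
public class AppCleanupSketch {

    /** Stand-in for a NodeManager that tracks containers per application. */
    static class ToyNodeManager {
        // application id -> container ids still running on this node
        final Map<String, Set<String>> running = new HashMap<>();

        void startContainer(String appId, String containerId) {
            running.computeIfAbsent(appId, k -> new HashSet<>()).add(containerId);
        }

        /** Applications the node reports as active when it registers. */
        Set<String> activeApps() {
            return new HashSet<>(running.keySet());
        }

        /** Called when the RM says an app is finished: drop all of its containers. */
        void onAppFinished(String appId) {
            running.remove(appId);
        }
    }

    /** Stand-in for the ResourceManager side of the exchange. */
    static class ToyResourceManager {
        // application id -> true once the app has finished
        final Map<String, Boolean> apps = new HashMap<>();
        // application id -> nodes that reported the app as running
        final Map<String, List<ToyNodeManager>> nodesByApp = new HashMap<>();

        void submit(String appId) {
            apps.put(appId, false);
        }

        /** Node registration: unknown or finished apps are cleaned up right away. */
        void register(ToyNodeManager nm) {
            for (String appId : nm.activeApps()) {
                Boolean finished = apps.get(appId);
                if (finished == null || finished) {
                    nm.onAppFinished(appId);
                } else {
                    // Rough analogue of the app handling APP_RUNNING_ON_NODE.
                    nodesByApp.computeIfAbsent(appId, k -> new ArrayList<>()).add(nm);
                }
            }
        }

        /** App finishes (killed or otherwise): notify every node that ran it. */
        void finish(String appId) {
            apps.put(appId, true);
            List<ToyNodeManager> nodes = nodesByApp.remove(appId);
            if (nodes != null) {
                for (ToyNodeManager nm : nodes) {
                    nm.onAppFinished(appId);
                }
            }
        }
    }

    public static void main(String[] args) {
        ToyResourceManager rm = new ToyResourceManager();
        ToyNodeManager nm = new ToyNodeManager();
        rm.submit("app_1");
        nm.startContainer("app_1", "container_1");
        nm.startContainer("app_2", "container_2"); // app_2 is unknown to the RM

        rm.register(nm);
        System.out.println("after register: " + nm.activeApps());

        rm.finish("app_1");
        System.out.println("after finish: " + nm.activeApps());
    }
}
```

In this model the container of the unknown app is dropped at registration, and 
the container of the known app is dropped once the app finishes, which is the 
"not leaked forever" behavior described above.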

> Containers can be leaked due to race between application being killed and NM 
> registering back after recovery
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4273
>                 URL: https://issues.apache.org/jira/browse/YARN-4273
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Varun Saxena
>            Assignee: Varun Saxena
>
> This issue is based on the discussion in YARN-4000.
> Consider this scenario:
> 1) The application is recovered and added to the scheduler, but a slow NM 
> has not yet re-registered, so its containers are not yet recovered.
> 2) The user kills this app.
> 3) CapacityScheduler#doneApplicationAttempt is called, and the containers 
> tracked by the RM so far are killed. Note that 
> CapacityScheduler#doneApplication is not called, so the scheduler still has 
> the SchedulerApplication in memory.
> 4) The slow NM now re-registers and tries to recover its containers. If the 
> application is set to keep containers across attempts, these containers will 
> be recovered even though the application is in the process of being killed. 
> These containers will not be killed later on. Hence, they are leaked.
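
The four steps in the quoted description can be walked through with a small 
toy model of the scheduler side. This is a sketch under stated assumptions, 
not the real CapacityScheduler code: ToyScheduler and its fields are 
hypothetical, and the rule that a late container is accepted only when the app 
keeps containers across attempts is taken from step 4 of the description.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical names, not the real CapacityScheduler code.
public class LeakRaceSketch {

    static class ToyScheduler {
        // app id -> containers the scheduler currently tracks for that app
        final Map<String, Set<String>> trackedByApp = new HashMap<>();
        final Set<String> keepContainersApps = new HashSet<>();
        final Set<String> attemptsDone = new HashSet<>();

        /** Step 1: the recovered app is added before the slow node reports in. */
        void addApplication(String appId, boolean keepContainers) {
            trackedByApp.put(appId, new HashSet<>());
            if (keepContainers) {
                keepContainersApps.add(appId);
            }
        }

        /** Step 3: kill what is tracked so far, but keep the app entry in memory. */
        void doneApplicationAttempt(String appId) {
            trackedByApp.get(appId).clear();
            attemptsDone.add(appId);
        }

        /** Step 4: a late node reports a container; returns true if it is recovered. */
        boolean recoverContainer(String appId, String containerId) {
            Set<String> tracked = trackedByApp.get(appId);
            if (tracked == null) {
                return false; // app already removed from the scheduler
            }
            if (attemptsDone.contains(appId) && !keepContainersApps.contains(appId)) {
                return false; // assumption: not recovered without keep-containers
            }
            tracked.add(containerId);
            return true;
        }
    }

    public static void main(String[] args) {
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.addApplication("app_1", true);      // step 1
        scheduler.doneApplicationAttempt("app_1");    // steps 2 and 3
        boolean recovered =
            scheduler.recoverContainer("app_1", "container_1"); // step 4
        System.out.println("recovered after kill: " + recovered);
        System.out.println("still tracked: " + scheduler.trackedByApp.get("app_1"));
    }
}
```

The model only shows that the late container ends up tracked after the kill 
pass has already run. Whether it stays leaked depends on whether something 
later tells the NM that the app has finished.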



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
