[ https://issues.apache.org/jira/browse/YARN-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509153#comment-14509153 ]

Jason Lowe commented on YARN-3535:
----------------------------------

I think we need to fix the RMContainerImpl ALLOCATED to KILLED transition, but 
I think there's another bug here.  I believe the container was killed in the 
first place because the RMNodeImpl reconnect transition makes a racy 
assumption.  When the node reconnects, it checks whether the node reports any 
running applications.  If it reports none, it sends a node-removed event 
followed by a node-added event to the scheduler, which causes the scheduler to 
kill all containers allocated on that node.  However, the node only knows 
about a container once the AM has acquired the container and tried to launch 
it on that node, which can take minutes to transpire.  It's therefore 
dangerous to assume that a node reporting no applications has nothing pending.
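
To make the race concrete, here's a rough sketch of the reconnect handling 
described above.  These are simplified, hypothetical types for illustration 
only, not the actual RMNodeImpl or scheduler event classes:

{code:java}
import java.util.List;

class ReconnectSketch {

  // Stand-in for the scheduler's node-removed/node-added event handling.
  interface SchedulerEventHandler {
    void nodeRemoved(String nodeId);  // scheduler kills every container it holds for the node
    void nodeAdded(String nodeId);    // scheduler re-adds the node with a clean slate
  }

  /**
   * The racy behavior: an empty application list from the reconnecting node is
   * treated as "nothing is pending on this node", so the node is torn down and
   * re-added.  A container the scheduler has in ALLOCATED state (not yet
   * acquired and launched by the AM) is invisible to the node, so it gets
   * killed here even though work is still pending for it.
   */
  static void onReconnect(String nodeId,
                          List<String> runningAppsReportedByNode,
                          SchedulerEventHandler scheduler) {
    if (runningAppsReportedByNode.isEmpty()) {
      scheduler.nodeRemoved(nodeId);  // kills ALLOCATED containers the node never saw
      scheduler.nodeAdded(nodeId);
    }
    // ... otherwise the node is updated in place and existing containers survive ...
  }
}
{code}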

I think we'll have to revisit the solution to YARN-2561 to either eliminate 
this race or make it safe if it does occur.  Ideally we shouldn't be sending a 
remove/add event to the scheduler when the node is reconnecting, but we do 
need to make sure we cancel containers on the node that are no longer running.  
Since the node reports what containers it has when it reconnects, it seems we 
can relay that information to the scheduler so it can correct anything that 
doesn't match up.  Any container in the RUNNING state that no longer appears 
in the container list reported at registration can be killed by the scheduler, 
just as it does when a node is removed.  I believe that would fix YARN-2561 
and also avoid this race.
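
Something along these lines is what I'm picturing: the scheduler keeps the 
node and only kills RUNNING containers the node no longer reports.  Again, 
these are stand-in types, not the real RMContainer or scheduler classes:

{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;

class ReconcileSketch {

  enum State { ALLOCATED, ACQUIRED, RUNNING }

  // Minimal stand-in for the scheduler's view of a container on the node.
  static class SchedulerContainer {
    final String containerId;
    State state;
    SchedulerContainer(String id, State s) { containerId = id; state = s; }
  }

  /**
   * On reconnect, compare the scheduler's containers for the node against the
   * container list the node reported at registration.  Only RUNNING containers
   * missing from the report are killed; ALLOCATED/ACQUIRED containers, which
   * the node cannot know about yet, are left alone, avoiding the race.
   */
  static List<SchedulerContainer> containersToKill(
      Collection<SchedulerContainer> schedulerViewOfNode,
      Set<String> containerIdsReportedByNode) {
    List<SchedulerContainer> toKill = new ArrayList<>();
    for (SchedulerContainer c : schedulerViewOfNode) {
      if (c.state == State.RUNNING
          && !containerIdsReportedByNode.contains(c.containerId)) {
        toKill.add(c);  // same treatment as when a node is removed
      }
    }
    return toKill;
  }
}
{code}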

cc: [~djp] as this also has potential ramifications for graceful decommission.  
If we try to gracefully decommission a node that isn't currently reporting any 
applications, we may also need to verify the scheduler hasn't allocated or 
handed out a container for that node that hasn't reached the node yet.

>  ResourceRequest should be restored back to scheduler when RMContainer is 
> killed at ALLOCATED
> ---------------------------------------------------------------------------------------------
>
>                 Key: YARN-3535
>                 URL: https://issues.apache.org/jira/browse/YARN-3535
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>            Assignee: Peng Zhang
>         Attachments: syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM's start of a container on the NM 
> failed, and then the job hung there.
> AM logs are attached.



