[ 
https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903400#comment-14903400
 ] 

Varun Saxena commented on YARN-4000:
------------------------------------

bq. Is this the case? I think in current code, RM is still ignoring these 
orphan containers?
In recoverContainersOnNode, if the application is not found in the scheduler, the 
flow in the RM (looking at trunk code) is as follows:
# AbstractYarnScheduler#killOrphanContainerOnNode will be called if the application 
is not found in the scheduler; it will in turn post a CLEANUP_CONTAINER event for 
each container which has not finished. This event will be handled by RMNodeImpl. 
Note that one CLEANUP_CONTAINER event is sent per container even though all 
containers of a running app will have to be cleaned up. Maybe this can be 
refactored to send a single event carrying all the containers for an app on a 
node, but cleaning up a large number of containers like this may be a rare 
scenario.
# Going further, in RMNodeImpl this event will be processed by 
CleanUpContainerTransition, which adds the container to the set 
containersToClean.
# When the next heartbeat from the NM arrives, ResourceTrackerService#nodeHeartbeat 
will call RMNodeImpl#updateNodeHeartbeatResponseForCleanup, which populates the 
response with the containers to clean up from the set containersToClean. These 
containers are hence reported back to the NM in the heartbeat response (see the 
sketch after this list).

On the NM side, the flow is as follows:
# In NodeStatusUpdaterImpl, the containers to clean up will be retrieved from 
the heartbeat response and a CMgrCompletedContainersEvent will be dispatched.
# In ContainerManagerImpl, this event will be processed and a 
ContainerKillEvent created for each container. 
# Then, depending on the state of the container, ContainerImpl will send a 
CLEANUP_CONTAINER event to ContainersLauncher, which will then send a TERM/KILL 
signal to the container (see the sketch after this list).
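
A matching sketch of the NM-side handling, with the same caveat: NmCleanupSketch, 
ContainerManagerModel and LauncherModel are hypothetical stand-ins for 
ContainerManagerImpl / ContainersLauncher, not their real API:
{code:java}
import java.util.List;

// Simplified model of the NM-side flow above; illustrative only.
public class NmCleanupSketch {

  // Stands in for ContainersLauncher: signals the container process.
  static class LauncherModel {
    void cleanupContainer(String containerId) {
      System.out.println("Sending SIGTERM (then SIGKILL if needed) to " + containerId);
    }
  }

  // Stands in for ContainerManagerImpl: turns the cleanup list from the
  // heartbeat response into a kill per container.
  static class ContainerManagerModel {
    private final LauncherModel launcher = new LauncherModel();

    // Step 2: one kill action per container carried by the event.
    void onCompletedContainersEvent(List<String> containersToCleanup) {
      for (String containerId : containersToCleanup) {
        // Step 3: the real ContainerImpl first checks the container state,
        // then forwards a CLEANUP_CONTAINER event to the launcher.
        launcher.cleanupContainer(containerId);
      }
    }
  }

  public static void main(String[] args) {
    // Step 1: NodeStatusUpdaterImpl reads the cleanup list from the
    // heartbeat response and dispatches an event with it.
    List<String> fromHeartbeatResponse = List.of("container_01", "container_02");
    new ContainerManagerModel().onCompletedContainersEvent(fromHeartbeatResponse);
  }
}
{code}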

> RM crashes with NPE if leaf queue becomes parent queue during restart
> ---------------------------------------------------------------------
>
>                 Key: YARN-4000
>                 URL: https://issues.apache.org/jira/browse/YARN-4000
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-4000.01.patch, YARN-4000.02.patch, 
> YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch
>
>
> This is a similar situation to YARN-2308.  If an application is active in 
> queue A and then the RM restarts with a changed capacity scheduler 
> configuration where queue A becomes a parent queue to other subqueues then 
> the RM will crash with a NullPointerException.


