[ https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903400#comment-14903400 ]
Varun Saxena commented on YARN-4000:
------------------------------------

bq. Is this the case? I think in current code, RM is still ignoring these orphan containers?

In recoverContainersOnNode, if the application is not found in the scheduler, the flow in the RM (looking at trunk code) is as follows:
# AbstractYarnScheduler#killOrphanContainerOnNode is called when the application is not found in the scheduler. It posts a CLEANUP_CONTAINER event for each container that has not finished, and these events are handled by RMNodeImpl. Note that we send one CLEANUP_CONTAINER event per container even though all containers of a running app have to be cleaned up; this could be refactored to send a single event carrying all the containers for an app and node, but cleaning up that many containers is probably a rare scenario.
# In RMNodeImpl, the event is processed by CleanUpContainerTransition, which adds the container to the containersToClean set.
# When the next heartbeat from the NM arrives, ResourceTrackerService#nodeHeartbeat calls RMNodeImpl#updateNodeHeartbeatResponseForCleanup, which populates the heartbeat response with the containers in containersToClean. These containers are thus reported back to the NM in the heartbeat response.

On the NM side, the flow is:
# In NodeStatusUpdaterImpl, the containers to clean up are retrieved from the heartbeat response and a CMgrCompletedContainersEvent is dispatched.
# ContainerManagerImpl processes this event and creates a ContainerKillEvent for each container.
# Depending on the state of the container, ContainerImpl sends a CLEANUP_CONTAINER event to ContainersLauncher, which then sends a TERM/KILL signal to the container.
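The two flows above can be sketched in a toy Java model. This is a hypothetical simplification, not actual Hadoop code: the class and method names only loosely mirror RMNodeImpl, updateNodeHeartbeatResponseForCleanup, and the NM-side kill path, and the event dispatchers are collapsed into direct method calls.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the orphan-container cleanup handshake described above.
public class OrphanContainerCleanupSketch {

    // RM side: per-node set of containers awaiting cleanup
    // (loosely modeled on RMNodeImpl#containersToClean).
    static class RMNode {
        private final Set<String> containersToClean = new HashSet<>();

        // Stands in for CleanUpContainerTransition: one CLEANUP_CONTAINER
        // event is handled per container.
        void handleCleanupContainerEvent(String containerId) {
            containersToClean.add(containerId);
        }

        // Stands in for updateNodeHeartbeatResponseForCleanup: drain the
        // pending set into the heartbeat response for this node.
        List<String> updateNodeHeartbeatResponseForCleanup() {
            List<String> response = new ArrayList<>(containersToClean);
            containersToClean.clear();
            return response;
        }
    }

    // NM side: each container listed in the heartbeat response becomes a
    // kill request (stands in for CMgrCompletedContainersEvent ->
    // ContainerKillEvent -> ContainersLauncher sending TERM/KILL).
    static class NodeManager {
        final List<String> killed = new ArrayList<>();

        void processHeartbeatResponse(List<String> containersToCleanup) {
            killed.addAll(containersToCleanup);
        }
    }

    public static void main(String[] args) {
        RMNode rmNode = new RMNode();
        // Scheduler found no application for these containers (orphans):
        rmNode.handleCleanupContainerEvent("container_1");
        rmNode.handleCleanupContainerEvent("container_2");

        // Next NM heartbeat picks them up; the NM then kills them:
        NodeManager nm = new NodeManager();
        nm.processHeartbeatResponse(rmNode.updateNodeHeartbeatResponseForCleanup());
        System.out.println(nm.killed.size()); // 2
    }
}
```

One property worth noting in the sketch: because containersToClean is drained when the heartbeat response is built, each orphan container is reported to the NM only once.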
> RM crashes with NPE if leaf queue becomes parent queue during restart
> ---------------------------------------------------------------------
>
>                 Key: YARN-4000
>                 URL: https://issues.apache.org/jira/browse/YARN-4000
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-4000.01.patch, YARN-4000.02.patch, YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch
>
>
> This is a similar situation to YARN-2308. If an application is active in queue A and then the RM restarts with a changed capacity scheduler configuration where queue A becomes a parent queue to other subqueues, then the RM will crash with a NullPointerException.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)