[ https://issues.apache.org/jira/browse/YARN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236724#comment-15236724 ]
sandflee commented on YARN-2567:
--------------------------------

There may be one problem: if an NM is recovered in a finished state and then registers with running containers, we would normally kill those containers. That can go wrong in the following sequence:

1. The NM is LOST and the RM stores the LOST status successfully.
2. The RM fails over and the NM is recovered as LOST.
3. The NM registers and becomes RUNNING, {color:red}but the RM's store of the RUNNING state fails or is delayed{color}.
4. The RM allocates a container on the NM, and the container runs on it.
5. The RM fails over again and the NM is recovered as LOST.
6. The NM registers with the RM, and the RM kills the container on it. This is not expected.

To fix this, one solution is to store the NM status first and only then let the NM become RUNNING, but this may delay NM registration in a big cluster.

> Add a percentage-node threshold for RM to wait for new allocations after
> restart/failover
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-2567
>                 URL: https://issues.apache.org/jira/browse/YARN-2567
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>
> This is the remaining part of YARN-2001 - to halt allocations after restart
> till x% of nodes sync back with the RM. This is useful for avoiding bad
> scheduling during the time the nodes are still joining back after a
> restart/failover.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
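
For illustration only, the six-step sequence in the comment and the proposed store-first ordering can be replayed with a toy model. This is a minimal sketch, not YARN code: `ToyStateStore`, `ToyRM`, `run_scenario`, the `store_first` flag, the node id `nm1` and the container id `c1` are all hypothetical names invented here, and the store is reduced to an in-memory dict whose writes can be made to fail.

```python
from enum import Enum


class NodeState(Enum):
    RUNNING = "RUNNING"
    LOST = "LOST"


class ToyStateStore:
    """Hypothetical stand-in for the RM state store (not a YARN class)."""

    def __init__(self):
        self.saved = {}           # node_id -> last persisted NodeState
        self.fail_writes = False  # simulates a failed or delayed write

    def store(self, node_id, state):
        if self.fail_writes:
            return False
        self.saved[node_id] = state
        return True


class ToyRM:
    """Hypothetical RM; constructing one models a restart/failover."""

    def __init__(self, store, store_first):
        self.store = store
        self.store_first = store_first
        self.nodes = dict(store.saved)  # recover node states from the store

    def node_lost(self, node_id):
        self.store.store(node_id, NodeState.LOST)
        self.nodes[node_id] = NodeState.LOST

    def register(self, node_id, running_containers=()):
        """Register a node; return the containers killed by the RM."""
        killed = []
        if self.nodes.get(node_id) is NodeState.LOST and running_containers:
            killed = list(running_containers)
        if self.store_first:
            # Proposed ordering: persist RUNNING before the transition.
            if not self.store.store(node_id, NodeState.RUNNING):
                return killed  # write failed, node stays in its old state
        else:
            # Problematic ordering: the write result is not waited on.
            self.store.store(node_id, NodeState.RUNNING)
        self.nodes[node_id] = NodeState.RUNNING
        return killed

    def allocate(self, node_id, container_id):
        """Allocate only on nodes this RM considers RUNNING."""
        return self.nodes.get(node_id) is NodeState.RUNNING


def run_scenario(store_first):
    store = ToyStateStore()
    rm = ToyRM(store, store_first)
    rm.node_lost("nm1")             # step 1: LOST is stored
    rm = ToyRM(store, store_first)  # step 2: failover, recovered as LOST
    store.fail_writes = True
    rm.register("nm1")              # step 3: storing RUNNING fails
    store.fail_writes = False
    running = []
    if rm.allocate("nm1", "c1"):    # step 4: container placed on the node
        running.append("c1")
    rm = ToyRM(store, store_first)  # step 5: second failover
    return rm.register("nm1", running)  # step 6: are containers killed?


if __name__ == "__main__":
    print("store after transition:", run_scenario(store_first=False))
    print("store before transition:", run_scenario(store_first=True))
```

In this toy model, storing first means a failed write leaves the node out of the RUNNING state, so nothing is allocated on it and nothing is killed after the next failover. The cost is the one the comment names: registration has to wait for the store write.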