[ 
https://issues.apache.org/jira/browse/YARN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236724#comment-15236724
 ] 

sandflee commented on YARN-2567:
--------------------------------

There may be one problem here: if an NM is recovered in a finished state (e.g. 
LOST) and then registers with running containers, we would normally kill those 
containers. That can go wrong in the following sequence:
1. The NM is LOST and the RM stores the LOST status successfully.
2. The RM fails over and the NM is recovered as LOST.
3. The NM registers and becomes RUNNING, {color:red}but the RM fails to store 
the RUNNING state, or the store is delayed{color}.
4. The RM allocates a container on the NM, and the container runs on it.
5. The RM fails over again and the NM is recovered as LOST.
6. The NM registers with the RM and the RM kills the container on it, which is 
not expected.

One way to fix this is to store the NM status first and only then let the NM 
become RUNNING, but this may delay NM registration on a large cluster.
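
To make the ordering issue concrete, here is a toy simulation of the six steps. 
It is only a sketch under my own assumptions: StateStore, ResourceManager, 
register() and run_scenario() are hypothetical names invented for illustration, 
not actual YARN classes or APIs, and a failed store write is modelled as a 
simple one-shot flag.

```python
from enum import Enum


class NodeState(Enum):
    LOST = "LOST"
    RUNNING = "RUNNING"


class StateStore:
    """Hypothetical persistent store that survives an RM failover."""

    def __init__(self):
        self.persisted = {}
        self.fail_next_write = False

    def write(self, node_id, state):
        # Return False when the write is lost or delayed (step 3).
        if self.fail_next_write:
            self.fail_next_write = False
            return False
        self.persisted[node_id] = state
        return True


class ResourceManager:
    """Toy RM: in-memory node states are rebuilt from the store on start."""

    def __init__(self, store, store_first):
        self.store = store
        self.store_first = store_first
        self.nodes = dict(store.persisted)  # recovery after failover

    def register(self, node_id, running_containers):
        """Return (accepted, killed_containers) for an NM registration."""
        killed = []
        # A node recovered as LOST that reports containers gets them killed.
        if self.nodes.get(node_id) is NodeState.LOST and running_containers:
            killed = list(running_containers)
        if self.store_first:
            # Proposed fix: persist RUNNING before the node becomes RUNNING.
            if not self.store.write(node_id, NodeState.RUNNING):
                return False, []  # registration not accepted, NM must retry
            self.nodes[node_id] = NodeState.RUNNING
        else:
            # Current ordering in this model: the write result is ignored.
            self.nodes[node_id] = NodeState.RUNNING
            self.store.write(node_id, NodeState.RUNNING)
        return True, killed


def run_scenario(store_first):
    store = StateStore()
    store.write("nm1", NodeState.LOST)          # step 1
    rm = ResourceManager(store, store_first)    # step 2: recovered as LOST
    store.fail_next_write = True
    accepted, _ = rm.register("nm1", [])        # step 3: store fails
    containers = ["c1"] if accepted else []     # step 4: allocate only if RUNNING
    rm = ResourceManager(store, store_first)    # step 5: second failover
    _, killed = rm.register("nm1", containers)  # step 6
    return killed
```

In this model, run_scenario(False) ends with container c1 killed, while 
run_scenario(True) never allocates on the node until RUNNING has been stored, 
so nothing is killed. The cost is the one mentioned above: registration has to 
wait for the store write.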

> Add a percentage-node threshold for RM to wait for new allocations after 
> restart/failover
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-2567
>                 URL: https://issues.apache.org/jira/browse/YARN-2567
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>
> This is the remaining part of YARN-2001 - to halt allocations after restart 
> till x% of nodes sync back with the RM. This is useful for avoiding bad 
> scheduling during the time the nodes are still joining back after a 
> restart/failover.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
