[ 
https://issues.apache.org/jira/browse/YARN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237129#comment-15237129
 ] 

Jason Lowe commented on YARN-2567:
----------------------------------

The problem with delaying the state store operations, or otherwise making them 
asynchronous with the state changes they are intended to record, is that 
recovery will be inconsistent whenever we fail between the state change and 
the state store operation.  IMHO we cannot let the NM registration complete, or 
at least cannot start using the node in a way that is inconsistent with the 
state currently recorded in the state store, until the state store operation 
completes.  So we might be able to let the node register, but we should not 
allocate and launch new containers on it until the state store update 
completes, or we end up with the problem described above.
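
As a rough illustration of "let the node register, but hold off on 
allocations", here is a minimal Java sketch.  The StateStore interface, 
the NodeRegistrationGate class and all method names are hypothetical 
stand-ins for illustration only, not the actual RM or RMStateStore APIs.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

final class NodeRegistrationGate {
    /** Hypothetical stand-in for the RM state store; not a real YARN interface. */
    interface StateStore {
        CompletableFuture<Void> storeNodeRegistered(String nodeId);
    }

    private final StateStore store;
    // Nodes whose registration has been durably recorded.
    private final Set<String> schedulable = ConcurrentHashMap.newKeySet();

    NodeRegistrationGate(StateStore store) {
        this.store = store;
    }

    /**
     * Starts the state store write for a registering node. The node becomes
     * schedulable only after the write completes successfully; if the write
     * fails, the node is never added to the schedulable set.
     */
    CompletableFuture<Void> register(String nodeId) {
        return store.storeNodeRegistered(nodeId)
                    .thenRun(() -> schedulable.add(nodeId));
    }

    /** The scheduler would consult this before placing containers on a node. */
    boolean canAllocateOn(String nodeId) {
        return schedulable.contains(nodeId);
    }
}
```

With this shape the registration itself need not block on the store, but 
the scheduler never sees state that is ahead of what a recovering RM would 
read back.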

In general there needs to be a minimal performance expectation from the state 
store for a given cluster setup, or the RM is going to do some bad things.  For 
example, we can't sustain a situation where applications are being submitted at 
a rate faster than we can record them to the state store.  Similarly, for large 
clusters it's going to be problematic if a large network cut occurs and we need 
to record the expiration of thousands of containers but can't do so in a 
reasonable timeframe.  If we tell applications that containers on those nodes 
are lost _before_ we record the lost node in the state store, and we then fail 
over before the node rejoins, the new RM instance won't know it's supposed 
to kill the containers on the rejoining node.  AMs probably won't appreciate 
being told a container has completed only to have it keep running and count 
against their user limits/headroom in the future.  Therefore we have to record 
the node as lost in the state store before we inform AMs that the containers 
and node are gone.
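
The ordering argued for above (record first, notify second) could be 
sketched as follows.  StateStore, AmNotifier and LostNodeHandler are 
hypothetical names made up for this example and do not correspond to real 
YARN classes; the store call is assumed to be synchronous.

```java
import java.util.List;

final class LostNodeHandler {
    /** Hypothetical synchronous state store; not the real RMStateStore API. */
    interface StateStore {
        void storeNodeLost(String nodeId) throws Exception;
    }

    /** Hypothetical callback that tells AMs their containers are gone. */
    interface AmNotifier {
        void containersLost(String nodeId, List<String> containerIds);
    }

    private final StateStore store;
    private final AmNotifier notifier;

    LostNodeHandler(StateStore store, AmNotifier notifier) {
        this.store = store;
        this.notifier = notifier;
    }

    /**
     * Handles an expired node. Returns true if AMs were notified, false if
     * the state store write failed and the notification was withheld.
     */
    boolean handleNodeExpired(String nodeId, List<String> containerIds) {
        try {
            // 1. Durably record the lost node so a failed-over RM knows to
            //    kill these containers if the node rejoins.
            store.storeNodeLost(nodeId);
        } catch (Exception e) {
            // Not recorded: telling AMs now would risk the inconsistency
            // described above, so hold the notification back.
            return false;
        }
        // 2. Only now tell the AMs that the containers and node are gone.
        notifier.containersLost(nodeId, containerIds);
        return true;
    }
}
```

The cost is that AM notification latency is bounded below by state store 
latency, which is why the store needs a minimal performance expectation.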


> Add a percentage-node threshold for RM to wait for new allocations after 
> restart/failover
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-2567
>                 URL: https://issues.apache.org/jira/browse/YARN-2567
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>
> This is the remaining part of YARN-2001 - to halt allocations after restart 
> till x% of nodes sync back with the RM. This is useful for avoiding bad 
> scheduling during the time the nodes are still joining back after a 
> restart/failover.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
