[ https://issues.apache.org/jira/browse/YARN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237129#comment-15237129 ]
Jason Lowe commented on YARN-2567: ---------------------------------- The problem with delaying or otherwise making the state store operations asynchronous with the state changes they are intended to record is it will always lead to inconsistent recovery if we fail between the state change and the state store operation. IMHO we cannot let the NM registration complete or at least start using the node in a way that is inconsistent with the state as currently recorded in the state store until the state store operation completes. So we might be able to let the node register, but we should not allocate and launch new containers on it until the state store update completes or we end up with the problem described above. In general there needs to be a minimal performance expectation from the state store for a given cluster setup or the RM is going to do some bad things. For example, we can't sustain a situation where applications are being submitted at a rate faster than we can record them to the state store. Similarly for large clusters it's going to be problematic if a large network cut occurs and we need to record the expiration of 1000's of containers but can't do so in a reasonable timeframe. If we tell applications that containers on those nodes are lost _before_ we record the lost node in the state store then if we failover before the node re-joins the new RM instance won't know it's supposed to kill the containers on the rejoining node. AMs probably won't appreciate being told a container has completed only to have it keep running and count against their user limits/headroom in the future. Therefore we have to record the node as lost in the state store before we inform AMs the containers and node are gone. > Add a percentage-node threshold for RM to wait for new allocations after > restart/failover > ----------------------------------------------------------------------------------------- > > Key: YARN-2567 > URL: https://issues.apache.org/jira/browse/YARN-2567 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager > Reporter: Vinod Kumar Vavilapalli > Assignee: Vinod Kumar Vavilapalli > > This is the remaining part of YARN-2001 - to halt allocations after restart > till x% of nodes sync back with the RM. This is useful for avoiding bad > scheduling during the time the nodes are still joining back after a > restart/failover. -- This message was sent by Atlassian JIRA (v6.3.4#6332)