[ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985709#comment-13985709 ]
Bikas Saha commented on YARN-2001:
----------------------------------

Requiring all NMs to re-register might be too constraining because, after a full code rollout, it may be common for some NMs not to come back. If the RM gets stuck waiting for a minority of NMs that have not re-registered, that would effectively be a loss of HA.

I like the idea of waiting for a time period before considering the cluster fully up. However, this timeout has to be small or else we will have a lot of downtime. Can this timeout be less than the AM liveness period? If not, then how do we treat AMs that are running on NMs that have not re-registered within the NM timeout?

> Persist NMs info for RM restart
> -------------------------------
>
>                 Key: YARN-2001
>                 URL: https://issues.apache.org/jira/browse/YARN-2001
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>
> RM should not accept allocate requests from AMs until all the NMs have
> registered with RM. For that, RM needs to remember the previous NMs and wait
> for all the NMs to register.
> This is also useful for remembering decommissioned nodes across restarts.
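
To make the timeout idea from the comment concrete, here is a minimal sketch of a gate that opens either when every previously known NM has re-registered or when a wait period has elapsed, whichever comes first. It is not taken from the YARN code base: the class name `NodeRegistrationGate`, its methods, and the use of plain strings as node IDs are all assumptions made for illustration, and the caller passes the current time in explicitly so the behaviour is easy to check.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch (not YARN code): decides when a restarted RM could
 * start accepting allocate requests from AMs.
 */
final class NodeRegistrationGate {
  // NMs remembered from before the restart (assumed to be persisted somewhere).
  private final Set<String> expected;
  // Remembered NMs that have re-registered since the restart.
  private final Set<String> registered = new HashSet<>();
  // Point in time after which the RM stops waiting for missing NMs.
  private final long deadlineMillis;

  NodeRegistrationGate(Collection<String> previouslyKnownNodes,
                       long startMillis, long timeoutMillis) {
    this.expected = new HashSet<>(previouslyKnownNodes);
    this.deadlineMillis = startMillis + timeoutMillis;
  }

  /** Records a re-registration; nodes not known before the restart are ignored here. */
  synchronized void onNodeRegistered(String nodeId) {
    if (expected.contains(nodeId)) {
      registered.add(nodeId);
    }
  }

  /** Open once all remembered NMs are back, or once the timeout has expired. */
  synchronized boolean isOpen(long nowMillis) {
    return registered.size() == expected.size() || nowMillis >= deadlineMillis;
  }

  /** Remembered NMs that have not re-registered yet. */
  synchronized Set<String> missingNodes() {
    Set<String> missing = new HashSet<>(expected);
    missing.removeAll(registered);
    return Collections.unmodifiableSet(missing);
  }
}
```

With this shape, a minority of NMs that never come back delays the RM by at most the timeout instead of blocking it indefinitely. The sketch deliberately leaves the open question unanswered: once the gate opens on timeout, `missingNodes()` identifies the nodes whose AMs and containers still need a policy decision.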