[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993852#comment-13993852 ]
Karthik Kambatla commented on YARN-556: --------------------------------------- Anubhav, Bikas, Jian, Vinod and myself synced up offline to discuss some of the action items. The general feedback was that the prototype is mostly good, except for how we want to implement the scheduler changes. Control flow of the RM restart recovery process is to look something like this. Please feel free to question the correctness. # RM loads the state from the state store # RM starts all its internal services. # RM starts the scheduler, so it could reconstruct the state from node heartbeats. # RM waits for a configurable period of time to allow a sufficient fraction of nodes to rejoin. Individual schedulers decide on how to handle nodes rejoining after this configurable period and before they expire. The options are to kill containers running on the nodes that came in late or to allow them to continue running and update scheduler data-structures. It sounded like CapacityScheduler would prefer the former and FairScheduler the latter - this can still change later. # After the configurable wait-time, the RM starts accepting RPCs from both new AMs and already existing AMs. # Existing AMs are expected to {{*resync*}} with the RM, which essentially translates to {{register}} followed by an {{allocate}} call that sends all the outstanding requests. AMRMClient is to handle all this, so user AMs using this automatically benefit from this. Other related items: # Currently, NM information is not persisted. We should persist it and reload on restart. # Update the javadoc for RegisterApplicationMasterResponse#getContainersFromPreviousAttempts to say it only gives information at that snapshot. Info on running containers might keep trickling in even after the register - we might need something similar in AllocateResponse to get this information, we might also want to add a degree of confidence to this response that is a fraction of nodes that have reconnected. > RM Restart phase 2 - Work preserving restart > -------------------------------------------- > > Key: YARN-556 > URL: https://issues.apache.org/jira/browse/YARN-556 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager > Reporter: Bikas Saha > Assignee: Bikas Saha > Attachments: Work Preserving RM Restart.pdf, > WorkPreservingRestartPrototype.001.patch > > > YARN-128 covered storing the state needed for the RM to recover critical > information. This umbrella jira will track changes needed to recover the > running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.2#6252)