[ 
https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993852#comment-13993852
 ] 

Karthik Kambatla commented on YARN-556:
---------------------------------------

Anubhav, Bikas, Jian, Vinod and myself synced up offline to discuss some of the 
action items. The general feedback was that the prototype is mostly good, 
except for how we want to implement the scheduler changes. 

Control flow of the RM restart recovery process is to look something like this. 
Please feel free to question the correctness.
# RM loads the state from the state store
# RM starts all its internal services. 
# RM starts the scheduler, so it could reconstruct the state from node 
heartbeats.
# RM waits for a configurable period of time to allow a sufficient fraction of 
nodes to rejoin. Individual schedulers decide on how to handle nodes rejoining 
after this configurable period and before they expire. The options are to kill 
containers running on the nodes that came in late or to allow them to continue 
running and update scheduler data-structures. It sounded like CapacityScheduler 
would prefer the former and FairScheduler the latter - this can still change 
later.
# After the configurable wait-time, the RM starts accepting RPCs from both new 
AMs and already existing AMs.
# Existing AMs are expected to {{*resync*}} with the RM, which essentially 
translates to {{register}} followed by an {{allocate}} call that sends all the 
outstanding requests. AMRMClient is to handle all this, so user AMs using this 
automatically benefit from this.

Other related items:
# Currently, NM information is not persisted. We should persist it and reload 
on restart.
# Update the javadoc for 
RegisterApplicationMasterResponse#getContainersFromPreviousAttempts to say it 
only gives information at that snapshot. Info on running containers might keep 
trickling in even after the register - we might need something similar in 
AllocateResponse to get this information, we might also want to add a degree of 
confidence to this response that is a fraction of nodes that have reconnected. 

> RM Restart phase 2 - Work preserving restart
> --------------------------------------------
>
>                 Key: YARN-556
>                 URL: https://issues.apache.org/jira/browse/YARN-556
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: Work Preserving RM Restart.pdf, 
> WorkPreservingRestartPrototype.001.patch
>
>
> YARN-128 covered storing the state needed for the RM to recover critical 
> information. This umbrella jira will track changes needed to recover the 
> running state of the cluster so that work can be preserved across RM restarts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to