[ 
https://issues.apache.org/jira/browse/YARN-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501361#comment-13501361
 ] 

Bikas Saha commented on YARN-230:
---------------------------------

Attaching 4 patches that break up the whole change for easier reviewing. They 
won't build on their own.
1) PB-impl.patch - Classes and PB impl for objects used to store Application 
and ApplicationAttempt data. Pretty straightforward code derived from similar 
code for other PB impls.
2) Store.patch - Implementation of RMStateStore abstract class that interfaces 
between RM and real store classes. It translates RM objects like RMAppImpl into 
PB objects like ApplicationStateData. It also provides a common implementation 
of blocking and non-blocking store functions. Non-blocking operations are 
performed using an AsyncDispatcher and RMStateStore events that eventually call 
implementations of abstract methods which will be provided by real stores. A 
memory store is implemented for testing.
3) Test.patch - One new test, TestRMRestart, is a functional test that takes 
the RM through storing and recovering state with applications in different 
states of execution. The flow should be easy to follow with comments. 
TestRMAppAttemptTransitions adds tests for the RMAppAttemptImpl state machine 
changes. Other changes are refactoring and the addition of helper methods to 
MockRM etc.
4) Recovery.patch - Implements the proposal in the design doc. RM startup loads 
old state and recovers from it if recovery is enabled. Each app that is 
recovered is submitted to the RMAppManager so that it re-hydrates all 
references and tokens like it would normally and is then ready to start its 
next attempt. Each recovered application attempt is added to the app's attempts 
collection and moved to a RECOVERED state. After that, the services, including 
the AsyncDispatcher, are started, which triggers creation of new attempts for 
the submitted apps. Existing code in ApplicationMasterService and 
ResourceTrackerService reboots NMs and previously running AMs. When a new app 
is submitted, its data is saved (blocking) before replying to the client. 
Just before an attempt is launched, its data is saved (non-blocking). To 
support the non-blocking state store, new RMAppAttempt states have been added 
for regular and unmanaged attempts. Once an app is finished, its data is 
removed.
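
To make the blocking vs. non-blocking store paths from (2) and (4) easier to 
picture, here is a minimal Java sketch. It is not code from the patches: every 
class and method name in it (SketchStateStore, SketchMemoryStore, 
storeAttemptAsync, drainAndStop, and so on) is hypothetical, a single-threaded 
executor stands in for the AsyncDispatcher, and raw byte arrays stand in for 
the PB objects.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Small demo driver; all names in this file are hypothetical.
class StateStoreSketch {
    public static void main(String[] args) {
        SketchMemoryStore store = new SketchMemoryStore();
        // App submission: saved synchronously before the client gets a reply.
        store.storeApplication("app_1", new byte[] {1});
        // Attempt launch: saved asynchronously; the callback plays the role of
        // the event that moves the attempt on to its next state.
        store.storeAttemptAsync("app_1", "attempt_1", new byte[] {2},
                () -> System.out.println("attempt_1 saved"));
        store.drainAndStop();
        // On restart, the RM would load this state and resubmit each app.
        System.out.println("apps to recover: " + store.loadApplications().keySet());
        // App finished: its data is removed.
        store.removeApplication("app_1");
    }
}

// Simplified stand-in for the abstract store layer between RM and real stores.
abstract class SketchStateStore {
    // One worker thread handles queued store requests in submission order.
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();

    // Methods a concrete store has to provide.
    protected abstract void storeApplicationState(String appId, byte[] data);
    protected abstract void storeAttemptState(String appId, String attemptId, byte[] data);
    protected abstract void removeApplicationState(String appId);
    public abstract Map<String, byte[]> loadApplications();
    public abstract Map<String, byte[]> loadAttempts(String appId);

    // Blocking store: returns only after the concrete store has saved the data.
    public void storeApplication(String appId, byte[] data) {
        storeApplicationState(appId, data);
    }

    // Non-blocking store: queues the write and returns immediately;
    // onStored runs on the dispatcher thread once the write is done.
    public void storeAttemptAsync(String appId, String attemptId, byte[] data,
                                  Runnable onStored) {
        dispatcher.execute(() -> {
            storeAttemptState(appId, attemptId, data);
            onStored.run();
        });
    }

    public void removeApplication(String appId) {
        removeApplicationState(appId);
    }

    // Lets already-queued writes finish, then stops the dispatcher thread.
    public boolean drainAndStop() {
        dispatcher.shutdown();
        try {
            return dispatcher.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

// In-memory store, analogous in spirit to the memory store used for testing.
class SketchMemoryStore extends SketchStateStore {
    private final Map<String, byte[]> apps = new ConcurrentHashMap<>();
    private final Map<String, Map<String, byte[]>> attempts = new ConcurrentHashMap<>();

    @Override
    protected void storeApplicationState(String appId, byte[] data) {
        apps.put(appId, data);
    }

    @Override
    protected void storeAttemptState(String appId, String attemptId, byte[] data) {
        attempts.computeIfAbsent(appId, k -> new ConcurrentHashMap<>())
                .put(attemptId, data);
    }

    @Override
    protected void removeApplicationState(String appId) {
        apps.remove(appId);
        attempts.remove(appId);
    }

    @Override
    public Map<String, byte[]> loadApplications() {
        return new HashMap<>(apps);
    }

    @Override
    public Map<String, byte[]> loadAttempts(String appId) {
        Map<String, byte[]> stored = attempts.get(appId);
        return stored == null ? new HashMap<>() : new HashMap<>(stored);
    }
}
```

The point of the split is that the caller of the blocking path can safely 
reply to the client once the call returns, while the caller of the 
non-blocking path has to wait for the callback (in the real change, a state 
machine transition) before treating the attempt data as saved.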
                
> Make changes for RM restart phase 1
> -----------------------------------
>
>                 Key: YARN-230
>                 URL: https://issues.apache.org/jira/browse/YARN-230
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: YARN-230.1.patch
>
>
> As described in YARN-128, phase 1 of RM restart puts in place mechanisms to 
> save application state and read them back after restart. Upon restart, the 
> NM's are asked to reboot and the previously running AM's are restarted.
> After this is done, RM HA and work preserving restart can continue in 
> parallel. For more details please refer to the design document in YARN-128

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
