[ 
https://issues.apache.org/jira/browse/YARN-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503228#comment-13503228
 ] 

Tom White commented on YARN-230:
--------------------------------

Overall, this looks great. General feedback:

* Can we make application removal atomic? If the RM shuts down after a 
completed application is removed from the state store, but before the app 
attempts are removed from the store, then the app attempts may be orphaned. 
(There's a comment about it in FileSystemRMStateStore, but no action is taken 
so the attempt files will remain in the store.) It might be better to make 
RMStateStore#removeApplicationState responsible for removing the app attempts 
(i.e. remove removeApplicationAttemptState). This would solve the orphaning 
problem, and it would also make it possible to store the app attempts in a 
directory nested under the application directory, which would be nicer from a 
scaling point of view, and also for someone having to debug the state on the 
filesystem.
* If the RM shuts down before a (successful) completed application is removed 
from the state store, will it be rerun on restart, or will the fact that a 
successful app attempt was stored mean that it doesn't need to? Obviously, the 
second one would be preferable.
* The exceptions thrown by the public methods of RMStateStore should be more 
specific than Exception.
* Let's have a default for yarn.resourcemanager.store.class in 
yarn-default.xml. StoreFactory has MemoryRMStateStore as the default, but 
that's not useful when running on a cluster; FileSystemRMStateStore would be 
better. Similarly it would be good to have the default location for the store 
be a system directory on the default file system. With these two changes folks 
would only need to set yarn.resourcemanager.recovery.enabled to true to enable 
recovery. (We might make that enabled by default at some point too.)
* MemoryRMStateStore#removeApplicationState will fail if asserts are disabled: 
the remove method should be called in a separate statement and assigned to a 
variable which can be checked in the assert. It's worth checking if this 
problem exists elsewhere.
* Naming nit: Store was renamed to RMStateStore, but so StoreFactory should be 
renamed to RMStateStoreFactory.
* Naming nit: zk.rm-state-store rather than zk.rmstatestore for consistency 
with other property names. Also for fs.rmstatestore, and 
zk.rmstatestore.parentpath (parent-path).

                
> Make changes for RM restart phase 1
> -----------------------------------
>
>                 Key: YARN-230
>                 URL: https://issues.apache.org/jira/browse/YARN-230
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: PB-impl.patch, Recovery.patch, Store.patch, Test.patch, 
> YARN-230.1.patch
>
>
> As described in YARN-128, phase 1 of RM restart puts in place mechanisms to 
> save application state and read them back after restart. Upon restart, the 
> NM's are asked to reboot and the previously running AM's are restarted.
> After this is done, RM HA and work preserving restart can continue in 
> parallel. For more details please refer to the design document in YARN-128

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to