[ https://issues.apache.org/jira/browse/YARN-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503228#comment-13503228 ]
Tom White commented on YARN-230: -------------------------------- Overall, this looks great. General feedback: * Can we make application removal atomic? If the RM shuts down after a completed application is removed from the state store, but before the app attempts are removed from the store, then the app attempts may be orphaned. (There's a comment about it in FileSystemRMStateStore, but no action is taken so the attempt files will remain in the store.) It might be better to make RMStateStore#removeApplicationState responsible for removing the app attempts (i.e. remove removeApplicationAttemptState). This would solve the orphaning problem, and it would also make it possible to store the app attempts in a directory nested under the application directory, which would be nicer from a scaling point of view, and also for someone having to debug the state on the filesystem. * If the RM shuts down before a (successful) completed application is removed from the state store, will it be rerun on restart, or will the fact that a successful app attempt was stored mean that it doesn't need to? Obviously, the second one would be preferable. * The exceptions thrown by the public methods of RMStateStore should be more specific than Exception. * Let's have a default for yarn.resourcemanager.store.class in yarn-default.xml. StoreFactory has MemoryRMStateStore as the default, but that's not useful when running on a cluster; FileSystemRMStateStore would be better. Similarly it would be good to have the default location for the store be a system directory on the default file system. With these two changes folks would only need to set yarn.resourcemanager.recovery.enabled to true to enable recovery. (We might make that enabled by default at some point too.) * MemoryRMStateStore#removeApplicationState will fail if asserts are disabled: the remove method should be called in a separate statement and assigned to a variable which can be checked in the assert. It's worth checking if this problem exists elsewhere. * Naming nit: Store was renamed to RMStateStore, but so StoreFactory should be renamed to RMStateStoreFactory. * Naming nit: zk.rm-state-store rather than zk.rmstatestore for consistency with other property names. Also for fs.rmstatestore, and zk.rmstatestore.parentpath (parent-path). > Make changes for RM restart phase 1 > ----------------------------------- > > Key: YARN-230 > URL: https://issues.apache.org/jira/browse/YARN-230 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager > Reporter: Bikas Saha > Assignee: Bikas Saha > Attachments: PB-impl.patch, Recovery.patch, Store.patch, Test.patch, > YARN-230.1.patch > > > As described in YARN-128, phase 1 of RM restart puts in place mechanisms to > save application state and read them back after restart. Upon restart, the > NM's are asked to reboot and the previously running AM's are restarted. > After this is done, RM HA and work preserving restart can continue in > parallel. For more details please refer to the design document in YARN-128 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira