[ https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260441#comment-15260441 ]
Karthik Kambatla commented on YARN-4676:
----------------------------------------

I haven't looked at the code itself, but I looked at the recent discussion around RM restart, and [~rkanter] filled me in on some of the details.

If RM work-preserving restart is not enabled, it should be okay to decommission a node right away. If work-preserving restart is enabled and a node is decommissioned with a timeout, it would be nice to store *when* the decommission was requested, along with the timeout, in the state-store. Note that, in an HA setup, the two RMs could have a clock skew.

Since that work is non-trivial, I am open to doing it in a follow-up JIRA.

> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
>                 Key: YARN-4676
>                 URL: https://issues.apache.org/jira/browse/YARN-4676
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Zhi
>            Assignee: Daniel Zhi
>              Labels: features
>         Attachments: GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch,
> YARN-4676.005.patch, YARN-4676.006.patch, YARN-4676.007.patch,
> YARN-4676.008.patch, YARN-4676.009.patch, YARN-4676.010.patch,
> YARN-4676.011.patch, YARN-4676.012.patch, YARN-4676.013.patch
>
>
> DecommissioningNodeWatcher inside ResourceTrackingService tracks the
> status of DECOMMISSIONING nodes automatically and asynchronously after a
> client/admin makes a graceful decommission request. It tracks their
> status to decide when, after all running containers on the node have
> completed, the node will be transitioned into the DECOMMISSIONED state.
> NodesListManager detects and handles include and exclude list changes to
> kick off decommission or recommission as necessary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
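
The suggestion in the comment above, to persist *when* the decommission was requested together with its timeout, could look roughly like the sketch below. It is only an illustration under stated assumptions: `DecommissionRecord`, its fields, and `remainingMillis` are hypothetical names, not part of any YARN-4676 patch or of the RM state-store API. It shows how an RM that recovers the record might compute the remaining timeout, and how clock skew between two RMs in an HA setup affects the result.

```java
/**
 * Hypothetical record that an RM could persist in the state-store for one
 * DECOMMISSIONING node. Names and fields are assumptions for illustration.
 */
final class DecommissionRecord {
    final String nodeId;
    final long requestedAtMillis; // wall-clock time of the decommission request
    final long timeoutMillis;     // timeout supplied with the request

    DecommissionRecord(String nodeId, long requestedAtMillis, long timeoutMillis) {
        if (timeoutMillis < 0) {
            throw new IllegalArgumentException("timeout must be non-negative");
        }
        this.nodeId = nodeId;
        this.requestedAtMillis = requestedAtMillis;
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Remaining time before the node should be forced to DECOMMISSIONED, as
     * seen by an RM whose clock currently reads nowMillis.
     *
     * If the recovering RM's clock is behind the clock of the RM that wrote
     * the record, the elapsed time would be negative; it is clamped to zero
     * so the node never gets more than the original timeout. A clock that is
     * ahead cannot be detected here and simply shortens the remaining time.
     */
    long remainingMillis(long nowMillis) {
        long elapsed = Math.max(0L, nowMillis - requestedAtMillis);
        return Math.max(0L, timeoutMillis - elapsed);
    }
}
```

A recovering RM would call `remainingMillis(System.currentTimeMillis())` on each recovered record and resume tracking with that value. Storing the absolute request time rather than a countdown keeps the record valid across restarts, at the cost of the skew sensitivity noted in the comment.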