[ https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236422#comment-15236422 ]
Robert Kanter commented on YARN-4676:
-------------------------------------

Sorry for taking so long to review the updated patch. Thanks for doing that. I also deployed a small cluster with the patch and played around with it; everything seemed to work correctly! Here's some more feedback on the 009 patch:
# I'm not sure why the unit tests are failing on Jenkins but not locally (they didn't fail for me either). I'd normally write it off as unrelated, but it's happening consistently on Jenkins, so I'm suspicious. I'm not sure what to do about this.
# No need to remove the empty line from {{ClusterMetrics}}; there are no other changes in that file. Same with {{NodeRemovedSchedulerEvent}}. These just create false changes.
# We should keep the {{resolver}} call in {{NodesListManager#isValidNode}}. The admin might have configured the {{CachedResolver}} instead of the {{DirectResolver}}. Calling {{NetUtils}} directly effectively hardcodes the {{DirectResolver}} to always be used and ignores the {{yarn.resourcemanager.node-ip-cache.expiry-interval-secs}} config.
# In {{NodesListManager#handleExcludeNodeList}}, the recommission case is the same regardless of {{graceful}}. I think we can move that condition up so we don't have to repeat it.
# The existing code at the end of {{RMAdminCLI#refreshNodes}} tries to handle the case where nodes don't gracefully exit by the timeout by issuing a forceful decommission. Given that the RM is taking care of monitoring the decommission status of each node in {{DecommissioningNodesWatcher}}, perhaps we should remove this? I'm concerned about a race condition between the CLI waiting and the RM waiting. The CLI should essentially be polling what the RM is doing here.
# {{DecommissioningNodesWatcher}} has a few places where it catches {{Exception}} and logs its message. Two things:
#- It's more helpful for debugging if the log message also prints out the stack trace. You can do this by passing the {{Exception}} as the last argument (e.g.
{{LOG.info("foo", e)}}).
#- The methods being called in the try block don't declare that they throw any Exceptions. If these methods are expected to throw Exceptions, they should declare them. We should be careful about blindly catching all undeclared Exceptions.
# In {{DecommissioningNodesWatcher}}, shouldn't the {{update}} method be {{synchronized}}?
# Still need docs.

> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
>                 Key: YARN-4676
>                 URL: https://issues.apache.org/jira/browse/YARN-4676
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Zhi
>            Assignee: Daniel Zhi
>              Labels: features
>         Attachments: GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, YARN-4676.005.patch, YARN-4676.006.patch, YARN-4676.007.patch, YARN-4676.008.patch, YARN-4676.009.patch
>
> DecommissioningNodesWatcher inside ResourceTrackingService tracks the status of DECOMMISSIONING nodes automatically and asynchronously after the client/admin makes a graceful decommission request. It tracks DECOMMISSIONING nodes' status to decide when, after all running containers on a node have completed, the node will be transitioned into the DECOMMISSIONED state. NodesListManager detects and handles include and exclude list changes to kick off decommission or recommission as necessary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
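The suggestion to hoist the recommission check in {{NodesListManager#handleExcludeNodeList}} can be sketched like this. This is a minimal illustration of the refactoring, not the actual YARN code; the {{decide}} method and {{Action}} enum are hypothetical stand-ins for the real event-dispatch logic:

```java
// Illustrative sketch only: in the real handleExcludeNodeList the
// recommission branch was duplicated inside both the graceful and
// non-graceful arms. Testing it once, up front, removes the repetition.
public class DecomDispatch {
    public enum Action { RECOMMISSION, GRACEFUL_DECOM, FORCEFUL_DECOM }

    public static Action decide(boolean inExcludeList, boolean graceful) {
        if (!inExcludeList) {
            // Same outcome regardless of 'graceful', so check it first.
            return Action.RECOMMISSION;
        }
        return graceful ? Action.GRACEFUL_DECOM : Action.FORCEFUL_DECOM;
    }
}
```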
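The stack-trace logging point can be illustrated with plain JDK classes (the class and method names below are made up for the example; the patch itself would use its logging framework's {{LOG.info("foo", e)}} overload):

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Sketch of why passing the Throwable to the logger beats logging only
// e.getMessage(): the rendered output keeps the full stack trace, which
// shows where the failure originated.
public class LogWithStackTrace {
    public static String format(String msg, Throwable e) {
        StringWriter sw = new StringWriter();
        e.printStackTrace(new PrintWriter(sw));
        return msg + System.lineSeparator() + sw;
    }

    public static void main(String[] args) {
        try {
            throw new IllegalStateException("node update failed");
        } catch (Exception e) {
            // Logging only e.getMessage() would print just "node update failed";
            // including the Throwable preserves the "at ..." frames.
            System.out.println(format("Failed to update decommissioning node", e));
        }
    }
}
```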
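The {{synchronized}} question can be illustrated with a toy tracker. This is not the real {{DecommissioningNodesWatcher}} API, just an assumed shape showing why an {{update}} method mutating shared state from heartbeat threads needs mutual exclusion:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch: if update() were not synchronized, two heartbeat threads could
// interleave the remove/put sequence on the shared map and corrupt it or
// lose an update. Both accessors lock on the same monitor.
public class NodeStatusTracker {
    private final Map<String, Integer> runningContainers = new HashMap<>();

    public synchronized void update(String nodeId, int containers) {
        if (containers == 0) {
            runningContainers.remove(nodeId);  // node is drained, safe to decommission
        } else {
            runningContainers.put(nodeId, containers);
        }
    }

    public synchronized boolean isDrained(String nodeId) {
        return !runningContainers.containsKey(nodeId);
    }
}
```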