[ 
https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236422#comment-15236422
 ] 

Robert Kanter commented on YARN-4676:
-------------------------------------

Sorry for taking so long to review the updated patch.  Thanks for doing that.  
I also deployed a small cluster with the patch and played around with it; 
everything seemed to work correctly!

Here's some more feedback on the 009 patch:
# I'm not sure why the unit tests are failing on Jenkins but not locally (they 
didn't fail for me either).  I'd normally write it off as unrelated, but it's 
happening consistently on Jenkins so I'm suspicious.  I'm not sure what to do 
about this.
# No need to remove the empty line from {{ClusterMetrics}}; there are no other 
changes in that file.  Same with {{NodeRemovedSchedulerEvent}}.  These just 
create spurious changes in the diff.
# We should keep the {{resolver}} call in {{NodesListManager#isValidNode}}.  
The admin might have configured the {{CachedResolver}} instead of the 
{{DirectResolver}}.  Calling {{NetUtils}} directly effectively hardcodes the 
{{DirectResolver}} to always be used and ignores the 
{{yarn.resourcemanager.node-ip-cache.expiry-interval-secs}} config.  See the 
{{isValidNode}} sketch below.
# In {{NodesListManager#handleExcludeNodeList}}, the recommission case is the 
same regardless of {{graceful}}.  I think we can hoist that check above the 
{{graceful}} branch so we don't have to repeat it; see the refactor sketch below.
# The existing code at the end of {{RMAdminCLI#refreshNodes}} tries to take 
care of the case where nodes haven't gracefully exited by the timeout by issuing 
a forceful decom.  Given that the RM is taking care of monitoring the decom 
status of each node in {{DecommissioningNodesWatcher}}, perhaps we should remove 
this?  I'm concerned about a race condition between the CLI waiting and the RM 
waiting.  The CLI should essentially just be polling what the RM is doing here.
# {{DecommissioningNodesWatcher}} has a few places where it catches 
{{Exception}} and logs its message.  Two things:
#- It's more helpful for debugging if the log message also prints out the 
stacktrace.  You can do this by passing the {{Exception}} as the last argument 
(e.g. {{LOG.info("foo", e)}}); see the logging sketch below.
#- The methods being called in the try block don't declare that they throw any 
Exceptions.  If these methods are expected to throw Exceptions, they should 
declare them.  We should be careful about blindly catching all undeclared 
Exceptions.  
# In {{DecommissioningNodesWatcher}}, shouldn't the {{update}} method be 
{{synchronized}}?  See the sketch below.
# We still need docs.
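
To make #3 concrete, here's a rough sketch of the shape I have in mind for {{isValidNode}}.  The {{Resolver}} interface and fields below are simplified stand-ins, not the actual {{NodesListManager}} code:

{code:java}
import java.util.Set;

public class NodesListManagerSketch {
  // Stand-in for the existing Resolver abstraction; the real NodesListManager
  // picks CachedResolver or DirectResolver based on
  // yarn.resourcemanager.node-ip-cache.expiry-interval-secs.
  interface Resolver {
    String resolve(String hostName);
  }

  private final Resolver resolver;
  private final Set<String> includes;
  private final Set<String> excludes;

  NodesListManagerSketch(Resolver resolver, Set<String> includes,
      Set<String> excludes) {
    this.resolver = resolver;
    this.includes = includes;
    this.excludes = excludes;
  }

  public boolean isValidNode(String hostName) {
    // Keep going through the configured resolver rather than calling
    // NetUtils.normalizeHostName(hostName) directly, so the cached variant
    // (and its expiry interval) still takes effect.
    String ip = resolver.resolve(hostName);
    return (includes.isEmpty() || includes.contains(hostName)
        || includes.contains(ip))
        && !(excludes.contains(hostName) || excludes.contains(ip));
  }
}
{code}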
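
For #4, something along these lines is what I mean -- purely illustrative, with made-up helper names, just showing that the shared recommission branch only needs to be written once instead of in both arms of the {{graceful}} check:

{code:java}
public class ExcludeListRefactorSketch {
  enum Action { RECOMMISSION, GRACEFUL_DECOMMISSION, FORCEFUL_DECOMMISSION }

  // Hypothetical simplification: the recommission case does not depend on
  // 'graceful', so check it first instead of repeating it in both branches.
  static Action decide(boolean excluded, boolean graceful) {
    if (!excluded) {
      return Action.RECOMMISSION;
    }
    return graceful ? Action.GRACEFUL_DECOMMISSION : Action.FORCEFUL_DECOMMISSION;
  }
}
{code}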
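
For #6, the difference is just passing the exception through to the logger (assuming commons-logging, as elsewhere in the RM; the method names here are made up):

{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LoggingSketch {
  private static final Log LOG = LogFactory.getLog(LoggingSketch.class);

  void poll() {
    try {
      doUpdate();  // hypothetical call standing in for the watcher's work
    } catch (Exception e) {
      // Message only -- the stack trace is lost, which makes debugging harder:
      // LOG.info("Failed to update decommissioning status: " + e.getMessage());

      // Pass the exception as the last argument so the stack trace is logged too:
      LOG.info("Failed to update decommissioning status", e);
    }
  }

  private void doUpdate() {
    throw new IllegalStateException("simulated failure");
  }
}
{code}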
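
And for #7, a minimal sketch of why I think {{update}} wants to be {{synchronized}}: the tracking map is written from node updates and read by the poll timer, so unsynchronized access could corrupt it or return stale results.  The fields and signatures below are assumptions, not the patch code:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class DecommissioningWatcherSketch {
  // Shared tracking state, updated from node heartbeats and read by a poll timer.
  private final Map<String, Integer> runningContainersPerNode = new HashMap<>();

  // synchronized so concurrent update() and poll() calls cannot corrupt the map
  // or observe a partially applied update.
  public synchronized void update(String nodeId, int runningContainers) {
    if (runningContainers == 0) {
      runningContainersPerNode.remove(nodeId);
    } else {
      runningContainersPerNode.put(nodeId, runningContainers);
    }
  }

  public synchronized boolean readyToDecommission(String nodeId) {
    return !runningContainersPerNode.containsKey(nodeId);
  }
}
{code}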

> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
>                 Key: YARN-4676
>                 URL: https://issues.apache.org/jira/browse/YARN-4676
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Zhi
>            Assignee: Daniel Zhi
>              Labels: features
>         Attachments: GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, 
> YARN-4676.005.patch, YARN-4676.006.patch, YARN-4676.007.patch, 
> YARN-4676.008.patch, YARN-4676.009.patch
>
>
> DecommissioningNodesWatcher inside ResourceTrackerService tracks the status of 
> DECOMMISSIONING nodes automatically and asynchronously after the client/admin 
> makes a graceful decommission request. It tracks each DECOMMISSIONING node's 
> status to decide when, after all running containers on the node have completed, 
> the node will be transitioned into the DECOMMISSIONED state. NodesListManager 
> detects and handles include and exclude list changes to kick off decommission 
> or recommission as necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
