[ 
https://issues.apache.org/jira/browse/YARN-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260468#comment-15260468
 ] 

Junping Du commented on YARN-4676:
----------------------------------

bq. If RM work-preserving restart is not enabled, it should be okay to 
decommission a node right away. 
Agreed, but that is not today's behavior without this patch. With this patch, 
decommissioning nodes lose their timeout and stay in that state until all 
applications running on them have finished.

bq. If work-preserving restart is enabled and a node is decommissioned with a 
timeout, it would be nice to store when the decommission has been called and 
the timeout in the state-store. Note that, in an HA setup, the two RMs could 
have a clock skew. Since that work is non-trivial, I am open to doing it in a 
follow-up JIRA.
I am really concerned about putting everything into the state-store. I think we 
should avoid storing unnecessary info as much as possible, just as we do when 
recovering applications/nodes on RM restart, shouldn't we? An additional 
store/recovery operation for each NM's decommissioning timeout value sounds too 
heavyweight. 
Actually, I am more interested in Daniel's idea above of combining client-side 
and RM-side tracking, so that the client can track the timeout in case it is 
lost on the RM side. However, I need to look into it further before I have 
more concrete ideas.

> Automatic and Asynchronous Decommissioning Nodes Status Tracking
> ----------------------------------------------------------------
>
>                 Key: YARN-4676
>                 URL: https://issues.apache.org/jira/browse/YARN-4676
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Zhi
>            Assignee: Daniel Zhi
>              Labels: features
>         Attachments: GracefulDecommissionYarnNode.pdf, YARN-4676.004.patch, 
> YARN-4676.005.patch, YARN-4676.006.patch, YARN-4676.007.patch, 
> YARN-4676.008.patch, YARN-4676.009.patch, YARN-4676.010.patch, 
> YARN-4676.011.patch, YARN-4676.012.patch, YARN-4676.013.patch
>
>
> DecommissioningNodeWatcher inside ResourceTrackingService tracks the status 
> of DECOMMISSIONING nodes automatically and asynchronously after a 
> client/admin makes a graceful decommission request. It uses that status to 
> decide when a node, once all running containers on it have completed, is 
> transitioned into the DECOMMISSIONED state. 
> NodesListManager detects and handles include and exclude list changes to kick 
> off decommission or recommission as necessary.
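
As a rough illustration of the tracking described above, here is a minimal Java sketch. It is not the DecommissioningNodeWatcher from the patch: the class name DecommissionTrackerSketch, its methods, and the rule that an expired timeout also completes the decommission are all assumptions made for this example. Time is passed in explicitly so the logic does not depend on any one RM's clock.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; not the DecommissioningNodeWatcher from the patch. */
class DecommissionTrackerSketch {
    enum State { DECOMMISSIONING, DECOMMISSIONED }

    /** Per-node bookkeeping; the fields are illustrative assumptions. */
    private static final class Entry {
        final long deadlineMillis;
        int runningContainers;

        Entry(long deadlineMillis, int runningContainers) {
            this.deadlineMillis = deadlineMillis;
            this.runningContainers = runningContainers;
        }
    }

    private final Map<String, Entry> tracked = new HashMap<>();

    /** Called when a graceful decommission request arrives for a node. */
    void startDecommission(String nodeId, long nowMillis, long timeoutMillis,
                           int runningContainers) {
        tracked.put(nodeId, new Entry(nowMillis + timeoutMillis, runningContainers));
    }

    /** Called on each (simulated) heartbeat with the node's container count. */
    void onHeartbeat(String nodeId, int runningContainers) {
        Entry e = tracked.get(nodeId);
        if (e != null) {
            e.runningContainers = runningContainers;
        }
    }

    /** Decides whether the node can move to DECOMMISSIONED at the given time. */
    State poll(String nodeId, long nowMillis) {
        Entry e = tracked.get(nodeId);
        if (e == null) {
            throw new IllegalArgumentException("not tracked: " + nodeId);
        }
        boolean drained = e.runningContainers == 0;
        // Assumption: once the timeout expires, the node is decommissioned anyway.
        boolean timedOut = nowMillis >= e.deadlineMillis;
        if (drained || timedOut) {
            tracked.remove(nodeId);
            return State.DECOMMISSIONED;
        }
        return State.DECOMMISSIONING;
    }

    boolean isTracked(String nodeId) {
        return tracked.containsKey(nodeId);
    }
}
```

In this sketch the deadline lives only in memory, so it is lost if the tracker is recreated, which is the situation the state-store and client-side tracking discussion in the comment above is about.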



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)