[ 
https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306602#comment-14306602
 ] 

Xuan Gong commented on YARN-914:
--------------------------------

Thanks for the proposal [~djp]

bq. RM is failed over (with HA enabled) when graceful decommission has just 
been triggered. We should make sure the new active RM can carry the action 
forward (how do we keep the decommissioned node list in sync between the 
active and standby RM?)

I believe this is about configuration synchronization between multiple RM 
nodes. Please take a look at https://issues.apache.org/jira/browse/YARN-1666 
and https://issues.apache.org/jira/browse/YARN-1611

bq. With containers of long running services, the timeout may not help but only 
delay the upgrade/reboot process. Shall we skip it and decommission directly in 
this case?

Do we really need to handle "LRS containers" and "short-term containers" 
differently? There are lots of different cases we need to take care of. I 
think we can just handle both in the same way.

bq. Another possibility is to track decommission timeout in RM side, instead of 
NM side - a new decommission services proposed above. Which way is better?

Maybe we need to track the timeout on both the RM side and the NM side. The 
RM can stop the NM if the timeout is reached and it has not received the 
"decommission complete" message from the NM.
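
To make the RM-side half of that concrete, here is a minimal sketch, not 
actual YARN code. The class and method names (DecommissionTimeoutTracker, 
startDecommission, onDecommissionComplete, expire) are hypothetical, and it 
assumes the RM records a deadline per node when graceful decommission starts 
and periodically checks which nodes have passed it without reporting back.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these names exist in YARN.
final class DecommissionTimeoutTracker {
    // nodeId -> absolute deadline (ms) by which the NM must report completion
    private final Map<String, Long> deadlines = new HashMap<>();

    /** RM begins graceful decommission of a node and records its deadline. */
    void startDecommission(String nodeId, long nowMs, long timeoutMs) {
        deadlines.put(nodeId, nowMs + timeoutMs);
    }

    /** NM reported "decommission complete"; returns false if it was not tracked. */
    boolean onDecommissionComplete(String nodeId) {
        return deadlines.remove(nodeId) != null;
    }

    /**
     * Returns the nodes whose deadline has passed without a completion report
     * and stops tracking them. The RM would then stop these NMs itself.
     */
    List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = deadlines.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs >= entry.getValue()) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        DecommissionTimeoutTracker tracker = new DecommissionTimeoutTracker();
        tracker.startDecommission("nm1", 0L, 1000L);
        tracker.startDecommission("nm2", 0L, 1000L);
        tracker.onDecommissionComplete("nm1"); // nm1 finished in time
        System.out.println(tracker.expire(1500L)); // nodes the RM must stop
    }
}
```

The NM would still run its own timer; the RM-side deadline only acts as a 
backstop for an NM that never reports completion.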

> Support graceful decommission of nodemanager
> --------------------------------------------
>
>                 Key: YARN-914
>                 URL: https://issues.apache.org/jira/browse/YARN-914
>             Project: Hadoop YARN
>          Issue Type: Improvement
>    Affects Versions: 2.0.4-alpha
>            Reporter: Luke Lu
>            Assignee: Junping Du
>         Attachments: Gracefully Decommission of NodeManager (v1).pdf
>
>
> When NMs are decommissioned for non-fault reasons (capacity change etc.), 
> it's desirable to minimize the impact to running applications.
> Currently, if an NM is decommissioned, all running containers on the NM need 
> to be rescheduled on other NMs. Furthermore, for finished map tasks, if 
> their map outputs have not been fetched by the reducers of the job, these 
> map tasks will need to be rerun as well.
> We propose to introduce a mechanism to optionally gracefully decommission a 
> node manager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
