[ 
https://issues.apache.org/jira/browse/YARN-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-2331:
-----------------------------
    Attachment: YARN-2331v3.patch

Updated patch to trunk.

bq. Probably, we could set the default value of 
yarn.nodemanager.recovery.supervised to true. Normally, when people add a node 
as an NM, they expect to use this node for a long time. So a restart is expected?

The problem is that if the NM is not being supervised, then when it goes down 
there isn't going to be a timely restart.  That will leave containers unmanaged 
on the node (e.g., they can't be killed by YARN since the NM is down).  The user 
may eventually get around to restarting the NM, but if that takes hours or days 
that doesn't help much.

Before NM restart support, the NM would try to kill all active containers on 
shutdown to prevent this.  With restart support this is undesirable _unless_ the 
NM is going down and isn't going to be restarted in a timely manner (i.e., this 
isn't an upgrade and the NM isn't being supervised).

> Distinguish shutdown during supervision vs. shutdown for rolling upgrade
> ------------------------------------------------------------------------
>
>                 Key: YARN-2331
>                 URL: https://issues.apache.org/jira/browse/YARN-2331
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-2331.patch, YARN-2331v2.patch, YARN-2331v3.patch
>
>
> When the NM is shutting down with restart support enabled there are scenarios 
> we'd like to distinguish and behave accordingly:
> # The NM is running under supervision.  In that case containers should be 
> preserved so the automatic restart can recover them.
> # The NM is not running under supervision and a rolling upgrade is not being 
> performed.  In that case the shutdown should kill all containers since it is 
> unlikely the NM will be restarted in a timely manner to recover them.
> # The NM is not running under supervision and a rolling upgrade is being 
> performed.  In that case the shutdown should not kill all containers since a 
> restart is imminent due to the rolling upgrade and the containers will be 
> recovered.
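
The three cases in the description above reduce to a small decision function. 
The following is only a minimal sketch of that logic, not code from the patch: 
the class name, the method name, and the three boolean inputs are hypothetical, 
and how the NM would actually learn that a rolling upgrade is in progress is 
left out as an assumption.

```java
/**
 * Hypothetical sketch of the shutdown decision described in YARN-2331.
 * All names here are illustrative and are not taken from the patch.
 */
public class ShutdownPolicy {

    /**
     * Decide whether the NM should kill its containers while shutting down.
     *
     * @param recoveryEnabled NM restart support is enabled
     * @param supervised      the NM runs under supervision and will be restarted promptly
     * @param rollingUpgrade  the shutdown is part of a rolling upgrade (assumed to be known)
     * @return true if containers should be killed on shutdown
     */
    public static boolean shouldKillContainersOnShutdown(
            boolean recoveryEnabled, boolean supervised, boolean rollingUpgrade) {
        if (!recoveryEnabled) {
            // Behavior before restart support: always kill active containers.
            return true;
        }
        if (supervised) {
            // Case 1: the automatic restart can recover the containers.
            return false;
        }
        // Case 2: unsupervised and no rolling upgrade, so kill the containers.
        // Case 3: unsupervised during a rolling upgrade, so preserve them.
        return !rollingUpgrade;
    }
}
```

Containers are killed only when no timely restart is expected, which is the 
unsupervised, non-upgrade case.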



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
