[ 
https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037718#comment-14037718
 ] 

Jason Lowe commented on YARN-1341:
----------------------------------

As I was stating earlier, I personally would rather have the NM try to keep 
going.  Restarts should be rare, and I'd rather not force a loss of work by 
taking the NM down instantly when the state store hiccups.  Like you said, it's 
"cheaper" to lose an NM than an RM, which means the ramifications of not being 
able to recover the work on a restart is cheaper than it is for an RM.  If the 
state store is missing some things, we might not be able to recover a localized 
resource, a token, a container, or possibly anything at all.  If we aren't able 
to recover anything then the containers will fail to be reported to the RM when 
the NM registers, and the RM can act accordingly (i.e.: notify the AM that the 
containers are lost).  Or in the worst-case, the state store is so corrupted on 
startup that we don't even survive the NM restart and the NM crashes, which 
would have an end result just like if we took it down when the state store 
failed.

Therefore I'd rather not guarantee that we'll lose work by crashing the NM on 
any store error and instead try to preserve the work we have.  The NM could 
theoretically recover (e.g.: if the error is transient then the next RM key 
store could succeed).  If we take the NM down immediately then we're 
guaranteeing the work is lost.  Is that really better?

Maybe a better approach is to have errors like this trigger an unhealthy state 
for the NM when we have the ability to do a graceful decommission.  Then the RM 
can stop allocating new containers on the node but we can try to finish up the 
containers that are still running on the node.  (Maybe have a configurable 
timeout where at some point where we will take down the NM even if there are 
still active containers.)  Then we can declare the node is acting strangely, 
don't put any more load on it, but we won't destroy work on it instantly when 
the error occurs.  That is outside the scope of this JIRA, but it would allow 
us to react to the error while still trying to avoid destroying work in the 
process.

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, 
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to