[ 
https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045959#comment-14045959
 ] 

Jason Lowe commented on YARN-1341:
----------------------------------

Agree it's not ideal to discuss handling state store errors for all NM 
components in this JIRA.  In general I'd prefer to discuss and address each 
case with the corresponding JIRA, e.g.: application state store errors 
discussed and addressed in YARN-1354, container state store errors in 
YARN-1337, etc.  If we feel there's significant utility to committing a JIRA 
before all the issues are addressed then we can file one or more followup JIRAs 
to track those outstanding issues.  That's the normal process we follow with 
other features/fixes as well.  

So if we follow that process then we're back to the discussion about RM master 
keys not being able to be stored in the state store.  The choices we've 
discussed are:

1) Log an error, update the master key in memory, and continue
2) Log an error, _not_ update the master key in memory, and continue
3) Log an error and tear down the NM

I'd prefer 1) since that is the option that preserves the most work in all 
scenarios I can think of, and I don't know of a scenario where 2) would handle 
it better.  However I could be convinced given the right scenario.  I'd really 
rather avoid 3) since that seems like a severe way to "handle" the error and 
guarantees work is lost.

Oh there is one more handling scenario we briefly discussed where we flag the 
NM as "undesirable".  When that occurs we don't shoot the containers that are 
running, but we avoid adding new containers since the node is having issues 
(i.e.: a drain-decommission).  I feel that would be a separate JIRA since it 
needs YARN-914, and we'd still need to decide how to handle the error until the 
decommission is complete (i.e.: choice 1 or 2 above).

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
>                 Key: YARN-1341
>                 URL: https://issues.apache.org/jira/browse/YARN-1341
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, 
> YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to