[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045959#comment-14045959 ]
Jason Lowe commented on YARN-1341:
----------------------------------

Agree it's not ideal to discuss handling state store errors for all NM components in this JIRA. In general I'd prefer to discuss and address each case with the corresponding JIRA, e.g.: application state store errors discussed and addressed in YARN-1354, container state store errors in YARN-1337, etc. If we feel there's significant utility to committing a JIRA before all the issues are addressed then we can file one or more followup JIRAs to track those outstanding issues. That's the normal process we follow with other features/fixes as well.

So if we follow that process then we're back to the discussion about RM master keys not being able to be stored in the state store. The choices we've discussed are:

1) Log an error, update the master key in memory, and continue
2) Log an error, _not_ update the master key in memory, and continue
3) Log an error and tear down the NM

I'd prefer 1) since that is the option that preserves the most work in all scenarios I can think of, and I don't know of a scenario where 2) would handle it better. However I could be convinced given the right scenario. I'd really rather avoid 3) since that seems like a severe way to "handle" the error and guarantees work is lost.

Oh there is one more handling scenario we briefly discussed where we flag the NM as "undesirable". When that occurs we don't shoot the containers that are running, but we avoid adding new containers since the node is having issues (i.e.: a drain-decommission). I feel that would be a separate JIRA since it needs YARN-914, and we'd still need to decide how to handle the error until the decommission is complete (i.e.: choice 1 or 2 above).
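
To make the difference between choices 1), 2) and 3) concrete, here is a rough sketch of how each would behave when the store write fails. This is not code from any of the attached patches: KeyStore, Policy, and KeyHolder are hypothetical stand-ins invented for illustration, and the master key is reduced to an int id.

```java
import java.io.IOException;
import java.util.logging.Logger;

public class MasterKeyErrorHandlingSketch {

    /** Hypothetical stand-in for the NM state store; not the real Hadoop interface. */
    interface KeyStore {
        void storeMasterKey(int keyId) throws IOException;
    }

    /** The three choices from the discussion, in the same order. */
    enum Policy { UPDATE_AND_CONTINUE, SKIP_UPDATE_AND_CONTINUE, SHUT_DOWN }

    /** Hypothetical holder of the current master key (reduced to an id). */
    static final class KeyHolder {
        private static final Logger LOG = Logger.getLogger("sketch");
        private final KeyStore store;
        private final Policy policy;
        private int currentKeyId;
        private boolean running = true;

        KeyHolder(KeyStore store, Policy policy, int initialKeyId) {
            this.store = store;
            this.policy = policy;
            this.currentKeyId = initialKeyId;
        }

        void onNewMasterKey(int keyId) {
            try {
                store.storeMasterKey(keyId);
            } catch (IOException e) {
                // All three choices log the error first.
                LOG.severe("Unable to store master key " + keyId + ": " + e.getMessage());
                switch (policy) {
                    case UPDATE_AND_CONTINUE:
                        break;            // choice 1: fall through to the in-memory update
                    case SKIP_UPDATE_AND_CONTINUE:
                        return;           // choice 2: keep the old key in memory
                    case SHUT_DOWN:
                        running = false;  // choice 3: tear down (modelled as a flag here)
                        return;
                }
            }
            currentKeyId = keyId;
        }

        int currentKeyId() { return currentKeyId; }
        boolean isRunning() { return running; }
    }

    public static void main(String[] args) {
        // A store that always fails, to exercise the error path.
        KeyStore failing = keyId -> { throw new IOException("simulated store failure"); };
        for (Policy p : Policy.values()) {
            KeyHolder holder = new KeyHolder(failing, p, 1);
            holder.onNewMasterKey(2);
            System.out.println(p + ": key=" + holder.currentKeyId()
                + " running=" + holder.isRunning());
        }
    }
}
```

With choice 1) the in-memory key moves forward even though the store is stale, so the difference from 2) only shows up after a restart, when the key read back from the store is older than the one that was in use.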

> Recover NMTokens upon nodemanager restart
> -----------------------------------------
>
>              Key: YARN-1341
>              URL: https://issues.apache.org/jira/browse/YARN-1341
>          Project: Hadoop YARN
>       Issue Type: Sub-task
>       Components: nodemanager
> Affects Versions: 2.3.0
>         Reporter: Jason Lowe
>         Assignee: Jason Lowe
>      Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch

--
This message was sent by Atlassian JIRA
(v6.2#6252)