[ https://issues.apache.org/jira/browse/YARN-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159488#comment-15159488 ]
Ray Chiang commented on YARN-3607:
----------------------------------

Two suggestions:

1) Since this setting affects all daemons, it makes sense to have one setting per daemon type, such as yarn.resourcemanager.fail-fast and yarn.nodemanager.fail-fast.

2) There are going to be a lot of places in the YARN code where this variable could be checked. I'm thinking the first task/subtask would be to just add the variable definitions now and then let the functionality be added where it's appropriate.

> Allow users to choose between failing the daemons vs failing the apps/containers
> --------------------------------------------------------------------------------
>
>                 Key: YARN-3607
>                 URL: https://issues.apache.org/jira/browse/YARN-3607
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager, resourcemanager, scheduler
>    Affects Versions: 2.7.0
>            Reporter: Karthik Kambatla
>            Assignee: Ray Chiang
>
> We often run into cases where we are faced with the option of failing the
> daemon (fail-fast) vs failing the user's work and keeping the cluster running.
> There is no clear right way to handle these situations - some users would like
> to be conservative and let the daemons run, while others would like to
> fail-fast.
> Today, we handle these case-by-case and go by what the people working on it
> feel is the right way to handle things. Examples include how we handle app
> recovery failures and queue changes on RM restart.
> Users should be able to choose between these two extremes, and have all these
> situations handled the same way.
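To make the first suggestion in the comment concrete, here is a minimal sketch of what the per-daemon variable definitions and a single check might look like. Only the two property names come from the comment. The class name, the helper methods, the default of false, and the use of java.util.Properties as a stand-in for the Hadoop configuration object are all assumptions made for illustration, not the actual YARN code.

```java
import java.util.Properties;

// Hypothetical sketch: not part of the YARN code base.
public class FailFastSketch {
    // Property names as proposed in the comment.
    static final String RM_FAIL_FAST = "yarn.resourcemanager.fail-fast";
    static final String NM_FAIL_FAST = "yarn.nodemanager.fail-fast";

    // Assumed default: keep the daemon running unless the user opts in.
    static final boolean DEFAULT_FAIL_FAST = false;

    // Reads the per-daemon flag, falling back to the assumed default.
    static boolean shouldFailFast(Properties conf, String key) {
        String value = conf.getProperty(key, String.valueOf(DEFAULT_FAIL_FAST));
        return Boolean.parseBoolean(value);
    }

    // One place where the flag could be checked: either rethrow so the
    // daemon goes down, or report that only the app/container should fail.
    static String handleError(Properties conf, String key, RuntimeException e) {
        if (shouldFailFast(conf, key)) {
            throw e; // fail the daemon
        }
        return "failing app/container only: " + e.getMessage();
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(RM_FAIL_FAST, "true");

        System.out.println("RM fail-fast: " + shouldFailFast(conf, RM_FAIL_FAST));
        System.out.println("NM fail-fast: " + shouldFailFast(conf, NM_FAIL_FAST));
        System.out.println(handleError(conf, NM_FAIL_FAST,
            new IllegalStateException("recovery failed")));
    }
}
```

Defining the keys and defaults first, as the second suggestion proposes, would let individual call sites adopt a check like handleError one at a time.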