[ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740495#comment-13740495 ]
Karthik Kambatla commented on YARN-1055:
----------------------------------------

Thinking more about it, the issue is not limited to RM failure. It also occurs when a node running the launcher goes down. The underlying issue seems to be in handling the dependency between AMs, and in wanting to tolerate failures of some of these AMs and not others.

Given that adding the config won't solve the issue completely, I agree that it is not a good idea to fix it for RM restart alone. Thanks Bikas, Vinod, Hitesh, Alejandro for the detailed discussion.

The issue, however, exists with dependent AMs and needs to be handled, maybe in Oozie for now? In the long term, would it make any sense for YARN to support inter-dependent AMs?

> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery
> for AM and RM currently relies on the max-attempts config; tolerating AM
> failures requires it to be > 1 and tolerating RM failure/restart requires it
> to be = 1.
> We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira