[ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737651#comment-13737651 ]
Vinod Kumar Vavilapalli commented on YARN-1055:
-----------------------------------------------

Irrespective of RM restart, any MR AM can have multiple copies of itself running at the same time - think of a network partition. We've done enough work in the recent past to avoid issues when concurrent AMs run for the same application. So strictly from the MR AM point of view, this is not a problem even with RM restarts. The same work has to be done for all AMs; YARN really cannot fix application issues in case of split-brain problems due to network partitions.

From the launcher point of view, clearly there is work needed on the Oozie side to make the launcher itself not restart jobs from scratch. But till that happens, Oozie needs to set max-attempts to 1 for the launcher.

> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>
> Key: YARN-1055
> URL: https://issues.apache.org/jira/browse/YARN-1055
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.1.0-beta
> Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery
> for AM and RM currently relies on the max-attempts config; tolerating AM
> failures requires it to be > 1 and tolerating RM failure/restart requires it
> to be = 1.
> We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
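
The request for "two separate configs" in the issue description can be illustrated with a rough sketch. This is only a toy model, not YARN code: the names `Cause`, `may_relaunch_shared`, `may_relaunch_split`, and the `limits` mapping are all hypothetical and do not correspond to any existing YARN setting. It only shows why one shared max-attempts value cannot treat AM failures and RM restarts differently, while per-cause limits can.

```python
from enum import Enum


class Cause(Enum):
    """Why the previous application attempt ended (hypothetical labels)."""
    AM_FAILURE = "am_failure"
    RM_RESTART = "rm_restart"


def may_relaunch_shared(attempts_so_far: int, max_attempts: int) -> bool:
    """Single shared limit: the cause of the last attempt's end is ignored."""
    return attempts_so_far < max_attempts


def may_relaunch_split(cause: Cause, relaunches_for_cause: int,
                       limits: dict) -> bool:
    """Per-cause limits: each cause has its own relaunch budget (assumed design)."""
    return relaunches_for_cause < limits[cause]


# With one shared limit of 1, no second attempt is started for either cause.
print(may_relaunch_shared(attempts_so_far=1, max_attempts=1))

# With split limits, an AM failure can be retried while an RM restart is not.
limits = {Cause.AM_FAILURE: 1, Cause.RM_RESTART: 0}
print(may_relaunch_split(Cause.AM_FAILURE, 0, limits))
print(may_relaunch_split(Cause.RM_RESTART, 0, limits))
```

In this model a launcher that must not be restarted from scratch after an RM restart would get a budget of 0 for `Cause.RM_RESTART` while still keeping a retry for AM failures, which a single shared counter cannot express.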