[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart

Vinod Kumar Vavilapalli (JIRA) Mon, 12 Aug 2013 17:48:38 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737651#comment-13737651
 ]


Vinod Kumar Vavilapalli commented on YARN-1055:
-----------------------------------------------

Irrespective of RM restart, any MR AM can have multiple copies of itself 
running at the same time - think of a network partition. We've done enough work 
in the recent past to avoid issues when concurrent AMs run for the same 
application. So strictly from the MR AM's point of view, this is not a problem 
even with RM restarts. The same work has to be done for all AMs; after all, 
YARN really cannot fix application issues in case of split-brain problems due 
to network partitions.

From the launcher's point of view, clearly there is work needed on the Oozie 
side to make the launcher itself not restart jobs from scratch. But until that 
happens, Oozie needs to set max-attempts to 1 for the launcher.
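
To make the coupling concrete, here is a toy model (not YARN's actual code; 
the names Cause and may_start_new_attempt are hypothetical) of a single 
max-attempts cap deciding whether a new attempt may start. It assumes, as the 
issue description states, that the same cap applies whether the previous 
attempt ended because of an AM failure or an RM restart.

```python
from enum import Enum


class Cause(Enum):
    """Why the previous attempt ended (hypothetical labels for illustration)."""
    AM_FAILURE = "am_failure"
    RM_RESTART = "rm_restart"


def may_start_new_attempt(attempts_so_far: int, max_attempts: int, cause: Cause) -> bool:
    """Toy model of a single cap on application attempts.

    The cause is deliberately ignored: one cap governs both cases, which is
    the coupling the issue proposes to split into two separate configs.
    """
    return attempts_so_far < max_attempts


if __name__ == "__main__":
    # One attempt has already run; compare a cap of 1 with a cap of 2.
    for cap in (1, 2):
        for cause in Cause:
            allowed = may_start_new_attempt(1, cap, cause)
            print(f"max-attempts={cap} cause={cause.value} new attempt allowed={allowed}")
```

With a cap of 1 the launcher is never started again for either cause, which 
is why it is the safe setting until the launcher can avoid restarting jobs 
from scratch; the cost is that a plain AM failure is not retried either.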
                
> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery 
> for AM and RM currently relies on the max-attempts config; tolerating AM 
> failures requires it to be > 1 and tolerating RM failure/restart requires it 
> to be = 1.
> We should handle these two differently, with two separate configs.

