[ 
https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740495#comment-13740495
 ] 

Karthik Kambatla commented on YARN-1055:
----------------------------------------

Thinking more about it, the issue is not limited to RM failure. It also occurs 
when a node running the launcher goes down. The underlying issue seems to be 
handling the dependency between AMs while tolerating failures of some of these 
AMs and not others.

Given that adding the config won't solve the issue completely, I agree that it 
is not a good idea to fix it for RM restart alone. Thanks Bikas, Vinod, Hitesh, 
Alejandro for the detailed discussion.

The issue, however, exists with dependent AMs and needs to be handled, maybe in 
Oozie for now? In the long term, would it make any sense for YARN to support 
inter-dependent AMs?

                
> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery 
> for AM and RM currently relies on the max-attempts config; tolerating AM 
> failures requires it to be > 1 and tolerating RM failure/restart requires it 
> to be = 1.
> We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira