[jira] [Commented] (YARN-1055) Handle app recovery differently for AM failures and RM restart

Bikas Saha (JIRA) Thu, 15 Aug 2013 09:47:13 -0700

    [ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13741165#comment-13741165 ]


Bikas Saha commented on YARN-1055:
----------------------------------

First of all, as folks have already agreed, this is fundamentally an Oozie 
problem. I don't want to add an option to YARN that does not make sense for 
YARN by itself. If YARN needs to hack a workaround to fix an Oozie problem, I 
would also like to see what Oozie is doing on its side of the bargain. What is 
the Oozie jira that fixes this fundamental problem with Oozie?

With the correct settings, this may be a problem only on rare occasions for an 
Oozie workflow, when an action-am node crashes. IMO it's an OK compromise for 
the short term while YARN is still not GA.

This issue has existed since YARN started and since we started working on RM 
restart. If it hasn't been a catastrophic issue until now, then IMO it can wait 
a while longer until we complete YARN-556. RM restart is work in active 
progress, and I don't understand why we need to hack an API together when we 
are already tracking a proper solution in YARN-556. YARN and Hadoop-1 are 
different enough that 1-1 regression matching may not always make sense. Even 
when it does, it will be a regression only when YARN goes GA. Until then, all 
of this is work in progress, and users need to be aware of limitations that are 
known and being fixed. The cornerstone of the beta release that we have all 
worked so hard for is a viable and stable API that we want to support. Adding a 
short-term API would go against the basic premise of the beta release.

Any workaround, stopgap, etc. requires a code change and maintenance of that 
code through future code changes. The request here is for an additional API in 
AppSubmissionContext that helps Oozie work around its lack of book-keeping. 
Once YARN goes out with beta, this API will have to be maintained forever, 
since removing an API is backwards incompatible. Given that we are already 
committed to fixing this via YARN-556, adding a short-term API that will need 
to be maintained forever is a disaster, and I don't see enough value being 
added to suffer through it. We are better off not spending more time on this 
and devoting that energy to things like YARN-556 that make real improvements 
for everyone.

I really hope this clarifies my position and assures you that we are committed 
to solving the problem in the correct manner.
                
> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery 
> for AM and RM currently relies on the max-attempts config; tolerating AM 
> failures requires it to be > 1 and tolerating RM failure/restart requires it 
> to be = 1.
> We should handle these two differently, with two separate configs.

