[ https://issues.apache.org/jira/browse/YARN-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13741165#comment-13741165 ]
Bikas Saha commented on YARN-1055:
----------------------------------

First of all, as folks have already agreed, this is fundamentally an Oozie problem. I don't want to add an option to YARN that does not make sense for YARN by itself. If YARN needs to hack a workaround to fix an Oozie problem, I would also like to see what Oozie is doing on its side of the bargain. What is the Oozie jira that fixes this fundamental problem with Oozie?

With the correct settings, this may be a problem only on rare occasions for an Oozie workflow, when an action-am node crashes. IMO it's an OK compromise for the short term while YARN is still not GA. This issue has existed since YARN started and since we started working on RM restart. If it hasn't been a catastrophic issue until now, then IMO it can wait some more time until we complete YARN-556. RM restart is work in active progress, and I don't understand why we need to hack an API together when we are already tracking a proper solution in YARN-556.

YARN and Hadoop-1 are different enough that 1-1 regression matching may not always make sense. Even when it does, it will be a regression only when YARN goes GA. Until then all of this is work in progress, and users need to be aware of limitations that are known and being fixed.

The cornerstone of the beta release that we have all worked so hard for is a viable and stable API that we want to support. Adding a short-term API would go against the basic premise of the beta release. Any workaround, stopgap, etc. requires code changes and maintenance of that code through future changes. The request here is for an additional API in AppSubmissionContext that helps Oozie work around its lack of book-keeping. Once YARN goes out with beta, this API will have to be maintained forever, since removing an API is backwards incompatible.
Given that we are already committed to fixing this via YARN-556, adding a short-term API that will need to be maintained forever is a disaster, and I don't see enough value being added to suffer through it. We are better off not spending more time on this and devoting that energy to things like YARN-556 that make real improvements for everyone.

I really hope this clarifies my position and assures you that we are committed to solving the problem in the correct manner.

> Handle app recovery differently for AM failures and RM restart
> --------------------------------------------------------------
>
>                 Key: YARN-1055
>                 URL: https://issues.apache.org/jira/browse/YARN-1055
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.1.0-beta
>            Reporter: Karthik Kambatla
>
> Ideally, we would like to tolerate container, AM, RM failures. App recovery
> for AM and RM currently relies on the max-attempts config; tolerating AM
> failures requires it to be > 1 and tolerating RM failure/restart requires it
> to be = 1.
> We should handle these two differently, with two separate configs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira