[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception

Wilfred Spiegelenburg (Jira) Tue, 14 Jan 2020 14:10:14 -0800


    [ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015435#comment-17015435
 ]


Wilfred Spiegelenburg commented on YARN-7913:
---------------------------------------------

Thank you [~snemeth] for looking at the patch
1) yes it is a different approach: we now do a proper restore instead of 
failing the app and or crashing the RM

2) removed that line it does not make sense to throw errors and fail the app 
but not fail the RM

3) You are correct in your assumption about the flag. It is checked in the case 
you are mentioning. This is the full change in that area:
{code}
487             if (!isAppRecovering) {
488               rejectApplicationWithMessage(applicationId,
489                   queueName + " is not a leaf queue");
490               return;
491             }
492             // app is recovering we do not want to fail the app now as it 
was there
493             // before we started the recovery. Add it to the recovery queue:
494             // dynamic queue directly under root, no ACL needed (auto clean 
up)
495             queueName = "root.recovery";
496             queue = queueMgr.getLeafQueue(queueName, true, applicationId);
{code}
The flag is already checked in line 487 of the change. When we get to the 
create of the recovery queue (line 492-496) we are recovering, if we are not 
recovering we have rejected the app and left the method using the return in 
line 490.

4) will fix the comment

5) Yes we only want to check newly added apps that are _not_ recovering. When 
we recover there is a large chance that there are no NMs registered. When we 
use the % resource setup for the maximum size of a queue then the size check of 
the AM resource will fail. Since the AM was/is already running we need to only 
perform that check if we are not recovering the app.

6) The largest number of statements inside an if..else I can see is 6 lines. 
Not sure if refactoring the method will make it any clearer. I also don't see 
any consistency in the checks that would allow us to change the flow easily and 
improve clarity.

> Improve error handling when application recovery fails with exception
> ---------------------------------------------------------------------
>
>                 Key: YARN-7913
>                 URL: https://issues.apache.org/jira/browse/YARN-7913
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 3.0.0
>            Reporter: Gergo Repas
>            Assignee: Wilfred Spiegelenburg
>            Priority: Major
>         Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception

Reply via email to