[jira] [Comment Edited] (YARN-7913) Improve error handling when application recovery fails with exception

Szilard Nemeth (Jira) Tue, 14 Jan 2020 03:07:29 -0800


    [ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015009#comment-17015009
 ]


Szilard Nemeth edited comment on YARN-7913 at 1/14/20 11:06 AM:
----------------------------------------------------------------

Hi [~wilfreds],

Has read through all comments and checked your patch.
1. IIUC, your approach is a new one compared to Gergo's previous PoC patch, 
right?

2. If I was just checking the description: "The point of this ticket is to 
improve the error handling and reduce the number of passive -> active RM 
transition attempts (solving the above described failure scenario isn't in 
scope).", it was not in-line with the patch you uploaded.
I think it would be better to remove this statement or add something to the 
desription that describes what has been done recently with your patch. This 
way, we could make a better experience of future readers of this jira.
What do you think?


3. FairScheduler#addApplication is invoked by processing the 
SchedulerEventType#APP_ADDED event. This event type is used by 
AppAddedSchedulerEvent. 
The calls to constructors of this event class from RMAppImpl are happening when 
an application is added or recovered. I can see the following calls in 
RmAppImpl: 

{code}
//1. 
app.handler.handle(
          new AppAddedSchedulerEvent(app.user, app.submissionContext, false,
              app.applicationPriority, app.placementContext));

//2. 
app.scheduler.handle(
            new AppAddedSchedulerEvent(app.user, app.submissionContext, false,
                app.applicationPriority, app.placementContext));

//3. 
 // Add application to scheduler synchronously to guarantee scheduler
      // knows applications before AM or NM re-registers.
      app.scheduler.handle(
          new AppAddedSchedulerEvent(app.user, app.submissionContext, true,
              app.applicationPriority, app.placementContext));
{code}

The boolean parameter is set to isAppRecovering field of the event, so this is 
why I came to the conclusion that apps being created or recovered are both sent 
to the event dispatcher.

So based on the above statement, the code you added to 
FairScheduler#addApplication seems a bit incorrect: 
{code}

// app is recovering we do not want to fail the app now as it was there
        // before we started the recovery. Add it to the recovery queue:
        // dynamic queue directly under root, no ACL needed (auto clean up)
        queueName = "root.recovery";
        queue = queueMgr.getLeafQueue(queueName, true, applicationId);

{code}

Shouldn't we need to check the isAppRecovering flag and only run this code 
block if the flag is true, meaning the app is recovering?


4. Nit: The comment has a missing comma at least: 
{code}
// Skip ACL check for recovering applications: they have been accepted
// in the queue already recovery should not break that.
{code}

5. One thing I don't fully understand: 
{code}
// when recovering the NMs might not have registered and we could have
// no resources in the queue, the app is already running and has thus
// passed all these checks, skip them now.
if (!isAppRecovering && rmApp != null &&
          rmApp.getAMResourceRequests() != null) {
          ....
{code}
Is this condition for checking the apps that are NOT recovering?

6. I can imagine some improvement that could be filed as a follow-up jira: I 
was having a bit hard time to have an understanding of the conditions inside 
FairScheduler#addApplication. 
For example, the statements inside the if-else conditions could be extracted to 
methods with meaningful names, so code readers would understand what happens, 
right away by reading the code. If they are interested in details of a 
particular case, they could open the method anyway. Can you file a jira for 
this?

thanks!



was (Author: snemeth):
Hi [~wilfreds],

Has read through all comments and checked your patch.
1. IIUC, your approach is a new one compared to Gergo's previous PoC patch, 
right?

2. If I was just checking the description: "The point of this ticket is to 
improve the error handling and reduce the number of passive -> active RM 
transition attempts (solving the above described failure scenario isn't in 
scope).", it was not in-line with the patch you uploaded.
I think it would be better to remove this statement or add something to the 
desription that describes what has been done recently with your patch. This 
way, we could make a better experience of future readers of this jira.
What do you think?


3. FairScheduler#addApplication is invoked by processing the 
SchedulerEventType#APP_ADDED event. This event type is used by 
AppAddedSchedulerEvent. 
The calls to constructors of this event class from RMAppImpl are happening when 
an application is added or recovered. I can see the following calls in 
RmAppImpl: 

{code}
1. 
app.handler.handle(
          new AppAddedSchedulerEvent(app.user, app.submissionContext, false,
              app.applicationPriority, app.placementContext));

2. app.scheduler.handle(
            new AppAddedSchedulerEvent(app.user, app.submissionContext, false,
                app.applicationPriority, app.placementContext));

3. 
 // Add application to scheduler synchronously to guarantee scheduler
      // knows applications before AM or NM re-registers.
      app.scheduler.handle(
          new AppAddedSchedulerEvent(app.user, app.submissionContext, true,
              app.applicationPriority, app.placementContext));
{code}

The boolean parameter is set to isAppRecovering field of the event, so this is 
why I came to the conclusion that apps being created or recovered are both sent 
to the event dispatcher.

So based on the above statement, the code you added to 
FairScheduler#addApplication seems a bit incorrect: 
{code}

// app is recovering we do not want to fail the app now as it was there
        // before we started the recovery. Add it to the recovery queue:
        // dynamic queue directly under root, no ACL needed (auto clean up)
        queueName = "root.recovery";
        queue = queueMgr.getLeafQueue(queueName, true, applicationId);

{code}

Shouldn't we need to check the isAppRecovering flag and only run this code 
block if the flag is true, meaning the app is recovering?


4. Nit: The comment has a missing comma at least: 
{code}
// Skip ACL check for recovering applications: they have been accepted
// in the queue already recovery should not break that.
{code}

5. One thing I don't fully understand: 
{code}
// when recovering the NMs might not have registered and we could have
// no resources in the queue, the app is already running and has thus
// passed all these checks, skip them now.
if (!isAppRecovering && rmApp != null &&
          rmApp.getAMResourceRequests() != null) {
          ....
{code}
Is this condition for checking the apps that are NOT recovering?

6. I can imagine some improvement that could be filed as a follow-up jira: I 
was having a bit hard time to have an understanding of the conditions inside 
FairScheduler#addApplication. 
For example, the statements inside the if-else conditions could be extracted to 
methods with meaningful names, so code readers would understand what happens, 
right away by reading the code. If they are interested in details of a 
particular case, they could open the method anyway. Can you file a jira for 
this?

thanks!


> Improve error handling when application recovery fails with exception
> ---------------------------------------------------------------------
>
>                 Key: YARN-7913
>                 URL: https://issues.apache.org/jira/browse/YARN-7913
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 3.0.0
>            Reporter: Gergo Repas
>            Assignee: Wilfred Spiegelenburg
>            Priority: Major
>         Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.
> _The point of this ticket is to improve the error handling and reduce the 
> number of passive -> active RM transition attempts (solving the above 
> described failure scenario isn't in scope)._



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (YARN-7913) Improve error handling when application recovery fails with exception

Reply via email to