[jira] [Commented] (YARN-4000) RM crashes with NPE if leaf queue becomes parent queue during restart

Varun Saxena (JIRA) Thu, 15 Oct 2015 12:28:43 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959474#comment-14959474
 ]


Varun Saxena commented on YARN-4000:
------------------------------------

[~jianhe], I now understand what you meant when you said this would be a 
problem in the regular case.

When doneApplicationAttempt is called, we mark the attempt in the scheduler as 
stopped (we set SchedulerApplicationAttempt#isStopped to true).
In AbstractYarnScheduler#recoverContainersOnNode we kill orphan containers 
if the scheduler attempt is stopped, which will be the case in the scenario 
mentioned above, except when the application is marked to keep containers 
across application attempts.
{code}

      if (!rmApp.getApplicationSubmissionContext()
        .getKeepContainersAcrossApplicationAttempts()) {
        // Do not recover containers for stopped attempt or previous attempt.
        if (schedulerAttempt.isStopped()
            || !schedulerAttempt.getApplicationAttemptId().equals(
              container.getContainerId().getApplicationAttemptId())) {
          LOG.info("Skip recovering container " + container
              + " for already stopped attempt.");
          killOrphanContainerOnNode(nm, container);
          continue;
        }
      }
{code}

So if containers are kept across application attempts, we should probably check 
whether the RMApp is being killed and, if it is, not recover the containers. 
This is not directly related to this JIRA, though. I can raise a separate JIRA 
for it and handle it there. Thoughts?
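
To make the proposal concrete, here is a rough sketch of the decision I have in 
mind. It is not a patch: ContainerRecoveryCheck, shouldRecoverContainer and the 
AppState enum are hypothetical stand-ins rather than the real RM classes, and 
the enum is reduced to the few states that matter for this discussion.

```java
/**
 * Hypothetical, self-contained model of the recovery decision discussed above.
 * None of these types are Hadoop classes; they only mirror the logic.
 */
public class ContainerRecoveryCheck {

    /** Simplified stand-in for the application states relevant here (assumption). */
    public enum AppState { RUNNING, KILLING, KILLED }

    /**
     * Returns true if a container reported by a node should be recovered,
     * false if it should be treated as an orphan and killed.
     */
    public static boolean shouldRecoverContainer(
            boolean keepContainersAcrossAttempts,
            boolean attemptStopped,
            boolean containerBelongsToCurrentAttempt,
            AppState appState) {
        if (!keepContainersAcrossAttempts) {
            // Existing behaviour: skip containers of a stopped or previous attempt.
            return !attemptStopped && containerBelongsToCurrentAttempt;
        }
        // Proposed extra check: even when containers are kept across attempts,
        // do not recover them while the application is being killed.
        return appState != AppState.KILLING;
    }
}
```

With this, a container of an app that keeps containers across attempts is 
still recovered while the app is running, but is skipped once the app is 
being killed.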

> RM crashes with NPE if leaf queue becomes parent queue during restart
> ---------------------------------------------------------------------
>
>                 Key: YARN-4000
>                 URL: https://issues.apache.org/jira/browse/YARN-4000
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-4000.01.patch, YARN-4000.02.patch, 
> YARN-4000.03.patch, YARN-4000.04.patch, YARN-4000.05.patch, YARN-4000.06.patch
>
>
> This is a similar situation to YARN-2308.  If an application is active in 
> queue A and then the RM restarts with a changed capacity scheduler 
> configuration where queue A becomes a parent queue to other subqueues then 
> the RM will crash with a NullPointerException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
