Daryn Sharp created YARN-2964:
---------------------------------

             Summary: RM prematurely cancels tokens for jobs that submit jobs (oozie)
                 Key: YARN-2964
                 URL: https://issues.apache.org/jira/browse/YARN-2964
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.6.0
            Reporter: Daryn Sharp
            Priority: Critical


The RM used to globally track the unique set of tokens for all apps.  It 
remembered the first job that was submitted with the token.  The first job 
controlled the cancellation of the token.  This prevented completion of 
sub-jobs from canceling tokens used by the main job.
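
The old behavior can be pictured with a toy model.  This is only a sketch, not
the RM's actual code: the class name GlobalTokenTracker, its methods, and the
app and token names are all hypothetical.  It assumes one global map from each
token to the first app that submitted it, and that only that app's completion
cancels the token.

```python
class GlobalTokenTracker:
    """Toy model (hypothetical): one global map from token to its first submitter."""

    def __init__(self):
        self.owner = {}         # token id -> app id of the first app that submitted it
        self.cancelled = set()  # token ids that have been cancelled

    def submit(self, app_id, tokens):
        for token in tokens:
            # setdefault keeps the first submitter; later apps do not take ownership.
            self.owner.setdefault(token, app_id)

    def finish(self, app_id):
        # Only tokens owned by the finishing app are cancelled.
        for token, owner in list(self.owner.items()):
            if owner == app_id:
                del self.owner[token]
                self.cancelled.add(token)


tracker = GlobalTokenTracker()
tracker.submit("launcher", ["hdfs-token"])    # main job submits first
tracker.submit("sub-job-1", ["hdfs-token"])   # sub-job reuses the same token
tracker.finish("sub-job-1")
cancelled_after_sub_job = set(tracker.cancelled)
tracker.finish("launcher")
cancelled_after_launcher = set(tracker.cancelled)
```

In this model the sub-job's completion leaves the token alone, and the token is
cancelled only once the launcher itself finishes.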

As of YARN-2704, the RM now tracks tokens on a per-app basis.  There is no 
notion of the first/main job.  This results in sub-jobs canceling tokens and 
failing the main job and other sub-jobs.  It also appears to schedule multiple 
redundant renewals.
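
A toy model of per-app tracking shows how the failure could arise.  Again this
is a sketch under assumptions, not the code introduced by YARN-2704: the class
PerAppTokenTracker and its fields are hypothetical.  It assumes each app keeps
its own token set, that finishing an app cancels everything in that set, and
that one renewal is scheduled per (app, token) pair.

```python
class PerAppTokenTracker:
    """Toy model (hypothetical): tokens are tracked separately for every app."""

    def __init__(self):
        self.tokens_by_app = {}  # app id -> set of token ids
        self.renewals = []       # one (app id, token id) entry per scheduled renewal
        self.cancelled = set()   # token ids that have been cancelled

    def submit(self, app_id, tokens):
        self.tokens_by_app[app_id] = set(tokens)
        for token in tokens:
            # Assumption: a renewal is scheduled per app, even for a shared token.
            self.renewals.append((app_id, token))

    def finish(self, app_id):
        # No notion of a first/main job: every token of the finishing app is cancelled.
        self.cancelled |= self.tokens_by_app.pop(app_id, set())

    def usable(self, token):
        return token not in self.cancelled


tracker = PerAppTokenTracker()
tracker.submit("launcher", ["hdfs-token"])
tracker.submit("sub-job-1", ["hdfs-token"])
tracker.finish("sub-job-1")
launcher_token_usable = tracker.usable("hdfs-token")
renewal_count = len(tracker.renewals)
```

Here the launcher is still running, yet its token is no longer usable once the
sub-job finishes, and the shared token has two renewals scheduled for it.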

The issue is not immediately obvious because the RM cancels tokens ~10 min 
(the NM liveliness interval) after log aggregation completes.  As a result, an 
oozie job (e.g. pig) that launches many sub-jobs over time will fail if any 
sub-job is launched >10 min after any other sub-job completes.  If all other 
sub-jobs complete within that 10 min window, then the issue goes unnoticed.
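
The timing condition can be checked with a small calculation.  This is a
simplified sketch: the function first_failing_launch and the sample timelines
are hypothetical, times are in minutes, log aggregation is assumed to finish at
the same moment the sub-job completes, and the delay is taken to be exactly 10
min.

```python
CANCEL_DELAY_MIN = 10  # assumed delay between sub-job completion and token cancellation


def first_failing_launch(sub_jobs, delay=CANCEL_DELAY_MIN):
    """Return the launch time of the first sub-job started after cancellation.

    sub_jobs is a list of (launch_minute, planned_completion_minute) pairs.
    Returns None when every sub-job launches before the token is cancelled.
    """
    if not sub_jobs:
        return None
    # The earliest completion triggers the earliest cancellation.
    cancel_at = min(done for _, done in sub_jobs) + delay
    late = sorted(launch for launch, _ in sub_jobs if launch > cancel_at)
    return late[0] if late else None


# First sub-job completes at minute 5, so the token is cancelled at minute 15.
spread_out = first_failing_launch([(0, 5), (3, 12), (20, 25)])
close_together = first_failing_launch([(0, 5), (3, 12), (8, 14)])
```

With these inputs the sub-job launched at minute 20 is the first to hit a
cancelled token, while the second timeline launches everything before minute 15
and would not expose the problem.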



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
