[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263197#comment-13263197
 ] 

Ravi Prakash commented on MAPREDUCE-4088:
-----------------------------------------

Courtesy Koji

{quote}
Each TaskTracker has a single taskCleanup thread that takes work from the
tasksToCleanup queue.

For each task to clean up, it first calls
  checkJobStatusAndWait(action);

And inside
 {noformat}
  private void checkJobStatusAndWait(TaskTrackerAction action)
...
    synchronized (runningJobs) {
      rjob = runningJobs.get(jobId);
    }
    if (rjob != null) {
      synchronized (rjob) {
        while (rjob.localizing) {
          rjob.wait();
        }
      }
    }
 {noformat}

So this thread waits for as long as the task's job is being localized.
If even one task hangs during localization, all cleanup on the node stops.
{quote}
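
To make the stall concrete, here is a minimal, self-contained sketch of the pattern Koji describes. It is not TaskTracker code: the class CleanupStallDemo, the simplified RunningJob, the job ids "jobA"/"jobB", and the cleanedCount helper are all hypothetical names for illustration, and job ids stand in for TaskTrackerAction objects. Only the shape of checkJobStatusAndWait follows the snippet quoted above. A single cleaner thread drains the queue in order; when the first entry belongs to a job that never leaves the localizing state, the entry behind it is never reached.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class CleanupStallDemo {
    // Hypothetical stand-in for the TaskTracker's per-job record.
    static class RunningJob {
        boolean localizing;
        RunningJob(boolean localizing) { this.localizing = localizing; }
    }

    // Same shape as the quoted method: block while the job is localizing.
    static void checkJobStatusAndWait(Map<String, RunningJob> runningJobs, String jobId)
            throws InterruptedException {
        RunningJob rjob;
        synchronized (runningJobs) {
            rjob = runningJobs.get(jobId);
        }
        if (rjob != null) {
            synchronized (rjob) {
                while (rjob.localizing) {
                    rjob.wait();
                }
            }
        }
    }

    // Queues cleanup for jobA (localizing) and then jobB (not localizing), runs a
    // single cleaner thread, and returns how many entries were cleaned up within
    // waitMillis. If localizationFinishes is true, jobA is released right away.
    static int cleanedCount(boolean localizationFinishes, long waitMillis) {
        Map<String, RunningJob> runningJobs = new HashMap<>();
        RunningJob jobA = new RunningJob(true);
        runningJobs.put("jobA", jobA);
        runningJobs.put("jobB", new RunningJob(false));
        BlockingQueue<String> tasksToCleanup =
                new LinkedBlockingQueue<>(Arrays.asList("jobA", "jobB"));
        BlockingQueue<String> cleaned = new LinkedBlockingQueue<>();

        Thread cleaner = new Thread(() -> {
            try {
                while (true) {
                    String jobId = tasksToCleanup.take();
                    checkJobStatusAndWait(runningJobs, jobId); // may block indefinitely
                    cleaned.add(jobId);
                }
            } catch (InterruptedException e) {
                // Interrupted by the demo harness: let the thread end.
            }
        });
        cleaner.setDaemon(true);
        cleaner.start();

        if (localizationFinishes) {
            synchronized (jobA) {
                jobA.localizing = false;
                jobA.notifyAll();
            }
        }

        int count = 0;
        try {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(waitMillis);
            while (count < 2) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0
                        || cleaned.poll(remaining, TimeUnit.NANOSECONDS) == null) {
                    break;
                }
                count++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            cleaner.interrupt();
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("jobA stuck localizing: cleaned " + cleanedCount(false, 500));
        System.out.println("jobA localized:        cleaned " + cleanedCount(true, 500));
    }
}
```

In the stuck case jobB's cleanup is never processed even though jobB itself is not localizing, which is the head-of-line blocking reported in this issue. The sketch only reproduces the symptom; it does not suggest how the TaskTracker should be changed.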

East or west! Koji is the best!
Soda lemon ginger pop! Koji is on the top!
Yyaayyy yaayyyy yaayyyy for Koji!
                
> Task stuck in JobLocalizer prevented other tasks on the same node from 
> committing
> ---------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4088
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4088
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv1
>    Affects Versions: 0.20.205.0
>            Reporter: Ravi Prakash
>            Assignee: Ravi Prakash
>            Priority: Critical
>
> We saw that, as a result of HADOOP-6963, one task was stuck in this state:
> Thread 23668: (state = IN_NATIVE)
>  - java.io.UnixFileSystem.getBooleanAttributes0(java.io.File) @bci=0 
> (Compiled frame; information may be imprecise)
>  - java.io.UnixFileSystem.getBooleanAttributes(java.io.File) @bci=2, line=228 
> (Compiled frame)
>  - java.io.File.exists() @bci=20, line=733 (Compiled frame)
>  - org.apache.hadoop.fs.FileUtil.getDU(java.io.File) @bci=3, line=446 
> (Compiled frame)
>  - org.apache.hadoop.fs.FileUtil.getDU(java.io.File) @bci=52, line=455 
> (Compiled frame)
>  - org.apache.hadoop.fs.FileUtil.getDU(java.io.File) @bci=52, line=455 
> (Compiled frame)
> ....
> .... TONS MORE OF THIS SAME LINE
>  - org.apache.hadoop.fs.FileUtil.getDU(java.io.File) @bci=52, line=455 
> (Compiled frame)
> .....
> .....
>  - org.apache.hadoop.fs.FileUtil.getDU(java.io.File) @bci=52, line=455 
> (Compiled frame)
>  - org.apache.hadoop.fs.FileUtil.getDU(java.io.File) @bci=52, line=455 
> (Interpreted frame)
> ne=451 (Interpreted frame)
>  - 
> org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCacheObjects(org.apache.hadoop.conf.Configuration,
>  java.net.URI[], org.apache.hadoop.fs.Path[], long[], boolean[], boolean) 
> @bci=150, line=324 (Interpreted frame)
>  - 
> org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCache(org.apache.hadoop.conf.Configuration)
>  @bci=40, line=349 (Interpreted frame) 51, line=383 (Interpreted frame)
>  - org.apache.hadoop.mapred.JobLocalizer.runSetup(java.lang.String, 
> java.lang.String, org.apache.hadoop.fs.Path, 
> org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=46, line=477 
> (Interpreted frame)
>  - org.apache.hadoop.mapred.JobLocalizer$3.run() @bci=20, line=534 
> (Interpreted frame)
>  - org.apache.hadoop.mapred.JobLocalizer$3.run() @bci=1, line=531 
> (Interpreted frame)
>  - 
> java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction,
>  java.security.AccessControlContext) @bci=0 (Interpreted frame)
>  - javax.security.auth.Subject.doAs(javax.security.auth.Subject, 
> java.security.PrivilegedExceptionAction) @bci=42, line=396 (Interpreted frame)
>  - 
> org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction)
>  @bci=14, line=1082 (Interpreted frame)
>  - org.apache.hadoop.mapred.JobLocalizer.main(java.lang.String[]) @bci=266, 
> line=530 (Interpreted frame)
> Meanwhile, all other tasks on the same node were stuck in this state:
> Thread 32141: (state = BLOCKED)
>  - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
>  - 
> org.apache.hadoop.mapred.Task.commit(org.apache.hadoop.mapred.TaskUmbilicalProtocol,
>  org.apache.hadoop.mapred.Task$TaskReporter, 
> org.apache.hadoop.mapreduce.OutputCommitter) @bci=24, line=980 (Compiled 
> frame)
>  - 
> org.apache.hadoop.mapred.Task.done(org.apache.hadoop.mapred.TaskUmbilicalProtocol,
>  org.apache.hadoop.mapred.Task$TaskReporter) @bci=146, line=871 (Interpreted 
> frame)
>  - org.apache.hadoop.mapred.ReduceTask.run(org.apache.hadoop.mapred.JobConf, 
> org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=470, line=423 
> (Interpreted frame)
>  - org.apache.hadoop.mapred.Child$4.run() @bci=29, line=255 (Interpreted 
> frame)
>  - 
> java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction,
>  java.security.AccessControlContext) @bci=0 (Interpreted frame)
>  - javax.security.auth.Subject.doAs(javax.security.auth.Subject, 
> java.security.PrivilegedExceptionAction) @bci=42, line=396 (Interpreted frame)
>  - 
> org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction)
>  @bci=14, line=1082 (Interpreted frame)
>  - org.apache.hadoop.mapred.Child.main(java.lang.String[]) @bci=738, line=249 
> (Interpreted frame)
> This should never happen. A stuck task should never prevent other tasks from 
> different jobs on the same node from committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
