[jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while

Joydeep Sen Sarma (JIRA) Sat, 27 Sep 2008 10:18:35 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635169#action_12635169
 ]


Joydeep Sen Sarma commented on HADOOP-4296:
-------------------------------------------

It almost seems like there is some kind of race condition between a job 
completing and that job disappearing from the tracker?


        try {
          if (running.isComplete()) {
            break;
          }
          running = jc.getJob(jobId);
993:          String report =
            (" map " + StringUtils.formatPercent(running.mapProgress(), 0)+
             " reduce " +
             StringUtils.formatPercent(running.reduceProgress(), 0));

Is it possible that the status behind running.isComplete() is stale? I looked 
at the isComplete call a little bit more:

  static long MAX_JOBPROFILE_AGE = 1000 * 2;

    synchronized void ensureFreshStatus() throws IOException {
      if (System.currentTimeMillis() - statustime > MAX_JOBPROFILE_AGE) {


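Ouch - so what might be happening is that isComplete() returns false based on 
stale status (anything younger than 2 seconds is served from the cached copy). 
But when we then request the job from the jobtracker, the job is already gone 
(very high volume of jobs from the same user), so jc.getJob() presumably 
returns null and the next line (993) throws the NullPointerException?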
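
To make the suspected interleaving concrete, here is a minimal, self-contained 
sketch that replays it with a simulated clock. It is not the Hadoop code: 
FakeTracker, CachedJob, getJob and reproducesRace are hypothetical stand-ins, 
the 1 second polling interval is an assumption, and the idea that a lookup for 
a retired job yields null is exactly the part of the theory that still needs 
confirming.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleStatusSketch {
    // Same value as MAX_JOBPROFILE_AGE in the snippet above.
    static final long MAX_JOBPROFILE_AGE = 1000 * 2;

    /** Hypothetical stand-in for the jobtracker; forgets a job once it is retired. */
    static class FakeTracker {
        private final Map<String, Boolean> jobs = new HashMap<String, Boolean>();
        void submit(String id) { jobs.put(id, Boolean.FALSE); }
        void finish(String id) { jobs.put(id, Boolean.TRUE); }
        void retire(String id) { jobs.remove(id); }
        /** Completion flag, or null if the tracker no longer knows the job. */
        Boolean status(String id) { return jobs.get(id); }
    }

    /** Hypothetical client-side handle that caches the completion flag. */
    static class CachedJob {
        private final FakeTracker tracker;
        private final String id;
        private boolean complete;
        private long statusTime;

        CachedJob(FakeTracker tracker, String id, boolean complete, long now) {
            this.tracker = tracker;
            this.id = id;
            this.complete = complete;
            this.statusTime = now;
        }

        boolean isComplete(long now) {
            // Same age test as ensureFreshStatus(): refetch only when the copy is old.
            if (now - statusTime > MAX_JOBPROFILE_AGE) {
                Boolean fresh = tracker.status(id);
                if (fresh != null) {
                    complete = fresh;
                }
                statusTime = now;
            }
            return complete;
        }
    }

    /** Assumption: a lookup for a job the tracker has forgotten yields null. */
    static CachedJob getJob(FakeTracker tracker, String id, long now) {
        Boolean status = tracker.status(id);
        return status == null ? null : new CachedJob(tracker, id, status, now);
    }

    /** Replays the interleaving; true if a stale "not complete" meets a null lookup. */
    static boolean reproducesRace() {
        FakeTracker tracker = new FakeTracker();
        tracker.submit("job_1");
        long now = 0;
        CachedJob running = getJob(tracker, "job_1", now);

        now += 1000;              // assumed polling sleep of 1 second
        tracker.finish("job_1");  // job completes while the client sleeps ...
        tracker.retire("job_1");  // ... and is dropped by the busy tracker

        boolean staleSaysRunning = !running.isComplete(now); // answered from cache
        CachedJob refetched = getJob(tracker, "job_1", now); // job is gone
        return staleSaysRunning && refetched == null;
    }

    public static void main(String[] args) {
        System.out.println("race reproduced: " + reproducesRace());
    }
}
```

In the sketch the cached copy is only 1000 ms old at the second poll, so the 
2000 ms limit never forces a refresh before the lookup comes back empty.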

Is this a plausible theory? If so - the fix would simply be to reduce 
MAX_JOBPROFILE_AGE to match the polling loop sleep time in jobclient.
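
As a rough check of what that change would do, the age test from 
ensureFreshStatus() can be evaluated on its own for a few limits. This is only 
a sketch: needsRefresh is a hypothetical helper that copies the comparison 
shown above, and the 1000 ms sleep is an assumed polling interval.

```java
public class RefreshCheckSketch {
    /** True when a status fetched at statusTime counts as too old at time now. */
    static boolean needsRefresh(long statusTime, long now, long maxAgeMs) {
        return now - statusTime > maxAgeMs;
    }

    public static void main(String[] args) {
        long sleepMs = 1000; // assumed polling interval, not taken from the source
        System.out.println(needsRefresh(0, sleepMs, 2000)); // current limit: false
        System.out.println(needsRefresh(0, sleepMs, 1000)); // limit equal to sleep: false
        System.out.println(needsRefresh(0, sleepMs, 500));  // limit below sleep: true
    }
}
```

Because the comparison is a strict ">", a limit exactly equal to the sleep 
time refreshes only when the sleep overshoots, so a value slightly below the 
sleep time may be the safer choice.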



> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Priority: Critical
>
> At very busy times - we get a wave of job client failures all at the same 
> time. the failures come when the job is about to complete. when we look at 
> the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99% 
> java.lang.NullPointerException
>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
>       at 
> com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
