[ http://issues.apache.org/jira/browse/HADOOP-750?page=comments#action_12453764 ]

Owen O'Malley commented on HADOOP-750:
--------------------------------------

The thread call stacks look like:

Thread 1525 (Thread-1403):
  State: WAITING
  Blocked count: 126
  Waited count: 1
  Waiting on [EMAIL PROTECTED]
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:474)
    org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:207)

Thread 102 (Thread-89):
  State: TIMED_WAITING
  Blocked count: 7
  Waited count: 0
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.mapred.ReduceTaskRunner$MapCopyLeaseChecker.run(ReduceTaskRunner.java:303)

Thread 79 (Thread-66):
  State: WAITING
  Blocked count: 151
  Waited count: 26481
  Waiting on [EMAIL PROTECTED]
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:474)
    org.apache.hadoop.mapred.ReduceTaskRunner.getCopyResult(ReduceTaskRunner.java:527)
    org.apache.hadoop.mapred.ReduceTaskRunner.prepare(ReduceTaskRunner.java:453)
    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:120)
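
The fields in the dump above (state, blocked count, waited count, lock name, stack) match what java.lang.management.ThreadInfo exposes. A minimal sketch of producing a dump in a similar layout follows; the class name ThreadDumpSketch and the method dump are hypothetical and are not taken from the Hadoop source.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration; not Hadoop code.
public class ThreadDumpSketch {

    // Formats every live thread with at most maxDepth stack frames each.
    public static String dump(int maxDepth) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : bean.getThreadInfo(bean.getAllThreadIds(), maxDepth)) {
            if (info == null) {
                continue; // the thread exited between the two calls
            }
            out.append("Thread ").append(info.getThreadId())
               .append(" (").append(info.getThreadName()).append("):\n");
            out.append("  State: ").append(info.getThreadState()).append('\n');
            out.append("  Blocked count: ").append(info.getBlockedCount()).append('\n');
            out.append("  Waited count: ").append(info.getWaitedCount()).append('\n');
            if (info.getLockName() != null) {
                // Only set while the thread is blocked or waiting on a monitor.
                out.append("  Waiting on ").append(info.getLockName()).append('\n');
            }
            out.append("  Stack:\n");
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump(20));
    }
}
```

A thread with a high waited count that is parked in Object.wait(), like Thread 79 above, has been notified many times before; the dump alone does not show which notification it is still waiting for.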


> race condition on stalled map output fetches
> --------------------------------------------
>
>                 Key: HADOOP-750
>                 URL: http://issues.apache.org/jira/browse/HADOOP-750
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.8.0
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.9.0
>
>
> I've seen reduces getting killed because of a race condition in the 
> ReduceTaskRunner.  In the logs it looks like:
>
> 2006-11-27 08:40:44,795 WARN org.apache.hadoop.mapred.TaskRunner: Map output copy stalled on http://kry2296.inktomisearch.com:7030/mapOutput?map=task_0001_m_015626_0
> ...
> 2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Need 52 map output(s)
> 2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Got 39 known map output location(s); scheduling...
> 2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Scheduled 0 of 39 known outputs (0 slow hosts and 39 dup hosts)
> ...
> 2006-11-27 09:16:47,071 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000658_0 0.3328575% reduce > copy (28679 of 28720 at 0.76 MB/s) >
> ...
> 2006-11-27 09:16:47,338 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 done copying task_0001_m_015462_0 output from node1
> ...
> 2006-11-27 09:36:51,398 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000658_0: Task failed to report status for 1204 seconds. Killing.
>
> Basically, the handling of the stall has a race condition that leaves the 
> fetcher in a bad state. At the end of the fetch, all of the tasks finish and 
> their results never get handled. When the thread times out, all of the map 
> output copiers are waiting for things to fetch and the prepare thread is 
> waiting for results.
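
The end state described above, with the prepare thread parked in Object.wait() and no copier left to notify it, is the usual shape of a lost result. As a generic illustration only (this is not the HADOOP-750 fix and not the actual ReduceTaskRunner code), a guarded wait that tracks outstanding copies and bounds each wait hands control back to the caller instead of blocking forever. The class CopyResultSketch, the inFlight counter, and all method names below are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical illustration of a guarded, bounded wait; not Hadoop code.
public class CopyResultSketch {
    private final Queue<String> results = new ArrayDeque<>();
    private int inFlight = 0; // copies handed out and not yet reported back

    // Called when a copy is handed to a copier thread.
    public synchronized void scheduled() {
        inFlight++;
    }

    // Called when a copy ends for any reason (success, failure, or stall),
    // so that every scheduled copy is eventually accounted for.
    public synchronized void finished(String result) {
        inFlight--;
        results.add(result);
        notifyAll();
    }

    // Returns the next result, or null if nothing is outstanding
    // or the timeout elapses before a result arrives.
    public synchronized String getCopyResult(long timeoutMs) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (results.isEmpty()) {
            if (inFlight == 0) {
                return null; // no copier can ever notify us; do not wait
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return null; // timed out; caller can report status and retry
            }
            wait(remainingMs);
        }
        return results.remove();
    }
}
```

Returning null on timeout lets the caller report progress and reschedule, rather than sitting silent until the task is killed for not reporting status.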


        
