[ http://issues.apache.org/jira/browse/HADOOP-750?page=all ]

Doug Cutting updated HADOOP-750:
--------------------------------

        Status: Resolved  (was: Patch Available)
    Resolution: Fixed

I just fixed this.  Thanks, Owen!

> race condition on stalled map output fetches
> --------------------------------------------
>
>                 Key: HADOOP-750
>                 URL: http://issues.apache.org/jira/browse/HADOOP-750
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.8.0
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.9.0
>
>         Attachments: fetch-no-lease.patch
>
>
> I've seen reduces getting killed because of a race condition in the 
> ReduceTaskRunner.  In the logs it looks like:
> 2006-11-27 08:40:44,795 WARN org.apache.hadoop.mapred.TaskRunner: Map output copy stalled on http://kry2296.inktomisearch.com:7030/mapOutput?map=task_0001_m_015626_0
> ...
> 2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Need 52 map output(s)
> 2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Got 39 known map output location(s); scheduling...
> 2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Scheduled 0 of 39 known outputs (0 slow hosts and 39 dup hosts)
> ...
> 2006-11-27 09:16:47,071 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000658_0 0.3328575% reduce > copy (28679 of 28720 at 0.76 MB/s) >
> ...
> 2006-11-27 09:16:47,338 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 done copying task_0001_m_015462_0 output from node1
> ...
> 2006-11-27 09:36:51,398 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000658_0: Task failed to report status for 1204 seconds. Killing.
>
> Basically, the handling of the stall has a race condition that leaves the 
> fetcher in a bad state. At the end of the fetch, all of the tasks finish and 
> their results never get handled. When the thread times out, all of the map 
> output copiers are waiting for things to fetch and the prepare thread is 
> waiting for results.
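
The failure mode described above can be illustrated with a small model. This is
only a sketch under stated assumptions, not the ReduceTaskRunner code and not
the contents of fetch-no-lease.patch: the class StalledFetchSketch, its methods,
and the releaseHostOnLateResult flag are all hypothetical names invented for
the illustration. It assumes the scheduler keeps a set of hosts with a fetch in
flight (the "dup hosts" in the log) and that stall handling forgets the fetch
without releasing the host. The interleaving is replayed sequentially so the
outcome is deterministic.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of fetch bookkeeping; NOT Hadoop's actual implementation.
class StalledFetchSketch {
    // Hosts that currently have a fetch in flight (assumed meaning of "dup hosts").
    private final Set<String> busyHosts = new HashSet<>();
    // Ids of fetches whose results the scheduler still expects.
    private final Set<Integer> expected = new HashSet<>();
    // When true, a late result still frees its host (one possible repair).
    private final boolean releaseHostOnLateResult;
    private int nextId = 0;

    StalledFetchSketch(boolean releaseHostOnLateResult) {
        this.releaseHostOnLateResult = releaseHostOnLateResult;
    }

    // Returns a new fetch id, or -1 if the host already has a fetch in flight.
    int schedule(String host) {
        if (!busyHosts.add(host)) {
            return -1;
        }
        int id = nextId++;
        expected.add(id);
        return id;
    }

    // Stall handling: stop expecting the result, but leave the host marked busy.
    void onStall(int id) {
        expected.remove(id);
    }

    // A copier reports a finished fetch, possibly after the stall was declared.
    void onResult(int id, String host) {
        boolean wanted = expected.remove(id);
        if (wanted || releaseHostOnLateResult) {
            busyHosts.remove(host);
        }
        // Otherwise the result is dropped and the host stays busy forever.
    }

    public static void main(String[] args) {
        for (boolean release : new boolean[] {false, true}) {
            StalledFetchSketch s = new StalledFetchSketch(release);
            int id = s.schedule("hostA"); // fetch starts
            s.onStall(id);                // scheduler declares it stalled
            s.onResult(id, "hostA");      // copier finishes afterwards
            System.out.println("releaseHostOnLateResult=" + release
                + " -> reschedule returns " + s.schedule("hostA"));
        }
    }
}
```

With the flag off, the second schedule call returns -1: the host is still
counted as busy, nothing can be scheduled, and nothing will ever report back,
which matches the "Scheduled 0 of 39" symptom in spirit. With the flag on, the
late result frees the host and the fetch can be rescheduled.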

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
