[ 
https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HADOOP-1862:
----------------------------------

        Fix Version/s:     (was: 0.15.0)
                       0.14.2
    Affects Version/s: 0.13.0
                       0.13.1
                       0.14.0

Ok, I just found this one and it is a bad one. The problem is that the 
TaskTracker, in fetchMapCompletionEvents, stores a cache of the completion 
events indexed by TIP id instead of task id. So there is a race between the 
events for tasks in the same TIP, and if the last event is the failed one, the 
reduces get stuck, because that map is marked as never completed. I've marked 
this as a 0.14.2 fix, but we might need a 0.13 fix too.
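
To make the failure mode concrete, here is a minimal sketch of the keying 
problem. It is not the TaskTracker code: the class, the Event type, the method 
names and the id strings are all hypothetical, and the real cache holds more 
state. It only shows how a cache keyed by TIP id lets a later FAILED event 
hide an earlier SUCCEEDED one from the same TIP, while a cache keyed by task 
id keeps both.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Hadoop's classes: illustrates the keying problem only.
class CompletionEventCacheSketch {
    enum Status { SUCCEEDED, FAILED }

    /** One completion event for a single task (attempt) of a TIP. */
    static final class Event {
        final String tipId;   // id of the task-in-progress (shared by all attempts)
        final String taskId;  // id of the individual task attempt
        final Status status;

        Event(String tipId, String taskId, Status status) {
            this.tipId = tipId;
            this.taskId = taskId;
            this.status = status;
        }
    }

    /** Problematic keying: one slot per TIP, so the last event seen wins. */
    static boolean mapDoneKeyedByTip(List<Event> events, String tipId) {
        Map<String, Event> cache = new HashMap<>();
        for (Event e : events) {
            cache.put(e.tipId, e); // overwrites any earlier event for this TIP
        }
        Event last = cache.get(tipId);
        return last != null && last.status == Status.SUCCEEDED;
    }

    /** One slot per task, so a success is never hidden by a sibling's failure. */
    static boolean mapDoneKeyedByTask(List<Event> events, String tipId) {
        Map<String, Event> cache = new HashMap<>();
        for (Event e : events) {
            cache.put(e.taskId, e);
        }
        for (Event e : cache.values()) {
            if (e.tipId.equals(tipId) && e.status == Status.SUCCEEDED) {
                return true;
            }
        }
        return false;
    }
}
```

With a SUCCEEDED event followed by a FAILED event for two tasks of the same 
TIP, the first method reports the map as not done and the second reports it 
as done.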

> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
>                 Key: HADOOP-1862
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1862
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.13.0, 0.13.1, 0.14.0, 0.14.1
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Blocker
>             Fix For: 0.14.2
>
>         Attachments: HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs. 
> When I look at the job events all 2600 of the maps have succeeded. There have 
> been lots of lost task trackers and shuffle failures. The maps have been run 
> between 1 and 6 times each. I do see some of the events in the task event log 
> are marked OBSOLETE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
