[ https://issues.apache.org/jira/browse/HADOOP-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated HADOOP-1862:
----------------------------------
Attachment: HADOOP-1862_debug.patch
OK, I was fortunate to be able to look at an actual job exhibiting this same
behaviour (thanks to Christian and HADOOP-1874), and here are some insights:
Essentially, there seems to be an issue with the {{TaskCompletionEvent}}s received
by the {{ReduceTask}}. More than one *stuck* reducer (looking for map outputs)
never actually scheduled a copy from the maps whose outputs it is missing,
which leads me to believe that some {{TaskCompletionEvent}}s are going missing.
The other minor bug is the one I pointed out in my previous comment, so I've
attached another patch which incorporates:
a) the previous fix
b) a debug statement to help track the received {{TaskCompletionEvent}}s at the
{{Reduce}} task.
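To illustrate the kind of tracking meant in (b), here is a minimal sketch and not the code in the attached patch. {{CompletionEventTracker}} and its {{Status}} enum are hypothetical stand-ins, not Hadoop's actual classes, and the handling of OBSOLETE (an earlier success for that map no longer counts) is an assumption. The idea is to log every event as it arrives and to be able to list the maps for which the reducer has no usable success event, since those are the maps it would never schedule a copy from.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for reducer-side bookkeeping; not Hadoop's real classes.
class CompletionEventTracker {
    // Assumed set of statuses, loosely modelled on those mentioned in this issue.
    enum Status { SUCCEEDED, FAILED, OBSOLETE }

    // haveOutputLocation[i] is true once a usable success event for map i was seen.
    private final boolean[] haveOutputLocation;
    private int eventsReceived = 0;

    CompletionEventTracker(int numMaps) {
        this.haveOutputLocation = new boolean[numMaps];
    }

    // Record one event and return the debug line that would be logged for it.
    String record(int mapId, Status status) {
        eventsReceived++;
        if (status == Status.SUCCEEDED) {
            haveOutputLocation[mapId] = true;
        } else if (status == Status.OBSOLETE) {
            // Assumption: an obsolete event invalidates an earlier success,
            // so the reducer has to wait for a fresh event for this map.
            haveOutputLocation[mapId] = false;
        }
        // FAILED leaves the state unchanged in this sketch.
        return "event #" + eventsReceived + ": map " + mapId + " " + status;
    }

    // Maps the reducer cannot fetch from yet because no usable event has arrived.
    List<Integer> missingMaps() {
        List<Integer> missing = new ArrayList<>();
        for (int i = 0; i < haveOutputLocation.length; i++) {
            if (!haveOutputLocation[i]) {
                missing.add(i);
            }
        }
        return missing;
    }
}
```

If the job events show every map as succeeded while {{missingMaps()}} on a stuck reducer is still non-empty, the logged lines show which events that reducer actually received.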
Christian: I'd appreciate it if you could try this out. Thanks!
> reduces are getting stuck trying to find map outputs
> ----------------------------------------------------
>
> Key: HADOOP-1862
> URL: https://issues.apache.org/jira/browse/HADOOP-1862
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.14.1
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Priority: Blocker
> Fix For: 0.15.0
>
> Attachments: HADOOP-1862_debug.patch, HADOOP-1862_prelim.patch
>
>
> Some of the reduces have been stuck for hours looking for 137 map outputs.
> When I look at the job events all 2600 of the maps have succeeded. There have
> been lots of lost task trackers and shuffle failures. The maps have been run
> between 1 and 6 times each. I do see some of the events in the task event log
> are marked OBSOLETE.