[ 
https://issues.apache.org/jira/browse/HADOOP-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj Das updated HADOOP-1183:
--------------------------------

    Attachment: 1183.patch

Retries of map output fetches can overwrite new events received from the JT 
for the same maps. Let's assume that a tasktracker is lost while we are in the 
process of fetching map outputs from it. There is a race between the moment a 
map output fetch completes with a failure and the moment a new event for the 
same map task is obtained. If the new event arrives before the failure is 
processed, and the fetch corresponding to the new event has not been scheduled 
by then, the new event is lost: it is overwritten by the retry of the old 
failed fetch.

The attached patch should handle this issue: FAILED events are now handled 
explicitly. Please review it (while I am testing it on a big cluster).

> MapTask completion not recorded properly at the Reducer's end
> -------------------------------------------------------------
>
>                 Key: HADOOP-1183
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1183
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>            Priority: Critical
>         Attachments: 1183.patch
>
>
> A couple of reducers were continuously trying to fetch map outputs from a 
> lost tasktracker. Although the tasks that had run on that lost TT were 
> successfully re-executed elsewhere, the Reducers' tasktrackers didn't 
> correctly note those events.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
