[
https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12485481
]
Devaraj Das commented on HADOOP-1158:
-------------------------------------
Yes, the reduce/fetcher should retry a few times (the retry count could be
configurable?) before complaining to the JobTracker. The JobTracker can then
decide whether to re-execute a map based on the percentage of complaints
(>50%?) from the fetching reduces. For example, if there are 10 reduces
currently fetching, and at least 5 of them complain about a failed fetch for a
particular map, then the JobTracker should re-execute that map. Makes sense?
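To illustrate, the decision rule described above could be sketched roughly as
follows. This is a hypothetical sketch, not the actual Hadoop code; the class
and method names (`FetchFailureTracker`, `recordFailure`) are made up for
illustration, and it assumes the JobTracker knows how many reduces are
currently fetching.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the complaint-threshold rule: re-execute a map once
// at least half of the currently-fetching reduces have reported a failed fetch.
public class FetchFailureTracker {
  private final int fetchingReduces;  // number of reduces currently fetching
  private final double threshold;     // e.g. 0.5 for the 50% rule above
  // map task id -> distinct reduce ids that reported a failed fetch for it
  private final Map<String, Set<String>> complaints = new HashMap<>();

  public FetchFailureTracker(int fetchingReduces, double threshold) {
    this.fetchingReduces = fetchingReduces;
    this.threshold = threshold;
  }

  /** Record one complaint; returns true if the map should be re-executed. */
  public boolean recordFailure(String mapTaskId, String reduceTaskId) {
    Set<String> reporters =
        complaints.computeIfAbsent(mapTaskId, k -> new HashSet<>());
    reporters.add(reduceTaskId);  // duplicates from one reduce count once
    return reporters.size() >= fetchingReduces * threshold;
  }
}
```

With 10 fetching reduces and a 0.5 threshold, the fifth distinct complaint for
a given map would trip the rule, matching the example in the comment.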
> JobTracker should collect statistics of failed map output fetches, and take
> decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty
> server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-1158
> URL: https://issues.apache.org/jira/browse/HADOOP-1158
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.12.2
> Reporter: Devaraj Das
>
> The JobTracker should keep track (with feedback from Reducers) of how many
> times a fetch for a particular map output has failed. If this exceeds a
> certain threshold, that map should be declared lost and re-executed
> elsewhere. Based on the number of such complaints from Reducers, the
> JobTracker can also blacklist the TaskTracker. This will make the framework
> more reliable - it will take care of faulty TaskTrackers that intermittently
> fail to serve up map outputs without the exception being properly
> raised/handled (e.g. when the problem occurs in the Jetty server).
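The blacklisting side of the proposal could be sketched in the same spirit.
Again a hypothetical illustration, not the actual implementation: the class
name, the `mapLost` callback, and the fixed lost-map limit are all assumptions
made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: blacklist a TaskTracker after it has had some number
// of map outputs declared lost due to repeated fetch failures.
public class TrackerBlacklist {
  private final int lostMapLimit;  // lost maps tolerated before blacklisting
  private final Map<String, Integer> lostMaps = new HashMap<>();

  public TrackerBlacklist(int lostMapLimit) {
    this.lostMapLimit = lostMapLimit;
  }

  /** Called when a map on this tracker is declared lost; true => blacklist. */
  public boolean mapLost(String trackerName) {
    int n = lostMaps.merge(trackerName, 1, Integer::sum);
    return n >= lostMapLimit;
  }
}
```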
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.