[ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502222 ]
Devaraj Das commented on HADOOP-1158:
-------------------------------------
Arun, I think this issue calls for a somewhat longer-term solution to a problem,
and we have enough time to work toward it. I'd still argue that the best place
for the logic behind killing the reduces is the JobTracker. Exceptions like the
disk exception and the ping exception are very local cases where a task decides
to kill itself, but this issue involves a certain global element (e.g., the
dependency on maps), and the JobTracker is the only guy with a global picture
of jobs. I don't see how we would lose simplicity by putting the logic in the
JobTracker. I understand that it will have to maintain a few more bytes per
task, but that's not unreasonable.
> JobTracker should collect statistics of failed map output fetches, and take
> decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty
> server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-1158
> URL: https://issues.apache.org/jira/browse/HADOOP-1158
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.12.2
> Reporter: Devaraj Das
> Assignee: Arun C Murthy
>
> The JobTracker should keep track (with feedback from Reducers) of how many
> times a fetch for a particular map output failed. If this exceeds a certain
> threshold, then that map should be declared lost and re-executed
> elsewhere. Based on the number of such complaints from Reducers, the
> JobTracker can blacklist the TaskTracker. This will make the framework
> more reliable: it will take care of (faulty) TaskTrackers that at times
> consistently fail to serve map outputs without properly raising or handling
> an exception, e.g., when the exception or problem occurs in the Jetty
> server.
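
A minimal sketch of the bookkeeping described above, assuming it lives in the JobTracker as argued in the comment. The class `FetchFailureTracker`, its methods, and both thresholds are hypothetical illustrations, not Hadoop's actual API. It counts fetch-failure reports per map output and complaints per TaskTracker, flags a map as lost once its count reaches the threshold, and blacklists a TaskTracker once it collects enough complaints.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical JobTracker-side tracker for failed map-output fetches.
 * Names and thresholds are illustrative assumptions, not Hadoop's API.
 */
public class FetchFailureTracker {
    private final int maxFetchFailuresPerMap;   // assumed: reports before a map is declared lost
    private final int maxComplaintsPerTracker;  // assumed: complaints before a TaskTracker is blacklisted

    // map task id -> failed-fetch reports received from reducers
    private final Map<String, Integer> fetchFailures = new HashMap<>();
    // TaskTracker name -> total complaints against it, across all its maps
    private final Map<String, Integer> complaints = new HashMap<>();
    private final Set<String> blacklisted = new HashSet<>();

    public FetchFailureTracker(int maxFetchFailuresPerMap, int maxComplaintsPerTracker) {
        this.maxFetchFailuresPerMap = maxFetchFailuresPerMap;
        this.maxComplaintsPerTracker = maxComplaintsPerTracker;
    }

    /**
     * Records that a reducer failed to fetch mapId's output from trackerName.
     * Returns true if the map should now be declared lost and re-executed elsewhere.
     */
    public boolean reportFetchFailure(String mapId, String trackerName) {
        int trackerComplaints = complaints.merge(trackerName, 1, Integer::sum);
        if (trackerComplaints >= maxComplaintsPerTracker) {
            blacklisted.add(trackerName);
        }
        int failures = fetchFailures.merge(mapId, 1, Integer::sum);
        if (failures >= maxFetchFailuresPerMap) {
            fetchFailures.remove(mapId); // start fresh for the re-executed map
            return true;
        }
        return false;
    }

    public boolean isBlacklisted(String trackerName) {
        return blacklisted.contains(trackerName);
    }

    public static void main(String[] args) {
        FetchFailureTracker tracker = new FetchFailureTracker(3, 5);
        for (int i = 1; i <= 3; i++) {
            boolean lost = tracker.reportFetchFailure("map_0001", "tracker_a");
            System.out.println("report " + i + ": map lost = " + lost);
        }
        System.out.println("tracker_a blacklisted = " + tracker.isBlacklisted("tracker_a"));
    }
}
```

Running `main` prints `map lost = true` only on the third report. `tracker_a` stays off the blacklist because it has three of the five assumed complaints. Keying the per-map counter by map id and resetting it when the map is declared lost is one simple choice. A real implementation might instead key by task attempt, so that reports against an old attempt do not count against the new one.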