[
https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501591
]
Arun C Murthy commented on HADOOP-1158:
---------------------------------------
Some early thoughts...
Bottom line: we don't want the reducer, and hence the job, to get stuck forever.
The main issue is that when a reducer is stuck in shuffle, it's hard to say
accurately whether the fault lies with the map side (Jetty acting weird), the
reduce side, or both. Having said that, it's pertinent to keep in mind that
_normally_ maps are cheaper to re-execute.
Given the above, I'd like to propose something along these lines:
a) The reducer maintains a per-map count of fetch failures.
b) Given sufficient fetch failures for a particular map (say 3 or 4), the
reducer complains to the JobTracker via a new RPC:
{code:title=JobTracker.java}
public synchronized void notifyFailedFetch(String reduceTaskId,
                                           String mapTaskId) {
  // ...
}
{code}
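For illustration, the reducer-side bookkeeping for (a) and (b) could look
something like the sketch below; the threshold constant, the jobTracker proxy,
and noteFetchFailure() are all hypothetical names, not existing Hadoop code:
{code:title=ReduceTask.java (sketch)}
// Sketch only: per-map count of fetch failures seen by this reducer.
private final Map<String, Integer> fetchFailuresPerMap =
  new HashMap<String, Integer>();

// Hypothetical threshold per (b): complain after every 3 failures for a map.
private static final int FETCH_FAILURE_NOTIFY_THRESHOLD = 3;

private void noteFetchFailure(String mapTaskId) throws IOException {
  Integer old = fetchFailuresPerMap.get(mapTaskId);
  int failures = (old == null) ? 1 : old + 1;
  fetchFailuresPerMap.put(mapTaskId, failures);
  if (failures % FETCH_FAILURE_NOTIFY_THRESHOLD == 0) {
    // 'jobTracker' stands in for whatever RPC proxy the reducer holds;
    // using modulo lets the same reducer complain repeatedly, per (c).
    jobTracker.notifyFailedFetch(reduceTaskId, mapTaskId);
  }
}
{code}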
c) The JobTracker maintains a per-map count of failed-fetch notifications, and
given a sufficient number of them (say 2 or 3?) from *any* reducer (even
multiple times from the same reducer), fails the map and re-schedules it
elsewhere. This handles two cases: a) faulty maps are re-executed, and b) the
corner case where only the last remaining reducer is stuck on a given map, in
which case the map still has to be re-executed.
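To make (c) concrete, the JobTracker side could be as simple as the sketch
below; the threshold and the failMapTask() helper are placeholders:
{code:title=JobTracker.java (sketch)}
// Sketch only: per-map count of failed-fetch notifications from reducers.
private final Map<String, Integer> failedFetchNotifications =
  new HashMap<String, Integer>();

// Hypothetical threshold per (c): 2-3 notifications from any reducer(s).
private static final int MAP_FAILURE_THRESHOLD = 3;

public synchronized void notifyFailedFetch(String reduceTaskId,
                                           String mapTaskId) {
  Integer old = failedFetchNotifications.get(mapTaskId);
  int notifications = (old == null) ? 1 : old + 1;
  failedFetchNotifications.put(mapTaskId, notifications);
  if (notifications >= MAP_FAILURE_THRESHOLD) {
    failedFetchNotifications.remove(mapTaskId);
    // Placeholder: declare the map lost so normal scheduling
    // re-executes it elsewhere.
    failMapTask(mapTaskId);
  }
}
{code}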
d) To counter the case of faulty reducers, we could implement a scheme where
the reducer kills itself once it has notified the JobTracker of more than,
say, 5 unique faulty fetches. This will ensure that a faulty reducer does not
result in the JobTracker spawning maps willy-nilly...
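And for (d), the reducer could track how many *distinct* maps it has
complained about and fail fast past a limit (again, all names below are made
up):
{code:title=ReduceTask.java (sketch)}
// Sketch only: distinct maps this reducer has complained about.
private final Set<String> mapsComplainedAbout = new HashSet<String>();

// Hypothetical limit per (d): at most 5 unique faulty fetches.
private static final int MAX_UNIQUE_FETCH_COMPLAINTS = 5;

private void afterNotifyingJobTracker(String mapTaskId) {
  mapsComplainedAbout.add(mapTaskId);
  if (mapsComplainedAbout.size() > MAX_UNIQUE_FETCH_COMPLAINTS) {
    // This reducer is the likely culprit: kill it rather than have
    // the JobTracker spawn maps willy-nilly.
    throw new RuntimeException("Too many unique failed fetches; "
                               + "assuming this reduce is faulty");
  }
}
{code}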
Thoughts?
> JobTracker should collect statistics of failed map output fetches, and take
> decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty
> server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-1158
> URL: https://issues.apache.org/jira/browse/HADOOP-1158
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.12.2
> Reporter: Devaraj Das
> Assignee: Arun C Murthy
>
> The JobTracker should keep a track (with feedback from Reducers) of how many
> times a fetch for a particular map output failed. If this exceeds a certain
> threshold, then that map should be declared as lost, and should be reexecuted
> elsewhere. Based on the number of such complaints from Reducers, the
> JobTracker can blacklist the TaskTracker. This will make the framework
> reliable - it will take care of (faulty) TaskTrackers that persistently
> fail to serve up map outputs (for which exceptions are not properly
> raised/handled, e.g., if the exception/problem happens in the Jetty
> server).