[
https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501591
]
Arun C Murthy commented on HADOOP-1158:
---------------------------------------
Some early thoughts...
Bottom line: we don't want the reducer, and hence the job, to get stuck forever.
The main issue is that when a reducer is stuck in shuffle, it's hard to say
accurately whether the fault lies with the map side (Jetty acting weird), the
reduce side, or both. Having said that, it's pertinent to keep in mind that
_normally_ maps are cheaper to re-execute.
Given the above, I'd like to propose something along these lines:
a) The reducer maintains a per-map count of fetch failures.
b) Given sufficient fetch failures for a particular map (say 3 or 4), the
reducer complains to the JobTracker via a new RPC:
{code:title=JobTracker.java}
public synchronized void notifyFailedFetch(String reduceTaskId,
                                           String mapTaskId) {
  // ...
}
{code}
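For illustration, the reducer-side bookkeeping for (a) and (b) could look
something like the sketch below; the threshold constant, the jobTracker proxy,
and noteFetchFailure() are all hypothetical names, not existing Hadoop code:
{code:title=ReduceTask.java (sketch)}
// Sketch only: per-map count of fetch failures seen by this reducer.
private final Map<String, Integer> fetchFailuresPerMap =
  new HashMap<String, Integer>();

// Hypothetical threshold per (b): complain after every 3 failures for a map.
private static final int FETCH_FAILURE_NOTIFY_THRESHOLD = 3;

private void noteFetchFailure(String mapTaskId) throws IOException {
  Integer old = fetchFailuresPerMap.get(mapTaskId);
  int failures = (old == null) ? 1 : old + 1;
  fetchFailuresPerMap.put(mapTaskId, failures);
  if (failures % FETCH_FAILURE_NOTIFY_THRESHOLD == 0) {
    // 'jobTracker' stands in for whatever RPC proxy the reducer holds;
    // using modulo lets the same reducer complain repeatedly, per (c).
    jobTracker.notifyFailedFetch(reduceTaskId, mapTaskId);
  }
}
{code}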
c) The JobTracker maintains a per-map count of failed-fetch notifications, and
given a sufficient number of them (say 2 or 3?) from *any* reducer (even
multiple times from the same reducer), fails the map and re-schedules it
elsewhere. This handles two cases: a) faulty maps are re-executed, and b) the
corner case where only the last remaining reducer is stuck on a given map, in
which case the map still has to be re-executed.
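To make (c) concrete, the JobTracker side could be as simple as the sketch
below; the threshold and the failMapTask() helper are placeholders:
{code:title=JobTracker.java (sketch)}
// Sketch only: per-map count of failed-fetch notifications from reducers.
private final Map<String, Integer> failedFetchNotifications =
  new HashMap<String, Integer>();

// Hypothetical threshold per (c): 2-3 notifications from any reducer(s).
private static final int MAP_FAILURE_THRESHOLD = 3;

public synchronized void notifyFailedFetch(String reduceTaskId,
                                           String mapTaskId) {
  Integer old = failedFetchNotifications.get(mapTaskId);
  int notifications = (old == null) ? 1 : old + 1;
  failedFetchNotifications.put(mapTaskId, notifications);
  if (notifications >= MAP_FAILURE_THRESHOLD) {
    failedFetchNotifications.remove(mapTaskId);
    // Placeholder: declare the map lost so normal scheduling
    // re-executes it elsewhere.
    failMapTask(mapTaskId);
  }
}
{code}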
d) To counter the case of faulty reducers, we could implement a scheme where
the reducer kills itself once it has notified the JobTracker of more than,
say, 5 unique faulty fetches. This will ensure that a faulty reducer does not
result in the JobTracker spawning maps willy-nilly...
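And for (d), the reducer could track how many *distinct* maps it has
complained about and fail fast past a limit (again, all names below are made
up):
{code:title=ReduceTask.java (sketch)}
// Sketch only: distinct maps this reducer has complained about.
private final Set<String> mapsComplainedAbout = new HashSet<String>();

// Hypothetical limit per (d): at most 5 unique faulty fetches.
private static final int MAX_UNIQUE_FETCH_COMPLAINTS = 5;

private void afterNotifyingJobTracker(String mapTaskId) {
  mapsComplainedAbout.add(mapTaskId);
  if (mapsComplainedAbout.size() > MAX_UNIQUE_FETCH_COMPLAINTS) {
    // This reducer is the likely culprit: kill it rather than have
    // the JobTracker spawn maps willy-nilly.
    throw new RuntimeException("Too many unique failed fetches; "
                               + "assuming this reduce is faulty");
  }
}
{code}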
Thoughts?
> JobTracker should collect statistics of failed map output fetches, and take
> decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty
> server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-1158
> URL: https://issues.apache.org/jira/browse/HADOOP-1158
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.12.2
> Reporter: Devaraj Das
> Assignee: Arun C Murthy
>
> The JobTracker should keep a track (with feedback from Reducers) of how many
> times a fetch for a particular map output failed. If this exceeds a certain
> threshold, then that map should be declared as lost, and should be reexecuted
> elsewhere. Based on the number of such complaints from Reducers, the
> JobTracker can blacklist the TaskTracker. This will make the framework
> reliable - it will take care of (faulty) TaskTrackers that persistently
> fail to serve up map outputs (for which exceptions are not properly
> raised/handled, e.g., if the exception/problem happens in the Jetty
> server).