[ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501859 ]

Arun C Murthy commented on HADOOP-1158:
---------------------------------------

bq. The reduce should kill itself when it fails to fetch the map output from 
even the new location, i.e., the 5 unique faulty fetches should each have had 
at least 1 retry (i.e., we don't kill a reduce too early).

Though it makes sense in the long term, I'd vote we keep it simple for now... 
implementing this would entail more complex code and more state to maintain. 
5 notifications anyway mean that the reducer has seen 20 attempts to fetch from 
5 different maps fail. I'd say that, for now, that's sufficient reason to kill 
the reducer.
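
For illustration, a minimal sketch of the reducer-side bookkeeping I have in 
mind - all names are hypothetical, not actual ReduceTask code:

{code}
// Hypothetical sketch: count fetch failures per map, notify the JobTracker
// once a map crosses a per-map retry limit, and kill the reducer once 5
// distinct maps have been reported.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class FetchFailureTracker {
  // 5 notified maps x 4 failed attempts each = the 20 failed fetches above
  private static final int FAILURES_PER_MAP_BEFORE_NOTIFY = 4;
  private static final int NOTIFIED_MAPS_BEFORE_SUICIDE = 5;

  private final Map<String, Integer> failuresPerMap = new HashMap<String, Integer>();
  private final Set<String> notifiedMaps = new HashSet<String>();

  /** Records one failed fetch; returns true if the reducer should kill itself. */
  boolean fetchFailed(String mapTaskId) {
    Integer prev = failuresPerMap.get(mapTaskId);
    int failures = (prev == null) ? 1 : prev + 1;
    failuresPerMap.put(mapTaskId, failures);
    // add() returns true only the first time, so each map is reported once
    if (failures >= FAILURES_PER_MAP_BEFORE_NOTIFY && notifiedMaps.add(mapTaskId)) {
      notifyJobTracker(mapTaskId);  // hypothetical hook: report the faulty fetch
    }
    return notifiedMaps.size() >= NOTIFIED_MAPS_BEFORE_SUICIDE;
  }

  private void notifyJobTracker(String mapTaskId) { /* RPC elided in this sketch */ }
}
{code}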

bq. Also, does it make sense to have the logic behind killing/reexecuting 
reduces in the JobTracker? Two reasons:
bq. 1) since the JobTracker knows very well how many times a reduce complained, 
and for which maps it complained, etc.

If the reducer kills itself, the JobTracker need not maintain information about 
*which* reduces failed to fetch *which* maps; it can make do with a per-taskid 
count of failed fetches (for the maps, as notified by the reducers) - again, 
this leads to simpler code for a first shot.
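
Something along these lines on the JobTracker side - again a hypothetical 
sketch, with the names and threshold being illustrative only:

{code}
// Hypothetical sketch of the simpler JobTracker-side state: just a
// per-map-taskid count of failed fetches reported by reducers, with the map
// declared lost and reexecuted once the count crosses a threshold.
import java.util.HashMap;
import java.util.Map;

class MapOutputFailureCounts {
  private static final int LOST_MAP_THRESHOLD = 3;  // illustrative value only

  private final Map<String, Integer> failedFetchesPerMap =
      new HashMap<String, Integer>();

  /** Called when a reducer reports it could not fetch this map's output. */
  synchronized void reportFailedFetch(String mapTaskId) {
    Integer prev = failedFetchesPerMap.get(mapTaskId);
    int count = (prev == null) ? 1 : prev + 1;
    failedFetchesPerMap.put(mapTaskId, count);
    if (count >= LOST_MAP_THRESHOLD) {
      failedFetchesPerMap.remove(mapTaskId);  // reset for the new attempt
      reexecuteMap(mapTaskId);  // hypothetical hook: declare the map lost
    }
  }

  private void reexecuteMap(String mapTaskId) { /* scheduling elided in this sketch */ }
}
{code}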

bq. 2) consistent behavior - the JobTracker handles the reexecution of maps and 
it might handle the reexecution of reduces as well.

I agree with the general sentiment, but given that this leads to more complex 
code, and that the reducer already knows it has failed to fetch from 5 
different maps, it doesn't make sense for it to wait for the JobTracker to 
fail the task. Also, there is an existing precedent for this behaviour in 
TaskTracker.fsError (the task is marked as 'failed' by the TaskTracker itself 
on an FSError).

Thoughts?

> JobTracker should collect statistics of failed map output fetches, and take 
> decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty 
> server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>
> The JobTracker should keep track (with feedback from Reducers) of how many 
> times a fetch for a particular map output has failed. If this exceeds a 
> certain threshold, then that map should be declared lost and reexecuted 
> elsewhere. Based on the number of such complaints from Reducers, the 
> JobTracker can blacklist the TaskTracker. This will make the framework more 
> reliable - it will take care of (faulty) TaskTrackers that consistently fail 
> to serve up map outputs (for which exceptions are not properly 
> raised/handled, e.g., if the exception/problem happens in the Jetty 
> server).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
