[ 
http://issues.apache.org/jira/browse/HADOOP-181?page=comments#action_12427680 ] 
            
Owen O'Malley commented on HADOOP-181:
--------------------------------------

I agree that the original desire for this patch was born of the TaskTracker 
timeouts that shouldn't happen. Fixing those problems (and we _have_ fixed most 
of them over the last 4 months) should take precedence. That said, I think that 
in the long term we do want something like this patch. If a switch goes 
down for 15 minutes and then comes back up, it does not make sense to 
reshuffle, re-sort, and rerun a reduce that takes hours to run.

All map/reduce applications, even those with speculative execution turned off, 
must permit redundant copies of their tasks for precisely this reason. In this 
case, the JobTracker has decided a given task is dead, but hasn't been able to 
tell the responsible TaskTracker yet. It therefore schedules another instance 
of the supposedly failed task on a different node, so the two instances are 
going to run in parallel for a while.
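
To make that scenario concrete, here is a rough toy model of it. This is only 
a sketch: the class LateHeartbeatSketch, its methods, and the 10 minute expiry 
interval are all made up for illustration and are not taken from the Hadoop 
code or from lost-heartbeat.patch. It shows a tracker going silent, a second 
attempt being scheduled elsewhere, and the late heartbeat then being accepted 
as current status rather than forcing a restart.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; none of these names come from the Hadoop code base. */
final class LateHeartbeatSketch {
    // Assumed expiry interval; the real value is a configuration matter.
    private final long expiryMillis;
    // tracker name -> time of its last heartbeat
    private final Map<String, Long> lastSeen = new HashMap<>();
    // task id -> trackers believed to be running an attempt of that task
    private final Map<String, List<String>> attempts = new HashMap<>();

    LateHeartbeatSketch(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    /** Record that a tracker is running an attempt of a task. */
    void assign(String task, String tracker, long now) {
        attempts.computeIfAbsent(task, k -> new ArrayList<>()).add(tracker);
        lastSeen.putIfAbsent(tracker, now);
    }

    /** True if the tracker has been silent longer than the expiry interval. */
    boolean isOverdue(String tracker, long now) {
        Long seen = lastSeen.get(tracker);
        return seen != null && now - seen > expiryMillis;
    }

    /** Schedule a second attempt of every task held by an overdue tracker. */
    void rescheduleFrom(String overdueTracker, String spareTracker, long now) {
        if (!isOverdue(overdueTracker, now)) {
            return;
        }
        for (List<String> holders : attempts.values()) {
            if (holders.contains(overdueTracker) && !holders.contains(spareTracker)) {
                holders.add(spareTracker);
            }
        }
        lastSeen.putIfAbsent(spareTracker, now);
    }

    /** Accept a heartbeat as current status; returns true if it arrived late. */
    boolean heartbeat(String tracker, long now) {
        boolean late = isOverdue(tracker, now);
        lastSeen.put(tracker, now);
        return late;
    }

    /** Number of attempts of the task that may be running in parallel. */
    int runningAttempts(String task) {
        List<String> holders = attempts.get(task);
        return holders == null ? 0 : holders.size();
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        LateHeartbeatSketch jt = new LateHeartbeatSketch(10 * minute);
        jt.assign("reduce_0", "trackerA", 0L);
        // The switch is down; after 15 minutes the task is rescheduled.
        jt.rescheduleFrom("trackerA", "trackerB", 15 * minute);
        // The switch comes back and trackerA reports in late.
        boolean late = jt.heartbeat("trackerA", 16 * minute);
        System.out.println("late=" + late
            + ", attempts=" + jt.runningAttempts("reduce_0"));
    }
}
```

In this model both attempts of reduce_0 stay registered after the late 
heartbeat, which is why the application has to tolerate redundant copies. 
Deciding which attempt to keep is the part of the model that still needs 
thought.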

I guess for now, let's sit on this patch and contemplate what the model should 
be for dealing with communication problems. We should also monitor this in real 
use to see how often task trackers are being lost, and probably put some effort 
into determining at least whether it is the job tracker or the task tracker 
that is the cause of the delay.

> task trackers should not restart for having a late heartbeat
> ------------------------------------------------------------
>
>                 Key: HADOOP-181
>                 URL: http://issues.apache.org/jira/browse/HADOOP-181
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.6.0
>
>         Attachments: lost-heartbeat.patch
>
>
> TaskTrackers should not close and restart themselves for having a late 
> heartbeat. The JobTracker should just accept their current status.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
