[ 
http://issues.apache.org/jira/browse/HADOOP-181?page=comments#action_12427033 ] 
            
Sameer Paranjpye commented on HADOOP-181:
-----------------------------------------

I feel that improved detection of tasktracker death is a separate issue, which 
needs addressing. At the same time, we should try not to lose work if 
communication between a tasktracker and the jobtracker fails for some reason.

For instance, a tasktracker may appear lost to the jobtracker due to transient 
network problems. In such a case it would be ok for the jobtracker to mark the 
lost tasks as failed and reschedule them on other tasktrackers. 
If communication with the jobtracker is subsequently restored while the job is 
still in progress, the jobtracker can easily mark the lost-and-found tasks as 
succeeded. Multiple instances of a task should be handled by the speculative 
execution code. It seems like we could avoid losing a lot of work if we had 
such a mechanism in place.
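
To make the proposal concrete, here is a rough sketch of the bookkeeping 
described above. It is written in Python for brevity and is not Hadoop code: 
JobTrackerSketch, TaskState, and every method name are hypothetical, and the 
real JobTracker tracks far more state. It only illustrates marking attempts as 
failed when a tracker looks lost, rescheduling them, and accepting a late 
success report if the tracker comes back.

```python
from enum import Enum


class TaskState(Enum):
    RUNNING = "running"
    FAILED = "failed"        # tracker looked lost; task was rescheduled
    SUCCEEDED = "succeeded"


class JobTrackerSketch:
    """Hypothetical bookkeeping only; not the Hadoop JobTracker API."""

    def __init__(self):
        # (task_id, tracker) -> state of that task attempt
        self.attempts = {}

    def launch(self, task_id, tracker):
        self.attempts[(task_id, tracker)] = TaskState.RUNNING

    def tracker_lost(self, tracker, spare_tracker):
        # Mark the lost tracker's running attempts as failed and
        # reschedule each task on another tracker.
        for (task_id, owner), state in list(self.attempts.items()):
            if owner == tracker and state is TaskState.RUNNING:
                self.attempts[(task_id, owner)] = TaskState.FAILED
                self.launch(task_id, spare_tracker)

    def heartbeat(self, tracker, completed_task_ids):
        # A tracker that reappears reports its finished tasks; accept them
        # even if the attempt had been marked failed in the meantime.
        for task_id in completed_task_ids:
            key = (task_id, tracker)
            if key in self.attempts:
                self.attempts[key] = TaskState.SUCCEEDED

    def is_done(self, task_id):
        # A task is done as soon as any one of its attempts has succeeded.
        return any(
            state is TaskState.SUCCEEDED
            for (tid, _), state in self.attempts.items()
            if tid == task_id
        )
```

The sketch leaves the rescheduled duplicate attempt running. Cleaning it up is 
the part that, as noted above, should fall to the speculative execution code.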




> task trackers should not restart for having a late heartbeat
> ------------------------------------------------------------
>
>                 Key: HADOOP-181
>                 URL: http://issues.apache.org/jira/browse/HADOOP-181
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.6.0
>
>         Attachments: lost-heartbeat.patch
>
>
> TaskTrackers should not close and restart themselves for having a late 
> heartbeat. The JobTracker should just accept their current status.


        
