[ http://issues.apache.org/jira/browse/HADOOP-181?page=comments#action_12427033 ]

Sameer Paranjpye commented on HADOOP-181:
-----------------------------------------
I feel that improved detection of tasktracker death is a separate issue that needs addressing. At the same time, we should try not to lose work when communication between a tasktracker and the jobtracker fails for some reason.

For instance, a tasktracker may appear lost to the jobtracker because of transient network problems. In such a case it would be OK for the jobtracker to mark the lost tasks as failed and reschedule them on other tasktrackers. If communication with the jobtracker is subsequently restored while the job is still in progress, the jobtracker can easily mark the lost-and-found tasks as succeeded. Multiple instances of a task should be handled by the speculative execution code.

It seems like we could avoid losing a lot of work if we had such a mechanism in place.

> task trackers should not restart for having a late heartbeat
> ------------------------------------------------------------
>
> Key: HADOOP-181
> URL: http://issues.apache.org/jira/browse/HADOOP-181
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Reporter: Owen O'Malley
> Assigned To: Devaraj Das
> Fix For: 0.6.0
>
> Attachments: lost-heartbeat.patch
>
>
> TaskTrackers should not close and restart themselves for having a late
> heartbeat. The JobTracker should just accept their current status.
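
The recovery behaviour proposed in the comment above could be sketched roughly as follows. This is a toy model only, not Hadoop code: `JobTrackerSketch`, `TaskState`, and every method name are hypothetical and chosen for illustration. It assumes the jobtracker keeps one state per (task, tasktracker) attempt and that a returning tasktracker reports which tasks it completed while out of contact.

```python
from enum import Enum


class TaskState(Enum):
    RUNNING = "running"
    FAILED = "failed"
    SUCCEEDED = "succeeded"


class JobTrackerSketch:
    """Toy model of the proposal; not Hadoop's actual JobTracker."""

    def __init__(self):
        # Maps (task_id, tracker) -> state of that task attempt.
        self.attempts = {}

    def assign(self, task_id, tracker):
        self.attempts[(task_id, tracker)] = TaskState.RUNNING

    def tracker_lost(self, tracker, fallback_tracker):
        """Mark the lost tracker's running attempts failed and reschedule them."""
        for (task_id, owner), state in list(self.attempts.items()):
            if owner == tracker and state is TaskState.RUNNING:
                self.attempts[(task_id, owner)] = TaskState.FAILED
                self.assign(task_id, fallback_tracker)

    def tracker_returned(self, tracker, completed_task_ids):
        """Accept work that a 'lost' tracker finished while out of contact."""
        for task_id in completed_task_ids:
            if self.attempts.get((task_id, tracker)) is TaskState.FAILED:
                self.attempts[(task_id, tracker)] = TaskState.SUCCEEDED
        # The rescheduled duplicate is left RUNNING here; per the comment,
        # speculative execution would be responsible for cleaning it up.

    def is_done(self, task_id):
        return any(
            state is TaskState.SUCCEEDED
            for (tid, _), state in self.attempts.items()
            if tid == task_id
        )
```

In this sketch, calling `tracker_lost("tracker_a", "tracker_b")` and later `tracker_returned("tracker_a", ["map_0"])` leaves `map_0` counted as done without waiting for the rescheduled attempt on `tracker_b`.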