[ https://issues.apache.org/jira/browse/HADOOP-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sameer Paranjpye updated HADOOP-1018:
-------------------------------------
Component/s: mapred
> Single lost heartbeat leads to a "Lost task tracker"
> ----------------------------------------------------
>
> Key: HADOOP-1018
> URL: https://issues.apache.org/jira/browse/HADOOP-1018
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.10.0, 0.11.2, 0.12.0
> Environment: Nutch trunk/ (Hadoop 0.10.0), Linux, JDK 1.5, a cluster
> of 9 machines.
> Reporter: Andrzej Bialecki
>
> Under heavy load, a task tracker may lose the heartbeat response from the
> job tracker. The task tracker then resends its last heartbeat message, which
> the job tracker treats as a "duplicate" and ignores. Because the task tracker
> keeps resending the same heartbeat message, with the same id, no "valid"
> message ever reaches the job tracker, which after a while considers the
> task tracker to be lost. The task tracker cannot recover from this state
> and needs to be restarted.
> Looking at Hadoop trunk/, I believe this problem may still occur. In
> JobTracker.java.heartbeat():992 the JobTracker should not ignore duplicate
> messages but acknowledge them without processing. This would cause the task
> tracker to resync its last heartbeat id with the last heartbeat id
> remembered in the job tracker.
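
The proposal above (acknowledge a duplicate heartbeat without processing it again) could look roughly like the following. This is only a minimal sketch and not the actual JobTracker code: the class HeartbeatSketch, the Response type, and the convention that a tracker echoes the id of the last response it received are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the job tracker side of the heartbeat exchange.
// None of these names come from Hadoop; they exist only for this sketch.
final class HeartbeatSketch {

    /** Hypothetical reply: the id the tracker is expected to echo next. */
    static final class Response {
        final int responseId;
        final boolean replayed; // true when this is a re-sent acknowledgement

        Response(int responseId, boolean replayed) {
            this.responseId = responseId;
            this.replayed = replayed;
        }
    }

    // Last response issued to each tracker, keyed by tracker name.
    private final Map<String, Response> lastResponse = new HashMap<>();
    private int processedCount = 0;

    /**
     * Assumed convention: the tracker sends the id of the last response it
     * actually received. If that does not match the last response we issued,
     * the tracker never saw our reply and is resending an old heartbeat.
     */
    synchronized Response heartbeat(String tracker, int responseId) {
        Response prev = lastResponse.get(tracker);
        if (prev != null && prev.responseId != responseId) {
            // Duplicate: acknowledge again without processing, so the
            // tracker can pick up the id it missed instead of being ignored.
            return new Response(prev.responseId, true);
        }
        processedCount++; // stands in for the real status processing
        Response next = new Response(responseId + 1, false);
        lastResponse.put(tracker, next);
        return next;
    }

    synchronized int processedCount() {
        return processedCount;
    }

    public static void main(String[] args) {
        HeartbeatSketch jt = new HeartbeatSketch();
        Response first = jt.heartbeat("tracker-1", 0);  // processed normally
        Response again = jt.heartbeat("tracker-1", 0);  // reply was lost, resent
        Response next = jt.heartbeat("tracker-1", again.responseId);
        System.out.println(first.responseId + " " + again.replayed + " "
                + next.responseId + " processed=" + jt.processedCount());
    }
}
```

With this shape, a tracker that lost a response gets the same id back on its next attempt and can continue from there, while the job tracker does not process the same status twice.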