Single lost heartbeat leads to a "Lost task tracker"
----------------------------------------------------

                 Key: HADOOP-1018
                 URL: https://issues.apache.org/jira/browse/HADOOP-1018
             Project: Hadoop
          Issue Type: Bug
    Affects Versions: 0.10.0, 0.11.2, 0.12.0
         Environment: Nutch trunk/ (Hadoop 0.10.0), Linux, JDK 1.5, a cluster 
of 9 machines.
            Reporter: Andrzej Bialecki 


Under heavy load, a task tracker may lose the heartbeat response from the 
JobTracker. The task tracker then resends its last heartbeat message, which 
the job tracker treats as a "duplicate" and ignores. Since the task tracker 
keeps resending the same heartbeat message, with the same id, over and over 
again, no "valid" messages reach the job tracker, so after a while it 
considers the task tracker to be lost. The task tracker cannot recover from 
this state and needs to be restarted.
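
To make the sequence above concrete, here is a toy model of it. It is only 
a sketch: the class names, the integer ids and the tick-based expiry are 
assumptions made for illustration and are not taken from the Hadoop source.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy job tracker that silently drops duplicate heartbeats (hypothetical, not Hadoop's API). */
class IgnoringJobTracker {
    private final Map<String, Integer> lastSeenId = new HashMap<>();
    private final Map<String, Integer> lastContactTick = new HashMap<>();

    /** Returns the id to use next, or null when the heartbeat is ignored as a duplicate. */
    Integer heartbeat(String tracker, int id, int tick) {
        Integer last = lastSeenId.get(tracker);
        if (last != null && id <= last) {
            return null; // duplicate: ignored, so the liveness record is not refreshed
        }
        lastSeenId.put(tracker, id);
        lastContactTick.put(tracker, tick);
        return id + 1;
    }

    /** A tracker counts as lost when no valid heartbeat arrived within expiryTicks. */
    boolean isLost(String tracker, int now, int expiryTicks) {
        Integer t = lastContactTick.get(tracker);
        return t == null || now - t > expiryTicks;
    }
}

class LostHeartbeatDemo {
    /** Drops the response sent at lostAtTick (use -1 for none); reports whether the tracker ends up lost. */
    static boolean run(int lostAtTick, int totalTicks, int expiryTicks) {
        IgnoringJobTracker jt = new IgnoringJobTracker();
        int id = 0;
        for (int tick = 0; tick < totalTicks; tick++) {
            Integer next = jt.heartbeat("tt1", id, tick);
            boolean responseLost = (tick == lostAtTick);
            if (next != null && !responseLost) {
                id = next; // advance only when a response was actually received
            }
            // otherwise the same id is resent on the next tick
        }
        return jt.isLost("tt1", totalTicks, expiryTicks);
    }

    public static void main(String[] args) {
        System.out.println("one dropped response -> lost: " + run(3, 20, 5));
        System.out.println("no dropped response  -> lost: " + run(-1, 20, 5));
    }
}
```

In this model a single dropped response is enough: every later heartbeat 
carries the stale id, is ignored, and the tracker eventually expires.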

Looking at Hadoop trunk/ I believe this problem may still occur: in 
heartbeat() at JobTracker.java:992, JobTracker should not ignore duplicate 
messages but acknowledge them without processing. This would cause the task 
tracker to sync its last heartbeat id back with the last heartbeat id 
remembered in the job tracker.
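
A minimal sketch of the suggested behaviour, using the same kind of toy 
model rather than the real JobTracker code (all names here are hypothetical, 
and the counter merely stands in for real status processing):

```java
import java.util.HashMap;
import java.util.Map;

/** Toy job tracker that acknowledges duplicates without reprocessing them (hypothetical names). */
class AcknowledgingJobTracker {
    private final Map<String, Integer> lastSeenId = new HashMap<>();
    private int processedCount = 0;

    /** Returns the id the task tracker should use for its next heartbeat. */
    int heartbeat(String tracker, int id) {
        Integer last = lastSeenId.get(tracker);
        if (last != null && id <= last) {
            // Duplicate: skip processing, but still answer so the sender can resync.
            return last + 1;
        }
        processedCount++; // placeholder for handling the reported task status
        lastSeenId.put(tracker, id);
        return id + 1;
    }

    int processedCount() {
        return processedCount;
    }
}

class AcknowledgeDuplicateDemo {
    public static void main(String[] args) {
        AcknowledgingJobTracker jt = new AcknowledgingJobTracker();
        int id = 0;
        id = jt.heartbeat("tt1", id); // normal exchange
        jt.heartbeat("tt1", id);      // response lost: the task tracker keeps its old id
        id = jt.heartbeat("tt1", id); // resend is acknowledged, ids are back in sync
        id = jt.heartbeat("tt1", id); // next heartbeat is processed normally
        System.out.println("next id = " + id + ", processed = " + jt.processedCount());
    }
}
```

The resent heartbeat is answered but not counted as processed, so the task 
tracker picks up the id the job tracker expects and continues normally.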

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
