[ http://issues.apache.org/jira/browse/HADOOP-506?page=comments#action_12440217 ]
Sanjay Dahiya commented on HADOOP-506:
--------------------------------------
One case in which I can reproduce this repeatedly is when the job tracker
restarts while the task trackers are still running. Basically, UNKNOWN_TASKTRACKER
status messages are not handled properly in the job tracker.
Here is what happens in that case:
  synchronized (taskTrackers) {
    synchronized (trackerExpiryQueue) {
      boolean seenBefore = updateTaskTrackerStatus(trackerName,
                                                   trackerStatus);
      if (initialContact) {   <<<<==== This is false
        // If it's first contact, then clear out any state hanging around
        if (seenBefore) {
          lostTaskTracker(trackerName, trackerStatus.getHost());
        }
      } else {
        // If not first contact, there should be some record of the tracker
        if (!seenBefore) {
          return InterTrackerProtocol.UNKNOWN_TASKTRACKER;
            <<<<=== returns this, but TT already in taskTrackers and not in expiryQueue
        }
      }
      if (initialContact) {
        trackerExpiryQueue.add(trackerStatus);   <<<<==== not called
      }
    }
  }
In updateTaskTrackerStatus, if (oldStatus == null && initialContact == false),
then it is a rogue status and should not be added to the taskTrackers map in the
job tracker.
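To make the failure mode concrete, here is a minimal, self-contained sketch. It is not the actual JobTracker code: the class TrackerRegistrySketch, its methods, the rejectRogueStatus flag, and the timestamps are all hypothetical and exist only for illustration. It models the two structures from the snippet above (a tracker map and an expiry queue) and compares the current behaviour with the guard proposed here.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;

/** Hypothetical, simplified model of the job tracker's heartbeat bookkeeping. */
class TrackerRegistrySketch {
    static final int TRACKERS_OK = 0;
    static final int UNKNOWN_TASKTRACKER = 1;

    // tracker name -> time of last heartbeat (ms); stands in for the taskTrackers map
    private final Map<String, Long> taskTrackers = new HashMap<>();
    // only trackers in this queue are ever considered for expiry
    private final Queue<String> trackerExpiryQueue = new ArrayDeque<>();
    // true = apply the proposed guard, false = current behaviour
    private final boolean rejectRogueStatus;

    TrackerRegistrySketch(boolean rejectRogueStatus) {
        this.rejectRogueStatus = rejectRogueStatus;
    }

    /** Records the status and returns true if the tracker was already known. */
    private boolean updateTaskTrackerStatus(String name, long now, boolean initialContact) {
        boolean seenBefore = taskTrackers.containsKey(name);
        if (rejectRogueStatus && !seenBefore && !initialContact) {
            return false; // proposed guard: a rogue status is not recorded
        }
        taskTrackers.put(name, now);
        return seenBefore;
    }

    synchronized int heartbeat(String name, boolean initialContact, long now) {
        boolean seenBefore = updateTaskTrackerStatus(name, now, initialContact);
        if (initialContact) {
            if (seenBefore) {
                trackerExpiryQueue.remove(name); // clear out stale state
            }
            trackerExpiryQueue.add(name);
        } else if (!seenBefore) {
            // Without the guard, the tracker is now in the map but not in the queue.
            return UNKNOWN_TASKTRACKER;
        }
        return TRACKERS_OK;
    }

    /** Drops queued trackers whose last heartbeat is older than timeoutMs. */
    synchronized int expireTrackers(long now, long timeoutMs) {
        int expired = 0;
        Iterator<String> it = trackerExpiryQueue.iterator();
        while (it.hasNext()) {
            String name = it.next();
            Long last = taskTrackers.get(name);
            if (last != null && now - last > timeoutMs) {
                it.remove();
                taskTrackers.remove(name);
                expired++;
            }
        }
        return expired;
    }

    synchronized int knownTrackers() {
        return taskTrackers.size();
    }

    public static void main(String[] args) {
        for (boolean guard : new boolean[] {false, true}) {
            // A fresh registry stands in for a restarted job tracker.
            TrackerRegistrySketch jt = new TrackerRegistrySketch(guard);
            int response = jt.heartbeat("tracker_a", false, 0L);
            jt.expireTrackers(3_600_000L, 600_000L); // one hour later, 10 minute timeout
            System.out.println("guard=" + guard + " response=" + response
                + " stillListed=" + jt.knownTrackers());
        }
    }
}
```

In this model the unguarded registry still lists tracker_a after the expiry pass, because the entry was never queued for expiry, while the guarded one lists nothing. Both return UNKNOWN_TASKTRACKER to the caller.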
I am investigating whether this can happen under other conditions as well.
> job tracker hangs on to dead task trackers "forever"
> ----------------------------------------------------
>
> Key: HADOOP-506
> URL: http://issues.apache.org/jira/browse/HADOOP-506
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Reporter: Yoram Arnon
> Assigned To: Sanjay Dahiya
> Priority: Minor
>
> I see cases where a task tracker gets disconnected from the job tracker and
> reconnects, and then appears twice in the job tracker's list, with one
> instance being alive and well, and the other's 'time since last heartbeat'
> increasing monotonically.
> That all makes sense.
> What doesn't make sense is that the old instances never expire. It's been
> over 400000 seconds since the last heartbeat. And the cluster reports having
> more nodes up and running than its size (350 nodes in a 320 node cluster).
> There should be some reasonable timeout for these expired task trackers,
> somewhere between 10 minutes and an hour.