    [ http://issues.apache.org/jira/browse/HADOOP-506?page=comments#action_12440217 ]

Sanjay Dahiya commented on HADOOP-506:
--------------------------------------

One case in which I can reproduce this repeatedly is when the job tracker restarts while task trackers are still running. Basically, UNKNOWN_TASKTRACKER status messages are not handled properly in the job tracker. Here is what happens in that case:

    synchronized (taskTrackers) {
      synchronized (trackerExpiryQueue) {
        boolean seenBefore = updateTaskTrackerStatus(trackerName, trackerStatus);
        if (initialContact) {      <<<<==== This is false
          // If it's first contact, then clear out any state hanging around
          if (seenBefore) {
            lostTaskTracker(trackerName, trackerStatus.getHost());
          }
        } else {
          // If not first contact, there should be some record of the tracker
          if (!seenBefore) {
            return InterTrackerProtocol.UNKNOWN_TASKTRACKER;      <<<<=== returns this, but TT already in tasktrackers and not in expiryQueue
          }
        }

        if (initialContact) {
          trackerExpiryQueue.add(trackerStatus);      <<<<==== not called
        }
      }
    }

In updateTaskTrackerStatus, if (oldStatus == null && initialContact == false), then it is a rogue status and should not be added to the taskTrackers map in the job tracker. As it stands, the tracker is recorded in the map but never enters the expiry queue, so it can never expire.
I am investigating whether this can happen under other conditions as well.

> job tracker hangs on to dead task trackers "forever"
> ----------------------------------------------------
>
>                 Key: HADOOP-506
>                 URL: http://issues.apache.org/jira/browse/HADOOP-506
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Yoram Arnon
>         Assigned To: Sanjay Dahiya
>            Priority: Minor
>
> I see cases where a task tracker gets disconnected from the job tracker and
> reconnects, and then appears twice in the job tracker's list, with one
> instance being alive and well, and the other's 'time since last heartbeat'
> increasing monotonically.
> That all makes sense.
> What doesn't make sense is that the old instances never expire. It's been
> over 400000 seconds since the last heartbeat. And the cluster reports having
> more nodes up and running than its size (350 nodes in a 320 node cluster).
> There should be some reasonable timeout for these expired task trackers,
> somewhere between 10 minutes and an hour.
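
For illustration only, the fix suggested in the comment above (do not record a status when there is no old status and it is not the initial contact) could look roughly like the sketch below. This is not the actual JobTracker code: the class HeartbeatSketch, the String-based status, the TRACKERS_OK constant, and the extra initialContact parameter on updateTaskTrackerStatus are all assumptions made to keep the example self-contained.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical, heavily simplified stand-in for the job tracker's heartbeat handling.
class HeartbeatSketch {
    static final int TRACKERS_OK = 0;          // assumed name for the "all fine" response
    static final int UNKNOWN_TASKTRACKER = 1;  // mirrors InterTrackerProtocol.UNKNOWN_TASKTRACKER

    private final Map<String, String> taskTrackers = new HashMap<>();
    private final Queue<String> trackerExpiryQueue = new ArrayDeque<>();

    // Returns true if the tracker was already known. A status from an unknown
    // tracker that is not an initial contact is treated as rogue and NOT stored.
    private boolean updateTaskTrackerStatus(String name, String status, boolean initialContact) {
        boolean seenBefore = taskTrackers.containsKey(name);
        if (!seenBefore && !initialContact) {
            return false; // rogue status: leave the map untouched
        }
        taskTrackers.put(name, status);
        return seenBefore;
    }

    synchronized int heartbeat(String name, String status, boolean initialContact) {
        boolean seenBefore = updateTaskTrackerStatus(name, status, initialContact);
        if (initialContact) {
            if (seenBefore) {
                // Simplified stand-in for lostTaskTracker(): drop the stale queue entry.
                trackerExpiryQueue.remove(name);
            }
            trackerExpiryQueue.add(name);
            return TRACKERS_OK;
        }
        // Not first contact: there should already be a record of the tracker.
        return seenBefore ? TRACKERS_OK : UNKNOWN_TASKTRACKER;
    }

    synchronized int trackerCount() { return taskTrackers.size(); }

    synchronized int expiryQueueSize() { return trackerExpiryQueue.size(); }

    public static void main(String[] args) {
        HeartbeatSketch jt = new HeartbeatSketch();
        // Simulates a restarted job tracker receiving a non-initial heartbeat.
        System.out.println("response: " + jt.heartbeat("tt1", "alive", false));
        System.out.println("trackers recorded: " + jt.trackerCount());
    }
}
```

With this guard, a tracker that is told UNKNOWN_TASKTRACKER leaves no entry behind, so every entry in the map has a matching entry in the expiry queue and can time out normally.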