[ 
http://issues.apache.org/jira/browse/HADOOP-506?page=comments#action_12440217 ] 
            
Sanjay Dahiya commented on HADOOP-506:
--------------------------------------

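One case in which I can reproduce this reliably is when the job tracker 
restarts while task trackers are still running. Basically, 
UNKNOWN_TASKTRACKER status messages are not handled properly in the job 
tracker: the heartbeat from an already-running task tracker is not an 
initial contact, yet the restarted job tracker has no record of it.

Here is what happens in that case: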

synchronized (taskTrackers) {
    synchronized (trackerExpiryQueue) {
        boolean seenBefore = updateTaskTrackerStatus(trackerName,
                                                     trackerStatus);
        if (initialContact) {    // <<<<==== this is false
            // If it's first contact, then clear out any state hanging around
            if (seenBefore) {
                lostTaskTracker(trackerName, trackerStatus.getHost());
            }
        } else {
            // If not first contact, there should be some record of the tracker
            if (!seenBefore) {
                // <<<<==== returns this, but the TT is already in
                // taskTrackers and not in trackerExpiryQueue
                return InterTrackerProtocol.UNKNOWN_TASKTRACKER;
            }
        }

        if (initialContact) {
            trackerExpiryQueue.add(trackerStatus);    // <<<<==== not called
        }
    }
}


In updateTaskTrackerStatus, if (oldStatus == null && initialContact == false) 
then the status is a rogue one and should not be added to the taskTrackers 
map in the job tracker. 
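
The following is a minimal, self-contained sketch of that guard, not the actual JobTracker code. The class TrackerRegistrySketch, the rejectRogueStatus flag, the unexpirableTrackers helper, the String-based status, and the extra initialContact parameter on updateTaskTrackerStatus are all assumptions made for illustration. It models only the two collections discussed above and compares the behaviour with and without the check.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of the job tracker's heartbeat bookkeeping. */
class TrackerRegistrySketch {
    static final int OK = 0;
    // Stand-in for InterTrackerProtocol.UNKNOWN_TASKTRACKER (value is illustrative).
    static final int UNKNOWN_TASKTRACKER = 1;

    // Tracker name -> latest status (a plain String here, for brevity).
    private final Map<String, String> taskTrackers = new HashMap<>();
    // Trackers that the expiry logic would eventually look at.
    private final Set<String> trackerExpiryQueue = new HashSet<>();
    // Assumption: toggles the proposed guard so both behaviours can be compared.
    private final boolean rejectRogueStatus;

    TrackerRegistrySketch(boolean rejectRogueStatus) {
        this.rejectRogueStatus = rejectRogueStatus;
    }

    /** Records the status and returns true if the tracker was already known. */
    private boolean updateTaskTrackerStatus(String name, String status,
                                            boolean initialContact) {
        boolean seenBefore = taskTrackers.containsKey(name);
        if (rejectRogueStatus && !seenBefore && !initialContact) {
            // Rogue status: unknown tracker that is not on first contact.
            return false;
        }
        taskTrackers.put(name, status);
        return seenBefore;
    }

    synchronized int heartbeat(String name, String status, boolean initialContact) {
        boolean seenBefore = updateTaskTrackerStatus(name, status, initialContact);
        if (initialContact) {
            // The real code would also clear stale state here if seenBefore is true.
            trackerExpiryQueue.add(name);
        } else if (!seenBefore) {
            return UNKNOWN_TASKTRACKER;
        }
        return OK;
    }

    /** Counts trackers that are in the map but can never be expired. */
    synchronized int unexpirableTrackers() {
        int count = 0;
        for (String name : taskTrackers.keySet()) {
            if (!trackerExpiryQueue.contains(name)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Simulate a freshly restarted job tracker receiving a non-initial heartbeat.
        TrackerRegistrySketch unguarded = new TrackerRegistrySketch(false);
        TrackerRegistrySketch guarded = new TrackerRegistrySketch(true);
        unguarded.heartbeat("tt1", "RUNNING", false);
        guarded.heartbeat("tt1", "RUNNING", false);
        System.out.println("unguarded leaked entries: " + unguarded.unexpirableTrackers());
        System.out.println("guarded leaked entries:   " + guarded.unexpirableTrackers());
    }
}
```

In this model, the unguarded variant leaves the tracker in the map without an expiry-queue entry, which matches the symptom in the report. The guarded variant records nothing until the tracker comes back with an initial contact.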

I am investigating whether this can also happen under other conditions.


> job tracker hangs on to dead task trackers "forever"
> ----------------------------------------------------
>
>                 Key: HADOOP-506
>                 URL: http://issues.apache.org/jira/browse/HADOOP-506
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Yoram Arnon
>         Assigned To: Sanjay Dahiya
>            Priority: Minor
>
> I see cases where a task tracker gets disconnected from the job tracker and 
> reconnects, and then appears twice in the job tracker's list, with one 
> instance being alive and well, and the other's 'time since last heartbeat' 
> increasing monotonically.
> That all makes sense.
> What doesn't make sense is that the old instances never expire. It's been 
> over 400000 seconds since the last heartbeat, and the cluster reports having 
> more nodes up and running than its size (350 nodes in a 320 node cluster).
> There should be some reasonable timeout for these expired task trackers, 
> somewhere between 10 minutes and an hour.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
