Re: [jira] Commented: (HADOOP-639) task cleanup messages can get lost, causing task trackers to keep tasks forever

Nigel Daley Wed, 22 Nov 2006 09:01:04 -0800

Arun, the proposal looks good. If the JT always gets a stale seqNo from the TT (because of some unrecoverable problem in the TT), will it send the saved response forever? Or should there be some maximum resends?

Also, when the JT is resending a JTResponse, can it add or change the list of actions? Or do they need to be identical?

Is it possible that a TT can get the same JTResponse more than once? If so, does the TT need to recognize this?



On Nov 22, 2006, at 8:25 AM, Arun C Murthy (JIRA) wrote:

[ http://issues.apache.org/jira/browse/HADOOP-639?page=comments#action_12451982 ]
Arun C Murthy commented on HADOOP-639:
--------------------------------------
Ok, while we continue to track the *metrics* part of TaskTrackerStatus via HADOOP-657 I propose we move forward on the 'lost messages' part over here...
Here are some thoughts (with due credits to Owen):

Define new classes:

class TaskTrackerAction implements Writable {
  byte actionId = 0;
  // ...
}

class KillJobAction extends TaskTrackerAction {
  byte actionId = 1;
  // ...
}

class KillTaskAction extends TaskTrackerAction {
  byte actionId = 2;
  // ...
}

class StartTaskAction extends TaskTrackerAction {
  byte actionId = 3;
  // ...
}
The distinction between the KillTaskAction & KillJobAction is done to fix HADOOP-737 ...
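As an illustration only, here is a minimal Java sketch of how a one-byte actionId could be used to rebuild the right action subclass after serialization. The proposal names the classes and ids but does not say how the id is used on the wire, so the framing (one id byte followed by the action's fields), the class names ending in "Sketch", the jobId/taskId payloads, and the toBytes/fromBytes helpers are all assumptions. WritableSketch is a local stand-in that mirrors the two methods of Hadoop's Writable interface so the example compiles without Hadoop.

```java
import java.io.*;

// Local stand-in with the same two methods as Hadoop's Writable interface.
interface WritableSketch {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

abstract class ActionSketch implements WritableSketch {
    // An abstract method instead of a re-declared field: in Java a subclass
    // field named actionId would hide, not override, the parent's field.
    abstract byte actionId();

    // Hypothetical framing: one id byte, then the action's own fields.
    static byte[] toBytes(ActionSketch action) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeByte(action.actionId());
            action.write(out);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the id byte first, then lets the matching subclass read its fields.
    static ActionSketch fromBytes(byte[] wire) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            byte id = in.readByte();
            ActionSketch action;
            switch (id) {
                case 1: action = new KillJobSketch(); break;
                case 2: action = new KillTaskSketch(); break;
                default: throw new IOException("unknown actionId " + id);
            }
            action.readFields(in);
            return action;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class KillJobSketch extends ActionSketch {
    String jobId = ""; // hypothetical payload
    byte actionId() { return 1; }
    public void write(DataOutput out) throws IOException { out.writeUTF(jobId); }
    public void readFields(DataInput in) throws IOException { jobId = in.readUTF(); }
}

class KillTaskSketch extends ActionSketch {
    String taskId = ""; // hypothetical payload
    byte actionId() { return 2; }
    public void write(DataOutput out) throws IOException { out.writeUTF(taskId); }
    public void readFields(DataInput in) throws IOException { taskId = in.readUTF(); }
}
```

With this framing the receiver can tell a job-level kill from a task-level kill by the first byte alone, before reading any payload.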
Another class:
class JTResponse {
  long seqNo;                  // explained below
  List<TaskTrackerAction> actions;
}

The new API replacing
  int emitHeartbeat(TaskTrackerStatus status, boolean initialContact) throws IOException;
  Task pollForNewTask(String trackerName) throws IOException;
  String[] pollForTaskWithClosedJob(String trackerName) throws IOException;
is:
* JTResponse updateStatus(TaskTrackerStatus status, long ackNo) throws IOException; *
Details about the seqNo/ackNo:
------------------------------------
The idea is that there is a feedback (seq/ack) mechanism between the JT & TT which works as follows...
TT starts off by sending an ack of '-1' (indicates initial contact, replaces the existing 'initialContact' boolean); and at every step the JT increments the ack and sends a new JTResponse object with the incremented ack as the 'seqNo', along with the 'actions'. The JT also stores the last seq and the JTResponse object sent to each of the task-trackers. OTOH the TT also stores the last 'seq' which it received from the JT, which is what it sends out in the subsequent heartbeat as 'ack'.
How does this help? If a TT misses the heartbeat response from the JT, it sends a stale 'ack' which disagrees with the newer 'seq' on the JT; this prompts the JT to resend the 'saved' JTResponse object back to the TT... thus solving the 'lost messages' issue. If the JT never hears back from a TT for a long time, the existing ExpireTrackers.run removes the TT from its queue and also discards the saved JTResponse object for that TT.
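For illustration, a minimal Java sketch of the seq/ack exchange described above. It is not the proposed patch. It assumes a tracker name string in place of TaskTrackerStatus, plain strings in place of TaskTrackerAction objects, and hypothetical class and method names (JTResponseSketch, JobTrackerSketch, TaskTrackerSketch, queueAction, expireTracker). It resends the saved response unchanged and without any resend limit, which is only one possible choice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified response: strings stand in for TaskTrackerAction objects.
final class JTResponseSketch {
    final long seqNo;
    final List<String> actions;

    JTResponseSketch(long seqNo, List<String> actions) {
        this.seqNo = seqNo;
        this.actions = Collections.unmodifiableList(new ArrayList<>(actions));
    }
}

// JT side: remembers the last response sent to each tracker.
final class JobTrackerSketch {
    private final Map<String, JTResponseSketch> lastSent = new HashMap<>();
    private final Map<String, List<String>> pending = new HashMap<>();

    void queueAction(String tracker, String action) {
        pending.computeIfAbsent(tracker, k -> new ArrayList<>()).add(action);
    }

    JTResponseSketch updateStatus(String tracker, long ackNo) {
        JTResponseSketch saved = lastSent.get(tracker);
        // A stale ack (not initial contact, not the last seq sent) means the
        // previous response was lost, so resend the saved one unchanged.
        if (ackNo != -1 && saved != null && saved.seqNo != ackNo) {
            return saved;
        }
        // The ack matches (or this is initial contact): hand out queued actions.
        List<String> actions = pending.remove(tracker);
        if (actions == null) {
            actions = new ArrayList<>();
        }
        JTResponseSketch fresh = new JTResponseSketch(ackNo + 1, actions);
        lastSent.put(tracker, fresh);
        return fresh;
    }

    // Stand-in for what the expiry thread would do for a silent tracker.
    void expireTracker(String tracker) {
        lastSent.remove(tracker);
        pending.remove(tracker);
    }
}

// TT side: remembers the last seq it received and echoes it as the next ack.
final class TaskTrackerSketch {
    private long lastSeq = -1; // -1 signals initial contact
    final List<String> received = new ArrayList<>();

    long ackNo() { return lastSeq; }

    void onResponse(JTResponseSketch response) {
        lastSeq = response.seqNo;
        received.addAll(response.actions);
    }
}

final class HeartbeatDemo {
    public static void main(String[] args) {
        JobTrackerSketch jt = new JobTrackerSketch();
        TaskTrackerSketch tt = new TaskTrackerSketch();
        tt.onResponse(jt.updateStatus("tt1", tt.ackNo())); // initial contact, seq 0
        jt.queueAction("tt1", "KillJobAction job_1");
        jt.updateStatus("tt1", tt.ackNo());                // seq 1 is lost in transit
        jt.queueAction("tt1", "StartTaskAction task_2");
        tt.onResponse(jt.updateStatus("tt1", tt.ackNo())); // stale ack 0: seq 1 resent
        tt.onResponse(jt.updateStatus("tt1", tt.ackNo())); // ack 1: seq 2 sent
        System.out.println(tt.received); // [KillJobAction job_1, StartTaskAction task_2]
    }
}
```

In this sketch the kill action from the lost response still reaches the TT on the next heartbeat, and the action queued in the meantime is delivered one heartbeat later.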
-*-*-

Thoughts?
task cleanup messages can get lost, causing task trackers to keep tasks forever
-------------------------------------------------------------------------------
                Key: HADOOP-639
                URL: http://issues.apache.org/jira/browse/HADOOP-639
            Project: Hadoop
         Issue Type: Bug
         Components: mapred
   Affects Versions: 0.7.2
           Reporter: Owen O'Malley
        Assigned To: Owen O'Malley
            Fix For: 0.9.0
If the pollForTaskWithClosedJob call from a job tracker to a task tracker times out when a job completes, the tasks are never cleaned up. This can cause the mini m/r cluster to hang on shutdown, but also is a resource leak.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
