Progress reporting thread can afford to be slightly lenient towards exceptions
other than ConnectException
----------------------------------------------------------------------------------------------------------
Key: HADOOP-1586
URL: https://issues.apache.org/jira/browse/HADOOP-1586
Project: Hadoop
Issue Type: Bug
Components: mapred
Affects Versions: 0.14.0
Reporter: Devaraj Das
Assignee: Devaraj Das
Fix For: 0.14.0
Currently, in the loop of Task.startCommunicationThread, MAX_RETRIES (set to
three) attempts are made to report progress/ping
(TaskUmbilicalProtocol.progress or TaskUmbilicalProtocol.ping). All attempt
failures are counted as critical. Here I am proposing a variant: treat only
ConnectException as critical and treat all other exceptions as non-critical.
In the case of these two RPCs, the other exception could be
SocketTimeoutException.
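
A minimal sketch of the proposed policy, not the actual Task code. The class
LenientReporter, the Rpc interface, and the run and isCritical helpers are
hypothetical names standing in for the progress/ping call, and resetting the
retry budget after a successful call is an assumption:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

public class LenientReporter {
    /** Hypothetical stand-in for one progress/ping RPC. */
    public interface Rpc {
        void call() throws IOException;
    }

    public static final int MAX_RETRIES = 3;

    /** Only a failure to connect counts against the retry budget. */
    public static boolean isCritical(IOException e) {
        return e instanceof ConnectException;
    }

    /**
     * Makes the given number of reporting attempts. Returns false as soon as
     * MAX_RETRIES critical failures occur without an intervening success.
     */
    public static boolean run(Rpc rpc, int attempts) {
        int remaining = MAX_RETRIES;
        for (int i = 0; i < attempts; i++) {
            try {
                rpc.call();
                remaining = MAX_RETRIES; // assumption: a success resets the budget
            } catch (IOException e) {
                if (isCritical(e) && --remaining == 0) {
                    return false; // budget exhausted, the task would exit here
                }
                // Non-critical (e.g. SocketTimeoutException): try again next round.
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Timeouts alone never exhaust the budget.
        System.out.println(run(() -> { throw new SocketTimeoutException("timed out"); }, 10));
        // Repeated connect failures do.
        System.out.println(run(() -> { throw new ConnectException("refused"); }, 10));
    }
}
```

With this policy a run of timeouts is tolerated indefinitely, while three
consecutive connect failures still end the loop.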
The reason I am proposing this is that since HADOOP-1462 went in, I have been
seeing quite a few unexpected task deaths (exit code 65), and with some logging
it appears that most of them are due to a SocketTimeoutException in the
progress RPC call (before HADOOP-1462, the return value of progress was not
checked). When the hack described above was put in, things improved
considerably.
One argument against the above proposal is that the tasktracker could be
faulty if a task is not able to successfully invoke an RPC on it even though
it is able to connect. If this is indeed the case, then even in the current
scheme of things the only resort is to restart the tasktracker (either
manually, or because the JobTracker asks it to reinitialize). In both cases,
normal behavior of the protocol will ensure that the child task dies, since
the reinitialized tasktracker is going to return false for the progress/ping
calls.
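
The fallback described above can be sketched as follows. This is an
illustration only: ReinitSketch, Tracker, and shouldKeepRunning are
hypothetical names, the task id is made up, and modelling reinitialization as
"forget all known tasks" is an assumption about why ping would return false:

```java
import java.util.HashSet;
import java.util.Set;

public class ReinitSketch {
    /** Hypothetical tracker that only acknowledges tasks it currently knows. */
    public static class Tracker {
        private final Set<String> knownTasks = new HashSet<>();

        public void register(String taskId) {
            knownTasks.add(taskId);
        }

        /** Assumption: reinitializing drops all state about running tasks. */
        public void reinitialize() {
            knownTasks.clear();
        }

        public boolean ping(String taskId) {
            return knownTasks.contains(taskId);
        }
    }

    /** Child side: keep running only while the tracker acknowledges the task. */
    public static boolean shouldKeepRunning(Tracker tracker, String taskId) {
        return tracker.ping(taskId);
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker();
        String taskId = "task_example_0"; // made-up id for illustration
        tracker.register(taskId);
        System.out.println(shouldKeepRunning(tracker, taskId));
        tracker.reinitialize();
        System.out.println(shouldKeepRunning(tracker, taskId));
    }
}
```

So even if timeouts are tolerated, a child whose tracker has been
reinitialized still stops once a call gets through and returns false.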