[
https://issues.apache.org/jira/browse/HADOOP-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Devaraj Das resolved HADOOP-1586.
---------------------------------
Resolution: Won't Fix
This issue is handled better in the related issue, HADOOP-1651.
> Progress reporting thread can afford to be slightly lenient towards
> exceptions other than ConnectException
> ----------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-1586
> URL: https://issues.apache.org/jira/browse/HADOOP-1586
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.14.0
> Reporter: Devaraj Das
> Assignee: Devaraj Das
>
> Currently, in the loop of Task.startCommunicationThread, MAX_RETRIES (set to
> three) attempts are made to report progress/ping
> (TaskUmbilicalProtocol.progress or TaskUmbilicalProtocol.ping). All attempt
> failures are counted as critical. Here I am proposing a variant: treat only
> ConnectException as critical and treat all other exceptions as
> non-critical. For these two RPCs, the other exception could be
> SocketTimeoutException.
> The reason I am proposing this is that since HADOOP-1462 went in, I have
> been seeing quite a few unexpected task deaths with exit code 65, and with
> some logging it appears that most of them are caused by a
> SocketTimeoutException in the progress RPC call (before HADOOP-1462, the
> return value of progress was not checked). When the hack described above
> was put in, things improved considerably.
> One argument against the above proposal is that the tasktracker could be
> faulty when a task is unable to successfully invoke an RPC on it even
> though it is able to connect. If this is indeed the case, even in the
> current scheme of things, the only resort is to restart the tasktracker
> (either manually, or by the JobTracker asking it to reinitialize), and in
> both cases the normal behavior of the protocol will ensure that the child
> task dies (since the reinitialized tasktracker will return false for the
> progress/ping calls).
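
For illustration only, the retry policy proposed in the description above
could be sketched roughly as follows. This is not the actual Hadoop code:
the class ProgressRetrySketch, the ProgressCall interface, and the
report/isCritical methods are hypothetical names, and resetting the retry
counter after a successful call is an assumption. Only MAX_RETRIES = 3,
ConnectException, and SocketTimeoutException come from the issue text.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

class ProgressRetrySketch {
    // The issue text says MAX_RETRIES is set to three.
    static final int MAX_RETRIES = 3;

    /** Hypothetical stand-in for a progress/ping RPC to the tasktracker. */
    interface ProgressCall {
        boolean call() throws IOException;
    }

    /** Proposed policy: only a ConnectException counts against the retries. */
    static boolean isCritical(IOException e) {
        return e instanceof ConnectException;
    }

    /**
     * Runs the reporting loop for a fixed number of rounds.
     * Returns false if the task should die (retries exhausted, or the
     * tracker answered false), and true if it may keep running.
     */
    static boolean report(ProgressCall rpc, int rounds) {
        int remaining = MAX_RETRIES;
        for (int i = 0; i < rounds; i++) {
            try {
                if (!rpc.call()) {
                    return false; // tracker no longer knows this task
                }
                remaining = MAX_RETRIES; // assumption: reset on success
            } catch (IOException e) {
                // Non-critical failures (e.g. SocketTimeoutException) are
                // tolerated and do not use up a retry.
                if (isCritical(e) && --remaining == 0) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ProgressCall timesOut = () -> { throw new SocketTimeoutException("simulated"); };
        ProgressCall refused = () -> { throw new ConnectException("simulated"); };
        System.out.println(report(timesOut, 10)); // true: timeouts tolerated
        System.out.println(report(refused, 10));  // false: third failure is fatal
    }
}
```

Under this sketch a reinitialized tasktracker that returns false still
causes the child to stop immediately, which matches the argument made in
the last paragraph of the description.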