[
https://issues.apache.org/jira/browse/HADOOP-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648138#action_12648138
]
Steve Loughran commented on HADOOP-4659:
----------------------------------------
Raghu,
> why does Client wrap one IOException in another?
I don't know the original reason; HADOOP-3844 retained this feature and included
the hostname/port at fault, which is handy for identifying configuration
problems. The patch adds these diagnostics only to ConnectExceptions and passes
the rest up.
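To illustrate the idea (this is not the attached patch), here is a minimal
sketch of adding the address to a ConnectException while keeping its type, so
that callers catching ConnectException still see one. The class name, the
helper name wrapWithAddress, and the message text are all hypothetical.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;

// Hypothetical sketch, not the code in hadoop-4659.patch.
class WrapSketch {
    // Adds the target address to connection failures without changing
    // their type; every other IOException is passed up as-is.
    static IOException wrapWithAddress(InetSocketAddress addr, IOException e) {
        if (e instanceof ConnectException) {
            ConnectException wrapped = new ConnectException(
                "Call to " + addr + " failed on connection exception: " + e);
            wrapped.initCause(e); // keep the original exception as the cause
            return wrapped;
        }
        return e;
    }
}
```

The point of returning a ConnectException rather than a plain IOException is
that a catch clause for ConnectException further up the stack keeps matching.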
> is this a vanilla 0.18?
I only work with SVN_HEAD; it's present there. If Hairong thinks it came in
with HADOOP-2188, then it also exists in 0.18, but that will need a different
patch.
> Also, "org.apache.hadoop.ipc.Client.call" does not actually catch exception
> from getConnection() ...
Client.call doesn't catch the exception. The problem is that RPC.waitForProxy
does: it handles ConnectException and SocketTimeoutException by logging,
sleeping, and trying again. That retry was not happening while the
ConnectException was being downgraded to a wrapping IOException, so the task
tracker was failing if it came up before the job tracker, rather than waiting
quietly for the job tracker to come up. As a result there is a race condition
in cluster startup, and the cluster is more brittle.
Here's where the exceptions get picked up in RPC.java:
  public static VersionedProtocol waitForProxy(Class protocol,
                                               long clientVersion,
                                               InetSocketAddress addr,
                                               Configuration conf
                                               ) throws IOException {
    while (true) {
      try {
        return getProxy(protocol, clientVersion, addr, conf);
      } catch(ConnectException se) {  // namenode has not been started
        LOG.info("Server at " + addr + " not available yet, Zzzzz...");
      } catch(SocketTimeoutException te) {  // namenode is busy
        LOG.info("Problem connecting to server: " + addr);
      }
      try {
        Thread.sleep(1000);
      } catch (InterruptedException ie) {
        // IGNORE
      }
    }
  }
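To make the failure mode concrete, here is a small self-contained sketch (not
Hadoop code) of the same pattern. The names RetrySketch, Connector, and
connectWithRetry are hypothetical, and the sleep and the timeout branch are
left out to keep it short. A catch clause for ConnectException matches on the
thrown type only, so a plain IOException that merely has a ConnectException
as its cause skips the retry and propagates on the first attempt.

```java
import java.io.IOException;
import java.net.ConnectException;

// Hypothetical stand-in for the retry loop in RPC.waitForProxy.
class RetrySketch {
    interface Connector {
        void connect() throws IOException;
    }

    // Retries only when a ConnectException is thrown directly; returns
    // the number of attempts that were needed to connect.
    static int connectWithRetry(Connector c, int maxAttempts) throws IOException {
        for (int attempt = 1; ; attempt++) {
            try {
                c.connect();
                return attempt;
            } catch (ConnectException ce) {
                // Server not up yet: try again unless attempts are exhausted.
                if (attempt >= maxAttempts) {
                    throw ce;
                }
            }
            // Any other IOException, including one that wraps a
            // ConnectException, is not caught here and ends the loop.
        }
    }
}
```

With a connector that throws ConnectException directly, the loop keeps trying
until the server answers. With one that throws
new IOException("...", new ConnectException("...")), the caller gets the
IOException straight away, which is the behaviour described above.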
> Root cause of connection failure is being lost to code that uses it for
> delaying startup
> ----------------------------------------------------------------------------------------
>
> Key: HADOOP-4659
> URL: https://issues.apache.org/jira/browse/HADOOP-4659
> Project: Hadoop Core
> Issue Type: Bug
> Components: ipc
> Affects Versions: 0.18.3
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Fix For: 0.18.3
>
> Attachments: hadoop-4659.patch
>
>
> In ipc.Client, the root cause of a connection failure is being lost as the
> exception is wrapped, so the outside code that looks for that root cause
> isn't working as expected. The result is that you can't bring up a task
> tracker before the job tracker, and probably the same for a datanode before
> a namenode. The change that triggered this has not yet been located; I had
> thought it was HADOOP-3844, but I no longer believe this is the case.