[
https://issues.apache.org/jira/browse/HADOOP-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875431#action_12875431
]
Todd Lipcon commented on HADOOP-6762:
-------------------------------------
bq. re: timeout, so if a server disappeared, the ping would fail and the RPC
would fail that way? if that's the case, then I think removing the timeout on
the Future.get() is fine.
Yep, that should be the case. Of course, a server can stay up but be
unresponsive (e.g., deadlocked). In those cases, while it's annoying that
clients get blocked forever, I don't think we could switch to timeout-based
behavior at this point without worrying that it would break lots and lots of
downstream users :(
bq. We have seen one case of distributed deadlock here on the IPC workers in
the DN, so this isn't 100% theory
Yep, I've seen internode deadlocks several times as well. Not pretty! However,
I can't think of a situation where this could happen here -- the only thing
that can block one of these sendParam calls is TCP backpressure on the socket,
and that only happens when the network is stalled. I don't see a case where
allowing other threads to start sending would have unstalled a prior sender.
We could actually enforce a limit of at most one sending thread per connection
by synchronizing on Connection.this.out *outside* the submission of the
runnable. That way we know there's only one send going on at a time, and we're
using the thread purely to avoid interruption and nothing else.
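
A rough sketch of that idea, under some assumptions: SendSketch, sendLock and
the single-thread executor are hypothetical names for illustration, not the
attached patch. It takes a dedicated lock rather than the stream itself, since
holding the stream's monitor in the caller would deadlock against any stream
whose write methods are themselves synchronized. The caller waits on
Future.get() with no timeout and only re-asserts its interrupt flag once the
write has finished.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the RPC client's Connection; not the actual patch.
class SendSketch {
    private final OutputStream out;
    // Assumption: a dedicated lock instead of the stream's own monitor.
    private final Object sendLock = new Object();
    // Daemon helper thread, so an idle sender never keeps the JVM alive.
    private final ExecutorService sender = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "rpc-sender-sketch");
        t.setDaemon(true);
        return t;
    });

    SendSketch(OutputStream out) {
        this.out = out;
    }

    void sendParam(byte[] data) throws IOException {
        boolean interrupted = false;
        try {
            // Lock taken *outside* the submission: one send at a time.
            synchronized (sendLock) {
                Callable<Void> write = () -> {
                    out.write(data);
                    out.flush();
                    return null;
                };
                Future<Void> pending = sender.submit(write);
                while (true) {
                    try {
                        pending.get(); // no timeout
                        break;
                    } catch (InterruptedException e) {
                        // The I/O runs on the helper thread, so the caller's
                        // interrupt never reaches the stream. Keep waiting.
                        interrupted = true;
                    } catch (ExecutionException e) {
                        Throwable cause = e.getCause();
                        if (cause instanceof IOException) {
                            throw (IOException) cause;
                        }
                        throw new IllegalStateException(cause);
                    }
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt(); // restore for the caller
            }
        }
    }
}
```

With this shape, a caller that is interrupted mid-send still sees its interrupt
flag afterwards, but the bytes go out intact and in order.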
> exception while doing RPC I/O closes channel
> --------------------------------------------
>
> Key: HADOOP-6762
> URL: https://issues.apache.org/jira/browse/HADOOP-6762
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 0.20.2
> Reporter: sam rash
> Assignee: sam rash
> Attachments: hadoop-6762-1.txt, hadoop-6762-2.txt, hadoop-6762-3.txt,
> hadoop-6762-4.txt, hadoop-6762-6.txt
>
>
> If a single process creates two unique fileSystems to the same NN using
> FileSystem.newInstance(), and one of them issues a close(), the leasechecker
> thread is interrupted. This interrupt races with the rpc namenode.renew()
> and can cause a ClosedByInterruptException. This closes the underlying
> channel, and the other filesystem, which shares the connection, will get errors.
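
For reference, the JDK behavior behind the issue description can be reproduced
in isolation. The sketch below is an illustration only: it uses a
java.nio.channels.Pipe instead of the RPC socket, and the class and method
names are hypothetical. A thread whose interrupt flag is set attempts a
blocking write on an interruptible channel; per the InterruptibleChannel
contract the channel is closed and the thread receives a
ClosedByInterruptException, so any other user of that channel would see it
closed as well.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.Pipe;

// Hypothetical demo class; a Pipe stands in for the shared RPC connection.
class InterruptClosesChannel {
    // Returns whether the shared channel is still open after an interrupted write.
    static boolean channelOpenAfterInterruptedWrite() throws IOException {
        Pipe pipe = Pipe.open();
        Pipe.SinkChannel shared = pipe.sink();
        // Simulate the interrupt arriving just before the I/O call.
        Thread.currentThread().interrupt();
        try {
            shared.write(ByteBuffer.wrap(new byte[] {1}));
        } catch (ClosedByInterruptException expected) {
            // Documented outcome: the channel has been closed underneath us.
        } finally {
            Thread.interrupted(); // clear the flag so cleanup is not affected
            pipe.source().close();
        }
        return shared.isOpen();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("still open: " + channelOpenAfterInterruptedWrite());
    }
}
```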