[
https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929589#action_12929589
]
Vishal K commented on ZOOKEEPER-914:
------------------------------------
Hi Flavio,
You are right. Sorry, my comment was not fair.
Regarding SO_TIMEOUT: Per my understanding, SO_TIMEOUT works only when a
channel is set in non-blocking mode using configureBlocking(false). If the channel
is not configured to work in non-blocking mode, setting SO_TIMEOUT has no
effect. Please let me know if you think there is a way to set timeout on the
socket after accepting the connection (without configuring the channel in
non-blocking mode). The only way I know to use SO_TIMEOUT is by using
channel.configureBlocking(false). The current code in QuorumCnxManager
assumes use of blocking IO. With non-blocking IO we would have to handle
partial reads/writes. Please
refer to my earlier question regarding SO_TIMEOUT for implementing non-blocking
IO.
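For reference on the question above, here is a minimal sketch of a read timeout
on an accepted connection without switching to non-blocking mode. It assumes the
connection is handled as a plain java.net.Socket and read through its
InputStream, which is where Socket.setSoTimeout() is documented to apply; it
does not go through SocketChannel.read(). The class name, the 8-byte
readLong() and the loopback setup are illustrative assumptions, not code from
QuorumCnxManager.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of ZooKeeper.
final class SoTimeoutSketch {

    /**
     * Accepts a connection from a peer that never sends anything and tries a
     * blocking read with SO_TIMEOUT set. Returns true if the read gave up
     * with SocketTimeoutException instead of blocking forever.
     */
    static boolean readTimesOut(int timeoutMs) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             // The peer connects but stays silent, like a stalled remote server.
             Socket silentPeer = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {

            // Set the timeout after accept(); the socket stays in blocking mode.
            accepted.setSoTimeout(timeoutMs);

            DataInputStream in = new DataInputStream(accepted.getInputStream());
            in.readLong(); // illustrative 8-byte read; blocks until data or timeout
            return false;  // only reached if the peer unexpectedly sent data
        } catch (SocketTimeoutException e) {
            return true;   // the blocking read was bounded by SO_TIMEOUT
        } catch (IOException e) {
            return false;  // some other I/O failure
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

Whether this fits QuorumCnxManager depends on how much of the code relies on
the SocketChannel API, so treat it as an option to evaluate rather than a patch.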
I thought this fix was supposed to go in for 3.3.3. As I suggested earlier, one
quick fix to the problem is to use TimerTask(). Before doing blocking IO we can
start a timer for that channel (in receiveConnection() before read). Once the
timer expires, check if the read() has finished. If not, interrupt and close
the channel. I think having such a fix (or some other fix that will get around
the problem) until the real fix is in is a better approach. Let me know what
you think.
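The timer idea can be sketched roughly as follows. This is only an
illustration under assumptions: the class and method names
(ReadWatchdogSketch, readWithDeadline) and the 200 ms value are made up for
the example, and the sketch closes the channel from the timer thread instead of
interrupting the reader, since closing a SocketChannel from another thread
makes a blocked read() fail with AsynchronousCloseException.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical helper, not part of ZooKeeper.
final class ReadWatchdogSketch {

    // One shared daemon timer so the watchdog never keeps the JVM alive.
    private static final Timer TIMER = new Timer("read-watchdog", true);

    /**
     * Blocking read guarded by a timer. If the read has not returned within
     * timeoutMs, the timer task closes the channel, which unblocks the reader
     * with an AsynchronousCloseException (a ClosedChannelException).
     */
    static int readWithDeadline(final SocketChannel channel, ByteBuffer buf, long timeoutMs)
            throws IOException {
        TimerTask watchdog = new TimerTask() {
            @Override
            public void run() {
                try {
                    channel.close();
                } catch (IOException ignored) {
                    // Nothing useful to do if close itself fails.
                }
            }
        };
        TIMER.schedule(watchdog, timeoutMs);
        try {
            return channel.read(buf);
        } finally {
            // Read finished (or failed): make sure the watchdog does not fire later.
            watchdog.cancel();
        }
    }

    /** Returns true if a read from a silent peer is aborted by the watchdog. */
    static boolean demo() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel silentPeer = SocketChannel.open(server.getLocalAddress());
                 SocketChannel accepted = server.accept()) {
                readWithDeadline(accepted, ByteBuffer.allocate(8), 200);
                return false; // the peer never writes, so this should not be reached
            } catch (ClosedChannelException e) {
                return true;  // watchdog closed the channel while read() was blocked
            }
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("read aborted by watchdog: " + demo());
    }
}
```

One caveat of this approach: if the read completes at the same moment the timer
fires, the channel can still end up closed, so the caller has to be prepared to
treat the connection as lost and reconnect.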
If we decide to go with one of the quick fixes, then we can use this JIRA for
that and use ZOOKEEPER-900 for the real fix. Otherwise, as you suggested, we can
close this JIRA and use ZOOKEEPER-900.
-Vishal
> QuorumCnxManager blocks forever
> --------------------------------
>
> Key: ZOOKEEPER-914
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
> Project: Zookeeper
> Issue Type: Bug
> Components: leaderElection
> Reporter: Vishal K
> Assignee: Vishal K
> Priority: Blocker
> Fix For: 3.3.3, 3.4.0
>
>
> This was a disaster. While testing our application we ran into a scenario
> where a rebooted follower could not join the cluster. Further debugging
> showed that the follower could not join because the QuorumCnxManager on the
> leader was blocked for an indefinite amount of time in receiveConnection():
> "Thread-3" prio=10 tid=0x00007fa920005800 nid=0x11bb runnable [0x00007fa9275ed000]
> java.lang.Thread.State: RUNNABLE
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
> at sun.nio.ch.IOUtil.read(IOUtil.java:206)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
> - locked <0x00007fa93315f988> (a java.lang.Object)
> at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
> at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
> I had pointed out this bug along with several other problems in
> QuorumCnxManager earlier in
> https://issues.apache.org/jira/browse/ZOOKEEPER-900 and
> https://issues.apache.org/jira/browse/ZOOKEEPER-822.
> I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix
> and a patch will be out soon.
> The problem is that QuorumCnxManager is using SocketChannel in blocking mode.
> It does a read() in receiveConnection() and a write() in initiateConnection().
> Sorry, but this is really bad programming. It also points to a lack of
> failure tests for QuorumCnxManager.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.