QuorumCnxManager blocks forever
--------------------------------
Key: ZOOKEEPER-914
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
Project: Zookeeper
Issue Type: Bug
Reporter: Vishal K
Assignee: Vishal K
Priority: Blocker
This was a disaster. While testing our application we ran into a scenario where
a rebooted follower could not join the cluster. Further debugging showed that
the follower could not join because the QuorumCnxManager on the leader was
blocked for an indefinite amount of time in receiveConnection():
"Thread-3" prio=10 tid=0x00007fa920005800 nid=0x11bb runnable [0x00007fa9275ed000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
at sun.nio.ch.IOUtil.read(IOUtil.java:206)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
- locked <0x00007fa93315f988> (a java.lang.Object)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
I had pointed out this bug along with several other problems in
QuorumCnxManager earlier in
https://issues.apache.org/jira/browse/ZOOKEEPER-900 and
https://issues.apache.org/jira/browse/ZOOKEEPER-822.
I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix
and a patch will be out soon.
The problem is that QuorumCnxManager is using SocketChannel in blocking mode.
It does a read() in receiveConnection() and a write() in initiateConnection(),
with no timeout on either call, so each can block indefinitely if the peer does
not respond. Sorry, but this is really bad programming. It also points to a
lack of failure tests for QuorumCnxManager.
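
To illustrate the failure mode and one possible way around it, here is a
minimal sketch of a bounded read on a SocketChannel. It is not the patch for
this issue and is not taken from QuorumCnxManager: the class name BoundedRead,
the helper readFully(), and the demo() method are hypothetical names made up
for this example. The sketch switches the channel to non-blocking mode and
waits on a Selector with a deadline, so a peer that connects and then stays
silent cannot hold the reading thread forever.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical example class; not part of ZooKeeper.
public class BoundedRead {

    /**
     * Tries to fill buf from ch, giving up after timeoutMs milliseconds.
     * Returns true if the buffer was filled, false on timeout or end of stream.
     * Note: the channel is left in non-blocking mode.
     */
    public static boolean readFully(SocketChannel ch, ByteBuffer buf, long timeoutMs)
            throws IOException {
        ch.configureBlocking(false);
        long deadline = System.currentTimeMillis() + timeoutMs;
        try (Selector selector = Selector.open()) {
            ch.register(selector, SelectionKey.OP_READ);
            while (buf.hasRemaining()) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false; // deadline passed, stop waiting for the peer
                }
                if (selector.select(remaining) == 0) {
                    continue; // nothing readable yet; the loop re-checks the deadline
                }
                selector.selectedKeys().clear();
                if (ch.read(buf) < 0) {
                    return false; // peer closed the connection
                }
            }
            return true;
        }
    }

    /**
     * Demo on the loopback interface: a peer connects, sends payload
     * (possibly empty), and the accepting side tries to read 8 bytes.
     */
    public static boolean demo(byte[] payload, long timeoutMs) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // ephemeral port
            try (SocketChannel peer = SocketChannel.open(server.getLocalAddress());
                 SocketChannel accepted = server.accept()) {
                if (payload.length > 0) {
                    peer.write(ByteBuffer.wrap(payload));
                }
                return readFully(accepted, ByteBuffer.allocate(8), timeoutMs);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("silent peer filled buffer: " + demo(new byte[0], 200));
        System.out.println("8-byte peer filled buffer: " + demo(new byte[8], 2000));
    }
}
```

With a blocking channel, the silent-peer case is the situation in the thread
dump above: read() never returns. The same deadline approach would have to be
applied to the write() side as well; whether a Selector, a socket timeout, or a
separate thread is the right choice for QuorumCnxManager is left open here.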