This might be a wide shot and I did not see exactly the same error, but with corretto jdk 11.0.6 I had also issue that ZK could not a quorum. I downgraded to 11.0.5 and it did not have an issues. This was on ZK 3.5.5 with Kerberos authentication and authorization.
> Am 19.02.2020 um 13:27 schrieb Szalay-Bekő Máté <szalay.beko.m...@gmail.com>: > > Hi Steve! > > If you are using ZooKeeper newer than 3.5.0, then this might be the issue > we are just discussing / trying to fix in ZOOKEEPER-2164. > Can you test your setup with a config where you don't use 0.0.0.0 in the > server addresses? > > If you need to bind to the 0.0.0.0 locally, then please set the > 'quorumListenOnAllIPs' config property to true. > > like: > > # usually you don't really need this, unless if you actually need to bind > to multiple IPs > quorumListenOnAllIPs=true > > # it is best if all the zoo.cfg files contain the same address settings, > and we don't use 0.0.0.0 here > server.1=zookeeper1:2888:3888 > server.2=zookeeper2:2888:3888 > server.3=zookeeper3:2888:3888 > > Kind regards, > Mate > >> On Wed, Feb 19, 2020 at 6:08 AM Steve Jerman <st...@kloudspot.com> wrote: >> >> Hello folks, >> >> Wonder if anyone can help me. Suspect it must be something simple but I >> cant see it. Any suggestions about how to diagnose would be gratefully >> received. >> >> I have a three node ZK cluster, when it starts up only two of the nodes >> form a quorum. If I restart the leader the quorum reforms with the other >> two node… >> >> Thanks in advance for any help >> Steve >> >> This is the ‘stat’ for the leader/follow… >> >> bash-5.0$ echo stat | nc zookeeper1 2181 >> This ZooKeeper instance is not currently serving requests >> >> bash-5.0$ echo stat | nc zookeeper2 2181 >> Zookeeper version: 3.5.7-f0fdd52973d373ffd9c86b81d99842dc2c7f660e, built >> on 02/10/2020 11:30 GMT >> Clients: >> /10.0.1.152:44910[1](queued=0,recved=151,sent=151) >> /10.0.1.140:53138[1](queued=0,recved=187,sent=187) >> /10.0.1.143:57422[1](queued=0,recved=151,sent=151) >> /10.0.1.152:59242[0](queued=0,recved=1,sent=0) >> /10.0.1.143:40826[1](queued=0,recved=1139,sent=1139) >> /10.0.1.152:49188[1](queued=0,recved=200,sent=203) >> /10.0.1.152:59548[1](queued=0,recved=1157,sent=1159) >> /10.0.1.140:36624[1](queued=0,recved=151,sent=151) >> >> Latency min/avg/max: 0/0/5 >> Received: 3338 >> Sent: 3342 >> Connections: 8 >> Outstanding: 0 >> Zxid: 0xc000000f3 >> Mode: follower >> Node count: 181 >> >> bash-5.0$ echo stat | nc zookeeper3 2181 >> Zookeeper version: 3.5.7-f0fdd52973d373ffd9c86b81d99842dc2c7f660e, built >> on 02/10/2020 11:30 GMT >> Clients: >> /10.0.1.152:49428[0](queued=0,recved=1,sent=0) >> /10.0.1.140:32912[1](queued=0,recved=1426,sent=1429) >> >> Latency min/avg/max: 0/0/4 >> Received: 1684 >> Sent: 1686 >> Connections: 2 >> Outstanding: 0 >> Zxid: 0xc000000f4 >> Mode: leader >> Node count: 181 >> Proposal sizes last/min/max: 32/32/406 >> bash-5.0$ >> >> The trace for the failing node is: >> >> server.1=0.0.0.0:2888:3888 >> server.2=zookeeper2:2888:3888 >> server.3=zookeeper3:2888:3888 >> ZooKeeper JMX enabled by default >> Using config: /opt/zookeeper/bin/../conf/zoo.cfg >> 2020-02-19 04:23:27,759 [myid:] - INFO [main:QuorumPeerConfig@135] - >> Reading configuration from: /opt/zookeeper/bin/../conf/zoo.cfg >> 2020-02-19 04:23:27,764 [myid:] - INFO [main:QuorumPeerConfig@387] - >> clientPortAddress is 0.0.0.0:2181 >> 2020-02-19 04:23:27,764 [myid:] - INFO [main:QuorumPeerConfig@391] - >> secureClientPort is not set >> 2020-02-19 04:23:27,771 [myid:1] - INFO [main:DatadirCleanupManager@78] >> - autopurge.snapRetainCount set to 3 >> 2020-02-19 04:23:27,772 [myid:1] - INFO [main:DatadirCleanupManager@79] >> - autopurge.purgeInterval set to 24 >> 2020-02-19 04:23:27,772 [myid:1] - INFO >> [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started. >> 2020-02-19 04:23:27,773 [myid:1] - INFO [main:ManagedUtil@46] - Log4j >> found with jmx enabled. >> 2020-02-19 04:23:27,774 [myid:1] - INFO [PurgeTask:FileTxnSnapLog@115] - >> zookeeper.snapshot.trust.empty : false >> 2020-02-19 04:23:27,780 [myid:1] - INFO >> [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed. >> 2020-02-19 04:23:27,781 [myid:1] - INFO [main:QuorumPeerMain@141] - >> Starting quorum peer >> 2020-02-19 04:23:27,786 [myid:1] - INFO [main:ServerCnxnFactory@135] - >> Using org.apache.zookeeper.server.NIOServerCnxnFactory as server connection >> factory >> 2020-02-19 04:23:27,788 [myid:1] - INFO [main:NIOServerCnxnFactory@673] >> - Configuring NIO connection handler with 10s sessionless connection >> timeout, 2 selector thread(s), 32 worker threads, and 64 kB direct buffers. >> 2020-02-19 04:23:27,791 [myid:1] - INFO [main:NIOServerCnxnFactory@686] >> - binding to port 0.0.0.0/0.0.0.0:2181 >> 2020-02-19 <http://0.0.0.0/0.0.0.0:21812020-02-19> 04:23:27,809 [myid:1] >> - INFO [main:Log@169] - Logging initialized @249ms to >> org.eclipse.jetty.util.log.Slf4jLog >> 2020-02-19 04:23:27,913 [myid:1] - WARN [main:ContextHandler@1520] - >> o.e.j.s.ServletContextHandler@5abca1e0{/,null,UNAVAILABLE} contextPath >> ends with /* >> 2020-02-19 04:23:27,913 [myid:1] - WARN [main:ContextHandler@1531] - >> Empty contextPath >> 2020-02-19 04:23:27,922 [myid:1] - INFO [main:X509Util@79] - Setting -D >> jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated >> TLS renegotiation >> 2020-02-19 04:23:27,923 [myid:1] - INFO [main:FileTxnSnapLog@115] - >> zookeeper.snapshot.trust.empty : false >> 2020-02-19 04:23:27,923 [myid:1] - INFO [main:QuorumPeer@1470] - Local >> sessions disabled >> 2020-02-19 04:23:27,923 [myid:1] - INFO [main:QuorumPeer@1481] - Local >> session upgrading disabled >> 2020-02-19 04:23:27,923 [myid:1] - INFO [main:QuorumPeer@1448] - >> tickTime set to 2000 >> 2020-02-19 04:23:27,923 [myid:1] - INFO [main:QuorumPeer@1492] - >> minSessionTimeout set to 4000 >> 2020-02-19 04:23:27,924 [myid:1] - INFO [main:QuorumPeer@1503] - >> maxSessionTimeout set to 40000 >> 2020-02-19 04:23:27,924 [myid:1] - INFO [main:QuorumPeer@1518] - >> initLimit set to 30 >> 2020-02-19 04:23:27,932 [myid:1] - INFO [main:ZKDatabase@117] - >> zookeeper.snapshotSizeFactor = 0.33 >> 2020-02-19 04:23:27,933 [myid:1] - INFO [main:QuorumPeer@1763] - Using >> insecure (non-TLS) quorum communication >> 2020-02-19 04:23:27,933 [myid:1] - INFO [main:QuorumPeer@1769] - Port >> unification disabled >> 2020-02-19 04:23:27,933 [myid:1] - INFO [main:QuorumPeer@2136] - >> QuorumPeer communication is not secured! (SASL auth disabled) >> 2020-02-19 04:23:27,933 [myid:1] - INFO [main:QuorumPeer@2165] - >> quorum.cnxn.threads.size set to 20 >> 2020-02-19 04:23:27,934 [myid:1] - INFO [main:FileSnap@83] - Reading >> snapshot /opt/zookeeper/data/version-2/snapshot.90000043e >> 2020-02-19 04:23:27,963 [myid:1] - INFO [main:Server@359] - >> jetty-9.4.24.v20191120; built: 2019-11-20T21:37:49.771Z; git: >> 363d5f2df3a8a28de40604320230664b9c793c16; jvm >> 1.8.0_111-internal-alpine-r0-b14 >> 2020-02-19 04:23:27,989 [myid:1] - INFO [main:DefaultSessionIdManager@333] >> - DefaultSessionIdManager workerName=node0 >> 2020-02-19 04:23:27,989 [myid:1] - INFO [main:DefaultSessionIdManager@338] >> - No SessionScavenger set, using defaults >> 2020-02-19 04:23:27,990 [myid:1] - INFO [main:HouseKeeper@140] - node0 >> Scavenging every 660000ms >> 2020-02-19 04:23:27,997 [myid:1] - INFO [main:ContextHandler@825] - >> Started o.e.j.s.ServletContextHandler@5abca1e0{/,null,AVAILABLE} >> 2020-02-19 04:23:28,004 [myid:1] - INFO [main:AbstractConnector@330] - >> Started ServerConnector@2b98378d{HTTP/1.1,[http/1.1]}{0.0.0.0:8080} >> 2020-02-19 04:23:28,004 [myid:1] - INFO [main:Server@399] - Started >> @444ms >> 2020-02-19 04:23:28,004 [myid:1] - INFO [main:JettyAdminServer@112] - >> Started AdminServer on address 0.0.0.0, port 8080 and command URL /commands >> 2020-02-19 04:23:28,007 [myid:1] - INFO >> [main:QuorumCnxManager$Listener@867] - Election port bind maximum retries >> is 1000 >> 2020-02-19 04:23:28,007 [myid:1] - INFO >> [QuorumPeerListener:QuorumCnxManager$Listener@917] - My election bind >> port: /0.0.0.0:3888 >> 2020-02-19 04:23:28,014 [myid:1] - INFO >> [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] >> - LOOKING >> 2020-02-19 04:23:28,015 [myid:1] - INFO >> [QuorumPeer[myid=1](plain=0.0.0.0:2181 >> )(secure=disabled):FastLeaderElection@885] - New election. My id = 1, >> proposed zxid=0xa000000fb >> 2020-02-19 04:23:28,018 [myid:1] - INFO >> [WorkerSender[myid=1]:QuorumCnxManager@438] - Have smaller server >> identifier, so dropping the connection: (2, 1) >> 2020-02-19 04:23:28,019 [myid:1] - INFO >> [WorkerSender[myid=1]:QuorumCnxManager@438] - Have smaller server >> identifier, so dropping the connection: (3, 1) >> 2020-02-19 04:23:28,019 [myid:1] - INFO >> [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 >> (message format version), 1 (n.leader), 0xa000000fb (n.zxid), 0x1 >> (n.round), LOOKING (n.state), 1 (n.sid), 0xa (n.peerEPoch), LOOKING (my >> state)0 (n.config version) >> 2020-02-19 04:23:28,221 [myid:1] - INFO >> [QuorumPeer[myid=1](plain=0.0.0.0:2181 >> )(secure=disabled):QuorumCnxManager@438] - Have smaller server >> identifier, so dropping the connection: (2, 1) >> 2020-02-19 04:23:28,222 [myid:1] - INFO >> [QuorumPeer[myid=1](plain=0.0.0.0:2181 >> )(secure=disabled):QuorumCnxManager@438] - Have smaller server >> identifier, so dropping the connection: (3, 1) >> 2020-02-19 04:23:28,222 [myid:1] - INFO >> [QuorumPeer[myid=1](plain=0.0.0.0:2181 >> )(secure=disabled):FastLeaderElection@919] - Notification time out: 400 >> 2020-02-19 04:23:28,623 [myid:1] - INFO >> [QuorumPeer[myid=1](plain=0.0.0.0:2181 >> )(secure=disabled):QuorumCnxManager@438] - Have smaller server >> identifier, so dropping the connection: (2, 1) >> 2020-02-19 04:23:28,624 [myid:1] - INFO >> [QuorumPeer[myid=1](plain=0.0.0.0:2181 >> )(secure=disabled):QuorumCnxManager@438] - Have smaller server >> identifier, so dropping the connection: (3, 1) >> 2020-02-19 04:23:28,624 [myid:1] - INFO >> [QuorumPeer[myid=1](plain=0.0.0.0:2181 >> )(secure=disabled):FastLeaderElection@919] - Notification time out: 800 >> … >> >> And for the leader: >> …. >> 2020-02-19 05:02:10,341 [myid:3] - INFO >> [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: 2 >> (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 >> (n.round), LOOKING (n.state), 3 (n.sid), 0xb (n.peerEPoch), LEADING (my >> state)0 (n.config version) >> 2020-02-19 05:02:10,341 [myid:3] - INFO >> [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: 2 >> (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 >> (n.round), LEADING (n.state), 3 (n.sid), 0xc (n.peerEPoch), LEADING (my >> state)0 (n.config version) >> 2020-02-19 05:02:33,640 [myid:3] - WARN >> [NIOWorkerThread-4:NIOServerCnxn@366] - Unable to read additional data >> from client sessionid 0x30002ba40710018, likely client has closed socket >> 2020-02-19 05:02:39,047 [myid:3] - INFO >> [SessionTracker:ZooKeeperServer@398] - Expiring session >> 0x20002ba2fd6001a, timeout of 40000ms exceeded >> 2020-02-19 05:02:39,048 [myid:3] - INFO >> [SessionTracker:QuorumZooKeeperServer@157] - Submitting global >> closeSession request for session 0x20002ba2fd6001a >> 2020-02-19 05:03:10,340 [myid:3] - INFO [/0.0.0.0:3888 >> :QuorumCnxManager$Listener@924] - Received connection request >> 10.0.1.152:52492 >> 2020-02-19 05:03:10,340 [myid:3] - WARN >> [SendWorker:1:QuorumCnxManager$SendWorker@1143] - Interrupted while >> waiting for message on queue >> java.lang.InterruptedException >> at >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) >> at >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) >> at >> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) >> at >> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294) >> at >> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82) >> at >> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131) >> 2020-02-19 05:03:10,340 [myid:3] - WARN >> [SendWorker:1:QuorumCnxManager$SendWorker@1153] - Send worker leaving >> thread id 1 my id = 3 >> 2020-02-19 05:03:10,340 [myid:3] - WARN >> [RecvWorker:1:QuorumCnxManager$RecvWorker@1227] - Connection broken for >> id 1, my id = 3, error = >> java.net.SocketException: Socket closed >> at java.net.SocketInputStream.socketRead0(Native Method) >> at >> java.net.SocketInputStream.socketRead(SocketInputStream.java:116) >> at >> java.net.SocketInputStream.read(SocketInputStream.java:170) >> at >> java.net.SocketInputStream.read(SocketInputStream.java:141) >> at >> java.io.BufferedInputStream.fill(BufferedInputStream.java:246) >> at >> java.io.BufferedInputStream.read(BufferedInputStream.java:265) >> at >> java.io.DataInputStream.readInt(DataInputStream.java:387) >> at >> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1212) >> 2020-02-19 05:03:10,341 [myid:3] - WARN >> [RecvWorker:1:QuorumCnxManager$RecvWorker@1230] - Interrupting SendWorker >> 2020-02-19 05:03:10,340 [myid:3] - WARN >> [RecvWorker:3:QuorumCnxManager$RecvWorker@1227] - Connection broken for >> id 3, my id = 3, error = >> java.io.EOFException >> at >> java.io.DataInputStream.readInt(DataInputStream.java:392) >> at >> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1212) >> 2020-02-19 05:03:10,341 [myid:3] - WARN >> [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting SendWorker >> 2020-02-19 05:03:10,341 [myid:3] - INFO [/0.0.0.0:3888 >> :QuorumCnxManager$Listener@924] - Received connection request >> 10.0.1.142:46326 >> 2020-02-19 05:03:10,343 [myid:3] - WARN >> [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while >> waiting for message on queue >> java.lang.InterruptedException >> at >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) >> at >> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) >> at >> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) >> at >> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294) >> at >> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82) >> at >> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131) >> 2020-02-19 05:03:10,344 [myid:3] - WARN >> [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving >> thread id 3 my id = 3 >> 2020-02-19 05:03:10,344 [myid:3] - INFO >> [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: 2 >> (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 >> (n.round), LOOKING (n.state), 3 (n.sid), 0xb (n.peerEPoch), LEADING (my >> state)0 (n.config version) >> 2020-02-19 05:03:10,345 [myid:3] - INFO >> [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: 2 >> (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 >> (n.round), LEADING (n.state), 3 (n.sid), 0xc (n.peerEPoch), LEADING (my >> state)0 (n.config version) >> 2020-02-19 05:03:11,048 [myid:3] - INFO >> [SessionTracker:ZooKeeperServer@398] - Expiring session >> 0x30002ba40710018, timeout of 40000ms exceeded >> 2020-02-19 05:03:11,048 [myid:3] - INFO >> [SessionTracker:QuorumZooKeeperServer@157] - Submitting global >> closeSession request for session 0x30002ba40710018 >> >> All of the instances have a similar zoo.cfg: >> >> bash-4.3# cat conf/zoo.cfg >> # The number of milliseconds of each tick >> tickTime=2000 >> # The number of ticks that the initial >> # synchronization phase can take >> initLimit=30 >> # The number of ticks that can pass between >> # sending a request and getting an acknowledgement >> syncLimit=5 >> # Purge every 24 hours >> autopurge.purgeInterval=24 >> # the directory where the snapshot is stored. >> # do not use /tmp for storage, /tmp here is just >> # example sakes. >> dataDir=/opt/zookeeper/data >> # the port at which the clients will connect >> clientPort=2181 >> #Append other confg... >> electionPortBindRetry=1000 >> 4lw.commands.whitelist=stat, ruok, conf, mntr >> >> server.1=zookeeper1:2888:3888 >> server.2=zookeeper2:2888:3888 >> server.3=0.0.0.0:2888:3888 >>