[jira] [Commented] (ZOOKEEPER-3036) Unexpected exception in zookeeper
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306375#comment-17306375 ]
Lasaro Camargos commented on ZOOKEEPER-3036:
This issue still happens in 3.5.8, on a 3-node cluster. Are there any plans to address it?
> Unexpected exception in zookeeper
> -
>
> Key: ZOOKEEPER-3036
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3036
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.4.10
> Environment: 3 ZooKeepers, 5 Kafka servers
> Reporter: Oded
> Priority: Critical
>
> We got an issue with one of the ZooKeepers (the leader), causing the entire Kafka cluster to fail:
> 2018-05-09 02:29:01,730 [myid:3] - ERROR [LearnerHandler-/192.168.0.91:42490:LearnerHandler@648] - Unexpected exception causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
> at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
> 2018-05-09 02:29:01,730 [myid:3] - WARN [LearnerHandler-/192.168.0.91:42490:LearnerHandler@661] - *** GOODBYE /192.168.0.91:42490
>
> We would expect ZooKeeper to choose another leader and the Kafka cluster to continue working as expected, but that was not the case.
--
This message was sent by Atlassian Jira (v8.3.4#803005)
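The stack trace above bottoms out in DataInputStream.readInt() on the leader's connection to a learner: the leader blocks waiting for the 4-byte length prefix of the next QuorumPacket, and the socket's read timeout fires because the peer stopped sending. A minimal sketch of that failure mode (not ZooKeeper code; the class name and 200 ms timeout are illustrative assumptions):

```java
import java.io.DataInputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Loopback pair standing in for a leader<->learner connection.
        try (ServerSocket server = new ServerSocket(0);
             Socket learner = new Socket("localhost", server.getLocalPort());
             Socket leaderSide = server.accept()) {
            // Leader-side read timeout (ZooKeeper derives its value from the
            // tick-based config; 200 ms here just keeps the demo fast).
            leaderSide.setSoTimeout(200);
            DataInputStream in = new DataInputStream(leaderSide.getInputStream());
            try {
                in.readInt(); // the peer never writes, so this blocks, then fails
            } catch (SocketTimeoutException e) {
                System.out.println("read timed out: " + e.getMessage());
            }
        }
    }
}
```

The exception itself only says the follower went silent within the timeout window; whether the ensemble then elects a new leader depends on the remaining quorum, which is the behavior the reporter saw break.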
[jira] [Commented] (ZOOKEEPER-3775) Wrong message in IOException
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078718#comment-17078718 ] Lasaro Camargos commented on ZOOKEEPER-3775: Thank you, [~phunt]. But [~shireennagdive] has already provided the PR and is probably going to be more active in the project than me. Could you add her as a contributor as well so I can reassign to her? Regards > Wrong message in IOException > > > Key: ZOOKEEPER-3775 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3775 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection >Reporter: Lasaro Camargos >Assignee: Lasaro Camargos >Priority: Trivial > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > method run of QuorumCnxManager throws the following exception: > if (length <= 0 || length > PACKETMAXSIZE) { > throw new IOException("Received packet with invalid packet: " + length); > } > Instead of the current string, the cause should be "Received packet with > invalid length: " -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3775) Wrong message in IOException
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078609#comment-17078609 ] Lasaro Camargos commented on ZOOKEEPER-3775: Hi Shireen. I am not well versed in the processes followed by this project. [~phunt], as an active member, could you chip in here? Maybe assign the bug and trigger the build? Cheers > Wrong message in IOException > > > Key: ZOOKEEPER-3775 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3775 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection >Reporter: Lasaro Camargos >Priority: Trivial > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > method run of QuorumCnxManager throws the following exception: > if (length <= 0 || length > PACKETMAXSIZE) { > throw new IOException("Received packet with invalid packet: " + length); > } > Instead of the current string, the cause should be "Received packet with > invalid length: " -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3775) Wrong message in IOException
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17075945#comment-17075945 ] Lasaro Camargos commented on ZOOKEEPER-3775: Hi Shireen. I do not have the credentials needed to assign the JIRA. But given that it is a fairly simple issue, I would recommend that you create a PR and post it here and someone else will do it. Cheers. > Wrong message in IOException > > > Key: ZOOKEEPER-3775 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3775 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection >Reporter: Lasaro Camargos >Priority: Trivial > > method run of QuorumCnxManager throws the following exception: > if (length <= 0 || length > PACKETMAXSIZE) { > throw new IOException("Received packet with invalid packet: " + length); > } > Instead of the current string, the cause should be "Received packet with > invalid length: " -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074812#comment-17074812 ]
Lasaro Camargos commented on ZOOKEEPER-3769:
thanks for driving this, [~symat]
> fast leader election does not end if leader is taken down
> -
>
> Key: ZOOKEEPER-3769
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.5.7
> Reporter: Lasaro Camargos
> Assignee: Mate Szalay-Beko
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.6.1, 3.5.8
>
> Attachments: node1.log, node2.log, node3.log
>
> Time Spent: 3h 10m
> Remaining Estimate: 0h
>
> In a cluster with three nodes, node3 is the leader and the other nodes are followers.
> If I stop node3, the other two nodes do not finish the leader election.
> This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and this config:
>
> tickTime=2000
> initLimit=30
> syncLimit=3
> dataDir=/company/service/data
> dataLogDir=/company/service/log
> clientPort=2181
> snapCount=10
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> skipACL=yes
> preAllocSize=65536
> maxClientCnxns=0
> 4lw.commands.whitelist=*
> admin.enableServer=false
> server.1=companydemo1.snc4.companyinc.com:3000:4000
> server.2=companydemo2.snc4.companyinc.com:3000:4000
> server.3=companydemo3.snc4.companyinc.com:3000:4000
>
> Could you have a look at the logs and help me figure this out? It seems like node1 is not getting notifications back from node2, but I don't see anything wrong with the network, so I am wondering if bugs like ZOOKEEPER-3756 could be causing it.
>
> In the logs, node3 is killed at 11:17:14, node2 at 11:17:50, and node1 at 11:18:02.
>
--
This message was sent by Atlassian Jira (v8.3.4#803005)
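For reading the config quoted in this issue: initLimit and syncLimit are expressed in ticks, so the wall-clock windows they imply are simple multiples of tickTime. A back-of-the-envelope sketch of that arithmetic (plain illustration, not ZooKeeper source):

```java
public class TimeoutSketch {
    public static void main(String[] args) {
        // Values taken from the zoo.cfg quoted in the issue description.
        int tickTime = 2000;  // ms per tick
        int initLimit = 30;   // ticks a follower may take to sync with the leader initially
        int syncLimit = 3;    // ticks a follower may lag before the leader drops it

        // Both limits are tick counts, so the effective windows are products:
        System.out.println("initial sync window: " + tickTime * initLimit + " ms");
        System.out.println("follower sync window: " + tickTime * syncLimit + " ms");
    }
}
```

So with this config a follower that goes quiet for more than about 6 seconds gets disconnected by the leader, which matches how quickly connections are torn down in the attached logs.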
[jira] [Created] (ZOOKEEPER-3775) Wrong message in IOException
Lasaro Camargos created ZOOKEEPER-3775:
--
Summary: Wrong message in IOException
Key: ZOOKEEPER-3775
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3775
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection
Reporter: Lasaro Camargos

The run method of QuorumCnxManager throws the following exception:

if (length <= 0 || length > PACKETMAXSIZE) {
    throw new IOException("Received packet with invalid packet: " + length);
}

Instead of the current string, the message should be "Received packet with invalid length: "
--
This message was sent by Atlassian Jira (v8.3.4#803005)
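A minimal sketch of the proposed fix, with the length check extracted into a helper for illustration (the class and method names and the PACKETMAXSIZE value here are assumptions, not the actual QuorumCnxManager source):

```java
import java.io.IOException;

public class PacketLengthCheck {
    // Illustrative bound only; stands in for QuorumCnxManager's PACKETMAXSIZE.
    static final int PACKETMAXSIZE = 1024 * 512;

    // Validate an announced packet length, reporting it as an invalid
    // *length* rather than an "invalid packet" -- the wording fix this
    // issue asks for.
    static void checkLength(int length) throws IOException {
        if (length <= 0 || length > PACKETMAXSIZE) {
            throw new IOException("Received packet with invalid length: " + length);
        }
    }

    public static void main(String[] args) {
        try {
            checkLength(-1); // e.g. a length read off a broken connection
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```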
[jira] [Commented] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070311#comment-17070311 ] Lasaro Camargos commented on ZOOKEEPER-3769: I had to backtrack on it happening with Netty. The factory was misconfigured and it was actually running on NIO. Regarding the version, I tried 3.5.5 and 3.5.7. Lásaro On Sun, Mar 29, 2020 at 4:26 AM ASF GitHub Bot (Jira) > fast leader election does not end if leader is taken down > - > > Key: ZOOKEEPER-3769 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection >Affects Versions: 3.5.7 >Reporter: Lasaro Camargos >Assignee: Mate Szalay-Beko >Priority: Major > Labels: pull-request-available > Fix For: 3.6.1, 3.5.8 > > Attachments: node1.log, node2.log, node3.log > > Time Spent: 1h 10m > Remaining Estimate: 0h > > In a cluster with three nodes, node3 is the leader and the other nodes are > followers. > If I stop node3, the other two nodes do not finish the leader election. > This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and > this config > > tickTime=2000 > initLimit=30 > syncLimit=3 > dataDir=/company/service/data > dataLogDir=/company/service/log > clientPort=2181 > snapCount=10 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > skipACL=yes > preAllocSize=65536 > maxClientCnxns=0 > 4lw.commands.whitelist=* > admin.enableServer=false > server.1=companydemo1.snc4.companyinc.com:3000:4000 > server.2=companydemo2.snc4.companyinc.com:3000:4000 > server.3=companydemo3.snc4.companyinc.com:3000:4000 > > Could you have a look at the logs and help me figure this out? It seems like > node 1 is not getting notifications back from node2, but I don't see anything > wrong with the network so I am wondering if bugs like ZOOKEEPER-3756 could > be causing it. 
> > In the logs, node3 is killed at 11:17:14 > node2 is killed at 11:17:50 2 and node 1 at 11:18:02 > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068985#comment-17068985 ]
Lasaro Camargos commented on ZOOKEEPER-3769:
[~symat], thanks for the updated patch. I gave it a spin and it is working, in the sense that it is not regressing anything else. I cannot confirm that it handles the issue I had, as I still haven't managed to reproduce it.
Trying to answer your questions:
# There is nothing particular to this setup; all are physical boxes, running on the same network, OS (CentOS 7), and Java version (12).
# During the time the problem reproduced, I had multiple runs in which I just restarted the service, but also runs in which I cleaned the setup. It consistently reproduced, until it didn't. Whatever it was, it doesn't seem related to the snapshots.
# Regarding dynamic reconfiguration, no, I haven't used it in this setup.
# You had asked me if I had tried Netty. Please ignore my previous response. I didn't try it while the problem still reproduced.
Even if I cannot reproduce it, I still think this is a fix worth having. Please submit the PR. Should I change the Jira name to better reflect what actually happened?
> fast leader election does not end if leader is taken down
> -
>
> Key: ZOOKEEPER-3769
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.5.7
> Reporter: Lasaro Camargos
> Assignee: Mate Szalay-Beko
> Priority: Major
> Attachments: node1.log, node2.log, node3.log
>
> In a cluster with three nodes, node3 is the leader and the other nodes are followers.
> If I stop node3, the other two nodes do not finish the leader election.
> This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and this config:
>
> tickTime=2000
> initLimit=30
> syncLimit=3
> dataDir=/company/service/data
> dataLogDir=/company/service/log
> clientPort=2181
> snapCount=10
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> skipACL=yes
> preAllocSize=65536
> maxClientCnxns=0
> 4lw.commands.whitelist=*
> admin.enableServer=false
> server.1=companydemo1.snc4.companyinc.com:3000:4000
> server.2=companydemo2.snc4.companyinc.com:3000:4000
> server.3=companydemo3.snc4.companyinc.com:3000:4000
>
> Could you have a look at the logs and help me figure this out? It seems like node1 is not getting notifications back from node2, but I don't see anything wrong with the network, so I am wondering if bugs like ZOOKEEPER-3756 could be causing it.
>
> In the logs, node3 is killed at 11:17:14, node2 at 11:17:50, and node1 at 11:18:02.
>
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lasaro Camargos updated ZOOKEEPER-3769: --- Description: In a cluster with three nodes, node3 is the leader and the other nodes are followers. If I stop node3, the other two nodes do not finish the leader election. This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and this config tickTime=2000 initLimit=30 syncLimit=3 dataDir=/company/service/data dataLogDir=/company/service/log clientPort=2181 snapCount=10 autopurge.snapRetainCount=3 autopurge.purgeInterval=1 skipACL=yes preAllocSize=65536 maxClientCnxns=0 4lw.commands.whitelist=* admin.enableServer=false server.1=companydemo1.snc4.companyinc.com:3000:4000 server.2=companydemo2.snc4.companyinc.com:3000:4000 server.3=companydemo3.snc4.companyinc.com:3000:4000 Could you have a look at the logs and help me figure this out? It seems like node 1 is not getting notifications back from node2, but I don't see anything wrong with the network so I am wondering if bugs like ZOOKEEPER-3756 could be causing it. In the logs, node3 is killed at 11:17:14 node2 is killed at 11:17:50 2 and node 1 at 11:18:02 was: In a cluster with three nodes, node3 is the leader and the other nodes are followers. If I stop node3, the other two nodes do not finish the leader election. This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and this config tickTime=2000 initLimit=30 syncLimit=3 dataDir=/hedvig/hpod/data dataLogDir=/hedvig/hpod/log clientPort=2181 snapCount=10 autopurge.snapRetainCount=3 autopurge.purgeInterval=1 skipACL=yes preAllocSize=65536 maxClientCnxns=0 4lw.commands.whitelist=* admin.enableServer=false server.1=companydemo1.snc4.companyinc.com:3000:4000 server.2=companydemo2.snc4.companyinc.com:3000:4000 server.3=companydemo3.snc4.companyinc.com:3000:4000 Could you have a look at the logs and help me figure this out? 
It seems like node 1 is not getting notifications back from node2, but I don't see anything wrong with the network so I am wondering if bugs like ZOOKEEPER-3756 could be causing it. In the logs, node3 is killed at 11:17:14 node2 is killed at 11:17:50 2 and node 1 at 11:18:02 > fast leader election does not end if leader is taken down > - > > Key: ZOOKEEPER-3769 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection >Affects Versions: 3.5.7 >Reporter: Lasaro Camargos >Assignee: Mate Szalay-Beko >Priority: Major > Attachments: node1.log, node2.log, node3.log > > > In a cluster with three nodes, node3 is the leader and the other nodes are > followers. > If I stop node3, the other two nodes do not finish the leader election. > This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and > this config > > tickTime=2000 > initLimit=30 > syncLimit=3 > dataDir=/company/service/data > dataLogDir=/company/service/log > clientPort=2181 > snapCount=10 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > skipACL=yes > preAllocSize=65536 > maxClientCnxns=0 > 4lw.commands.whitelist=* > admin.enableServer=false > server.1=companydemo1.snc4.companyinc.com:3000:4000 > server.2=companydemo2.snc4.companyinc.com:3000:4000 > server.3=companydemo3.snc4.companyinc.com:3000:4000 > > Could you have a look at the logs and help me figure this out? It seems like > node 1 is not getting notifications back from node2, but I don't see anything > wrong with the network so I am wondering if bugs like ZOOKEEPER-3756 could > be causing it. > > In the logs, node3 is killed at 11:17:14 > node2 is killed at 11:17:50 2 and node 1 at 11:18:02 > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068032#comment-17068032 ]
Lasaro Camargos commented on ZOOKEEPER-3769:
I went back and looked into some older logs and could confirm that the WorkerReceiver died, and that is what caused the election to hang. However, the BufferUnderflowException was present in very few instances. Most of the time it was a NegativeArraySizeException that was caught, but in pretty much the same situation, that is, after the connection to node3 was broken. The following are excerpts from node1 and node3. Let me know if you would like to have a look at the full logs.

03/23/20 10:14:45,772 [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] INFO [org.apache.zookeeper.server.ZooKeeperServer] (ZooKeeperServer.java:166) - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir /company/service/log/version-2 snapdir /company/service/data/version-2
03/23/20 10:14:45,772 [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] INFO [org.apache.zookeeper.server.quorum.Learner] (Follower.java:69) - FOLLOWING - LEADER ELECTION TOOK - 9 MS
03/23/20 10:14:45,774 [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] DEBUG [org.apache.zookeeper.server.quorum.QuorumPeer] (QuorumPeer.java:202) - Resolved address for companydemo3.snc4.companyinc.com: companydemo3.snc4.companyinc.com/172.22.64.148
03/23/20 10:14:45,793 [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] TRACE [org.apache.zookeeper.server.quorum.Learner] (ZooTrace.java:71) - i UNKNOWN17 5 null
03/23/20 10:14:45,798 [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] TRACE [org.apache.zookeeper.server.quorum.Learner] (ZooTrace.java:71) - i DIFF 4001f null
03/23/20 10:14:45,799 [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] INFO [org.apache.zookeeper.server.quorum.Learner] (Learner.java:391) - Getting a diff from the leader 0x4001f
03/23/20 10:14:45,801 [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] TRACE [org.apache.zookeeper.server.quorum.Learner] (ZooTrace.java:71) - i NEWLEADER 5 null
03/23/20 10:14:45,801 [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] INFO [org.apache.zookeeper.server.quorum.Learner] (Learner.java:546) - Learner received NEWLEADER message
03/23/20 10:14:45,815 [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] TRACE [org.apache.zookeeper.server.quorum.Learner] (ZooTrace.java:71) - i UPTODATE null
03/23/20 10:14:45,816 [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] INFO [org.apache.zookeeper.server.quorum.Learner] (Learner.java:529) - Learner received UPTODATE message
03/23/20 10:14:45,816 [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] DEBUG [org.apache.zookeeper.server.quorum.QuorumPeer] (QuorumPeer.java:1916) - Reconfig feature is disabled, skip reconfig processing.
03/23/20 10:14:45,817 [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled)] INFO [org.apache.zookeeper.server.quorum.CommitProcessor] (CommitProcessor.java:256) - Configuring CommitProcessor with 32 worker threads.
03/23/20 10:14:46,064 [companydemo1.snc4.companyinc.com/172.22.65.65:4000] INFO [org.apache.zookeeper.server.quorum.QuorumCnxManager] (QuorumCnxManager.java:924) - Received connection request 172.22.30.98:58472
03/23/20 10:14:46,064 [companydemo1.snc4.companyinc.com/172.22.65.65:4000] DEBUG [org.apache.zookeeper.server.quorum.QuorumCnxManager] (QuorumCnxManager.java:1038) - Address of remote peer: 3
03/23/20 10:14:46,064 [companydemo1.snc4.companyinc.com/172.22.65.65:4000] DEBUG [org.apache.zookeeper.server.quorum.QuorumCnxManager] (QuorumCnxManager.java:1055) - Calling finish for 3
03/23/20 10:14:46,064 [companydemo1.snc4.companyinc.com/172.22.65.65:4000] DEBUG [org.apache.zookeeper.server.quorum.QuorumCnxManager] (QuorumCnxManager.java:1072) - Removing entry from senderWorkerMap sid=3
03/23/20 10:14:46,065 [SendWorker:3] WARN [org.apache.zookeeper.server.quorum.QuorumCnxManager] (QuorumCnxManager.java:1143) - Interrupted while waiting for message on queue
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294) ~[zookeeper-3.5.7.jar:3.5.7]
at org.apache.zookeeper.server.quorum.QuorumCnxManager.acce
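The NegativeArraySizeException / BufferUnderflowException pattern described in this comment is the classic symptom of sizing a buffer from a length field read off a torn connection. A simplified sketch of the idea, not the actual QuorumCnxManager receive loop (names and the size cap are illustrative): validating the length first turns the fatal unchecked exception into an IOException the receiver thread can survive.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class SafeReceiveSketch {
    static final int MAX_MSG = 1024 * 512; // illustrative cap only

    // Read a length-prefixed message, rejecting nonsense lengths up front.
    // Without the check, "new byte[length]" on a negative length throws
    // NegativeArraySizeException, which (if unhandled) kills the receiver
    // thread -- the failure mode discussed in this issue.
    static byte[] readMessage(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length <= 0 || length > MAX_MSG) {
            throw new IOException("invalid message length: " + length);
        }
        byte[] buf = new byte[length];
        in.readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws Exception {
        // A "stream" whose first int is -1, as a torn read might yield.
        byte[] bogus = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};
        try {
            readMessage(new DataInputStream(new ByteArrayInputStream(bogus)));
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```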
[jira] [Comment Edited] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066943#comment-17066943 ] Lasaro Camargos edited comment on ZOOKEEPER-3769 at 3/25/20, 7:20 PM: -- Thank you for the analysis, [~symat]. Wrt to testing with NETTY, before trying SASL I did try just NETTY, but the behavior was exactly the same. Wrt to using an older JDK, I reverted all my changes to the configs and put back the original version, 3.5.5, but didn't get to try other JDK. The problem no longer reproduces and I am still trying to figure if/what I am missing that might have changed the setup. Regarding not handling the BufferUnderflowException properly, yes, it makes sense; the thread died and wasn't recreated so no more messages were ever received. was (Author: lasaro): Thank you for the analysis, [~symat]. Wrt to testing with NETTY, before trying SASL I did try just NETTY, but the behavior was exactly the same. Wrt to using an older JDK, I reverted all my changes to the configs and put back the original version, 3.5.5, but didn't get to try other JDK. The problem no longer reproduces and I am still trying to figure if/what I am missing that might have changed the setup. > fast leader election does not end if leader is taken down > - > > Key: ZOOKEEPER-3769 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection >Affects Versions: 3.5.7 >Reporter: Lasaro Camargos >Assignee: Mate Szalay-Beko >Priority: Major > Attachments: node1.log, node2.log, node3.log > > > In a cluster with three nodes, node3 is the leader and the other nodes are > followers. > If I stop node3, the other two nodes do not finish the leader election. 
> This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and > this config > > tickTime=2000 > initLimit=30 > syncLimit=3 > dataDir=/hedvig/hpod/data > dataLogDir=/hedvig/hpod/log > clientPort=2181 > snapCount=10 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > skipACL=yes > preAllocSize=65536 > maxClientCnxns=0 > 4lw.commands.whitelist=* > admin.enableServer=false > server.1=companydemo1.snc4.companyinc.com:3000:4000 > server.2=companydemo2.snc4.companyinc.com:3000:4000 > server.3=companydemo3.snc4.companyinc.com:3000:4000 > > Could you have a look at the logs and help me figure this out? It seems like > node 1 is not getting notifications back from node2, but I don't see anything > wrong with the network so I am wondering if bugs like ZOOKEEPER-3756 could > be causing it. > > In the logs, node3 is killed at 11:17:14 > node2 is killed at 11:17:50 2 and node 1 at 11:18:02 > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066943#comment-17066943 ] Lasaro Camargos commented on ZOOKEEPER-3769: Thank you for the analysis, [~symat]. Wrt to testing with NETTY, before trying SASL I did try just NETTY, but the behavior was exactly the same. Wrt to using an older JDK, I reverted all my changes to the configs and put back the original version, 3.5.5, but didn't get to try other JDK. The problem no longer reproduces and I am still trying to figure if/what I am missing that might have changed the setup. > fast leader election does not end if leader is taken down > - > > Key: ZOOKEEPER-3769 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection >Affects Versions: 3.5.7 >Reporter: Lasaro Camargos >Assignee: Mate Szalay-Beko >Priority: Major > Attachments: node1.log, node2.log, node3.log > > > In a cluster with three nodes, node3 is the leader and the other nodes are > followers. > If I stop node3, the other two nodes do not finish the leader election. > This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and > this config > > tickTime=2000 > initLimit=30 > syncLimit=3 > dataDir=/hedvig/hpod/data > dataLogDir=/hedvig/hpod/log > clientPort=2181 > snapCount=10 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > skipACL=yes > preAllocSize=65536 > maxClientCnxns=0 > 4lw.commands.whitelist=* > admin.enableServer=false > server.1=companydemo1.snc4.companyinc.com:3000:4000 > server.2=companydemo2.snc4.companyinc.com:3000:4000 > server.3=companydemo3.snc4.companyinc.com:3000:4000 > > Could you have a look at the logs and help me figure this out? It seems like > node 1 is not getting notifications back from node2, but I don't see anything > wrong with the network so I am wondering if bugs like ZOOKEEPER-3756 could > be causing it. 
> > In the logs, node3 is killed at 11:17:14 > node2 is killed at 11:17:50 2 and node 1 at 11:18:02 > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066245#comment-17066245 ] Lasaro Camargos edited comment on ZOOKEEPER-3769 at 3/24/20, 11:02 PM: --- After I enabled SASL in order to force the asynchronous creation of sockets, the problem no longer reproduces. Hence I am guessing this might be related to ZOOKEEPER-900 was (Author: lasaro): After I enabled SASL in order to force the asynchronous creation of sockets, the problem no longer reproduces. > fast leader election does not end if leader is taken down > - > > Key: ZOOKEEPER-3769 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection >Affects Versions: 3.5.7 >Reporter: Lasaro Camargos >Assignee: Mate Szalay-Beko >Priority: Major > Attachments: node1.log, node2.log, node3.log > > > In a cluster with three nodes, node3 is the leader and the other nodes are > followers. > If I stop node3, the other two nodes do not finish the leader election. > This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and > this config > > tickTime=2000 > initLimit=30 > syncLimit=3 > dataDir=/hedvig/hpod/data > dataLogDir=/hedvig/hpod/log > clientPort=2181 > snapCount=10 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > skipACL=yes > preAllocSize=65536 > maxClientCnxns=0 > 4lw.commands.whitelist=* > admin.enableServer=false > server.1=companydemo1.snc4.companyinc.com:3000:4000 > server.2=companydemo2.snc4.companyinc.com:3000:4000 > server.3=companydemo3.snc4.companyinc.com:3000:4000 > > Could you have a look at the logs and help me figure this out? It seems like > node 1 is not getting notifications back from node2, but I don't see anything > wrong with the network so I am wondering if bugs like ZOOKEEPER-3756 could > be causing it. 
> > In the logs, node3 is killed at 11:17:14 > node2 is killed at 11:17:50 2 and node 1 at 11:18:02 > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066245#comment-17066245 ] Lasaro Camargos commented on ZOOKEEPER-3769: After I enabled SASL in order to force the asynchronous creation of sockets, the problem no longer reproduces. > fast leader election does not end if leader is taken down > - > > Key: ZOOKEEPER-3769 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection >Affects Versions: 3.5.7 >Reporter: Lasaro Camargos >Assignee: Mate Szalay-Beko >Priority: Major > Attachments: node1.log, node2.log, node3.log > > > In a cluster with three nodes, node3 is the leader and the other nodes are > followers. > If I stop node3, the other two nodes do not finish the leader election. > This is happening with ZK 3.5.7, openjdk version "12.0.2" 2019-07-16, and > this config > > tickTime=2000 > initLimit=30 > syncLimit=3 > dataDir=/hedvig/hpod/data > dataLogDir=/hedvig/hpod/log > clientPort=2181 > snapCount=10 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > skipACL=yes > preAllocSize=65536 > maxClientCnxns=0 > 4lw.commands.whitelist=* > admin.enableServer=false > server.1=companydemo1.snc4.companyinc.com:3000:4000 > server.2=companydemo2.snc4.companyinc.com:3000:4000 > server.3=companydemo3.snc4.companyinc.com:3000:4000 > > Could you have a look at the logs and help me figure this out? It seems like > node 1 is not getting notifications back from node2, but I don't see anything > wrong with the network so I am wondering if bugs like ZOOKEEPER-3756 could > be causing it. > > In the logs, node3 is killed at 11:17:14 > node2 is killed at 11:17:50 2 and node 1 at 11:18:02 > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066185#comment-17066185 ]

Lasaro Camargos commented on ZOOKEEPER-3769:
--------------------------------------------

To complement on the behavior (not covered by the logs): if I bring node 3 back up, it becomes the leader, node 2 becomes a follower, and node 1 does not finish the election. If I then stop and restart node 1, it joins the cluster successfully. It seems like the connection from node 1 to node 2 needs a "refresh" in order to work properly.
[jira] [Updated] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lasaro Camargos updated ZOOKEEPER-3769:
---------------------------------------
[jira] [Commented] (ZOOKEEPER-3756) Members failing to rejoin quorum
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066127#comment-17066127 ] Lasaro Camargos commented on ZOOKEEPER-3756: Thanks for the feedback. I've opened ZOOKEEPER-3769 with a slightly different scenario but problematic in the same sense. To give the complete answer, I am not using 0.0.0.0 addresses (not explicitly, at least) and not using containers. [~symat], I appreciate your willingness to look into it. It's been troubling me for some time. > Members failing to rejoin quorum > > > Key: ZOOKEEPER-3756 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756 > Project: ZooKeeper > Issue Type: Improvement > Components: leaderElection >Affects Versions: 3.5.6, 3.5.7 >Reporter: Dai Shi >Assignee: Mate Szalay-Beko >Priority: Major > Labels: pull-request-available > Fix For: 3.6.1, 3.5.8 > > Attachments: Dockerfile, configmap.yaml, docker-entrypoint.sh, > jmx.yaml, zoo-0.log, zoo-1.log, zoo-2.log, zoo-service.yaml, zookeeper.yaml > > Time Spent: 3h > Remaining Estimate: 0h > > Not sure if this is the place to ask, please close if it's not. > I am seeing some behavior that I can't explain since upgrading to 3.5: > In a 5 member quorum, when server 3 is the leader and each server has this in > their configuration: > {code:java} > server.1=100.71.255.254:2888:3888:participant;2181 > server.2=100.71.255.253:2888:3888:participant;2181 > server.3=100.71.255.252:2888:3888:participant;2181 > server.4=100.71.255.251:2888:3888:participant;2181 > server.5=100.71.255.250:2888:3888:participant;2181{code} > If servers 1 or 2 are restarted, they fail to rejoin the quorum with this in > the logs: > {code:java} > 2020-03-11 20:23:35,720 [myid:2] - INFO > [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - > LOOKING > 2020-03-11 20:23:35,721 [myid:2] - INFO > [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885] > - New election. 
My id = 2, proposed zxid=0x1b8005f4bba > 2020-03-11 20:23:35,733 [myid:2] - INFO > [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, > so dropping the connection: (3, 2) > 2020-03-11 20:23:35,734 [myid:2] - INFO > [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection > request 100.126.116.201:36140 > 2020-03-11 20:23:35,735 [myid:2] - INFO > [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, > so dropping the connection: (4, 2) > 2020-03-11 20:23:35,740 [myid:2] - INFO > [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, > so dropping the connection: (5, 2) > 2020-03-11 20:23:35,740 [myid:2] - INFO > [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection > request 100.126.116.201:36142 > 2020-03-11 20:23:35,740 [myid:2] - INFO > [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message > format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING > (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config > version) > 2020-03-11 20:23:35,742 [myid:2] - WARN > [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting > for message on queue > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > at > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131) > 2020-03-11 20:23:35,744 [myid:2] - WARN > 
[SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread > id 3 my id = 2 > 2020-03-11 20:23:35,745 [myid:2] - WARN > [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting > SendWorker{code} > The only way I can seem to get them to rejoin the quorum is to restart the > leader. > However, if I remove server 4 and 5 from the configuration of server 1 or 2 > (so only servers 1, 2, and 3 remain in the configuration file), then they can > rejoin the quorum fine. Is this expected and am I doing something wrong? Any > help or explanation would be greatly appreciated. Thank you.
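The "Have smaller server identifier, so dropping the connection" lines in the log above come from ZooKeeper's rule that at most one election connection exists per pair of servers. A minimal sketch of that tie-breaking rule (the class and method names are hypothetical, not the actual QuorumCnxManager API):

```java
// Illustrative sketch of ZooKeeper's election-channel tie-breaking rule.
// ZooKeeper keeps a single election socket per pair of servers: when a
// server dials a peer with a HIGHER server id, it drops the socket it just
// opened and waits for the higher-id peer to dial back.
public class ConnectionRule {

    /** True if the dialing server may keep the socket it just opened. */
    static boolean initiatorKeepsConnection(long mySid, long remoteSid) {
        return mySid > remoteSid;
    }

    public static void main(String[] args) {
        // Server 2 dials servers 3, 4 and 5: all three sockets are dropped,
        // matching the "(3, 2)", "(4, 2)" and "(5, 2)" log lines above.
        for (long remote : new long[] {3, 4, 5}) {
            System.out.println("2 -> " + remote + ": keep="
                    + initiatorKeepsConnection(2, remote));
        }
        // Dialing a lower id (2 -> 1) keeps the socket.
        System.out.println("2 -> 1: keep=" + initiatorKeepsConnection(2, 1));
    }
}
```

The consequence is that a restarted low-id server depends entirely on the higher-id peers dialing back; if a higher-id peer still holds a stale connection for that pair, the dial-back never happens and the restarted server stays LOOKING, which is consistent with the behavior reported here.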
[jira] [Updated] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lasaro Camargos updated ZOOKEEPER-3769:
---------------------------------------
[jira] [Updated] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lasaro Camargos updated ZOOKEEPER-3769:
---------------------------------------
[jira] [Updated] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lasaro Camargos updated ZOOKEEPER-3769:
---------------------------------------
[jira] [Updated] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lasaro Camargos updated ZOOKEEPER-3769:
---------------------------------------
    Attachment: node1.log
                node2.log
                node3.log
[jira] [Created] (ZOOKEEPER-3769) fast leader election does not end if leader is taken down
Lasaro Camargos created ZOOKEEPER-3769:
---------------------------------------

             Summary: fast leader election does not end if leader is taken down
                 Key: ZOOKEEPER-3769
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3769
             Project: ZooKeeper
          Issue Type: Bug
          Components: leaderElection
    Affects Versions: 3.5.7
            Reporter: Lasaro Camargos

In a cluster with three nodes, node3 is the leader and the other nodes are followers.
If I stop node3, the other two nodes do not finish the leader election.
[jira] [Comment Edited] (ZOOKEEPER-3756) Members failing to rejoin quorum
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065017#comment-17065017 ]

Lasaro Camargos edited comment on ZOOKEEPER-3756 at 3/23/20, 6:56 PM:
----------------------------------------------------------------------

Dear all, currently I am consistently facing the following scenario while running 3.5.5 and 3.5.7, which I believe is related to this bug:

3 nodes up.
Node 3 stop and start -> node 2 is elected; node 1 follows.
Node 3 start -> node 3 is elected leader; node 2 follows; node 1 is unable to finish the election.
Node 1 stop and start -> node 1 rejoins the quorum.
Node 2 stop and start -> node 2 is unable to finish the election.
Node 1 stop and start -> node 2 joins the quorum; node 1 joins the quorum.
Node 2 stop and start -> node 2 is unable to join the quorum.
Node 3 stop and start -> node 3 is elected leader; node 2 follows; node 1 is unable to finish the election.

Reducing the cnxTimeout value didn't change the behavior.

I tested with this fix and now it is worse; after a round of restarts, there doesn't seem to be anything I can do to make node 1 finish the election.

This is such a nasty problem that I am wondering if there is something else to it. Maybe my configuration. Could you point me to what information would be useful in order to debug this better? Full logs?
[jira] [Commented] (ZOOKEEPER-3756) Members failing to rejoin quorum
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065017#comment-17065017 ]

Lasaro Camargos commented on ZOOKEEPER-3756:
--------------------------------------------