[ https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058962#comment-17058962 ]
Dai Shi commented on ZOOKEEPER-3756:
------------------------------------

Here are the Kubernetes files:

[^zookeeper.yaml]
[^configmap.yaml]

The ZooKeeper YAML is lightly redacted to remove some sensitive information, but the changes should be purely cosmetic.

Here are the Docker image build files:

[^Dockerfile]
[^docker-entrypoint.sh]
[^jmx.yaml]

Note that while docker-entrypoint.sh tries to build the config file if it doesn't exist, I have since updated our Kubernetes config to supply a configmap with the config file, so the entrypoint no longer needs to generate it.

My gut feeling is that this should be reproducible in minikube on a single machine, but I have not tried.

Please let me know if you have any questions regarding these files or our setup.

> Members failing to rejoin quorum
> --------------------------------
>
>                 Key: ZOOKEEPER-3756
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: leaderElection
>    Affects Versions: 3.5.6, 3.5.7
>            Reporter: Dai Shi
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>         Attachments: Dockerfile, configmap.yaml, docker-entrypoint.sh, jmx.yaml, zookeeper.yaml
>
>
> Not sure if this is the place to ask; please close if it's not.
> I am seeing some behavior that I can't explain since upgrading to 3.5:
> In a 5 member quorum, when server 3 is the leader and each server has this in
> its configuration:
> {code:java}
> server.1=100.71.255.254:2888:3888:participant;2181
> server.2=100.71.255.253:2888:3888:participant;2181
> server.3=100.71.255.252:2888:3888:participant;2181
> server.4=100.71.255.251:2888:3888:participant;2181
> server.5=100.71.255.250:2888:3888:participant;2181{code}
> If server 1 or 2 is restarted, it fails to rejoin the quorum with this in
> the logs:
> {code:java}
> 2020-03-11 20:23:35,720 [myid:2] - INFO [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - LOOKING
> 2020-03-11 20:23:35,721 [myid:2] - INFO [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885] - New election. My id = 2, proposed zxid=0x1b8005f4bba
> 2020-03-11 20:23:35,733 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (3, 2)
> 2020-03-11 20:23:35,734 [myid:2] - INFO [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 100.126.116.201:36140
> 2020-03-11 20:23:35,735 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (4, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (5, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 100.126.116.201:36142
> 2020-03-11 20:23:35,740 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2020-03-11 20:23:35,742 [myid:2] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting for message on queue
> java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>     at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
> 2020-03-11 20:23:35,744 [myid:2] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread id 3 my id = 2
> 2020-03-11 20:23:35,745 [myid:2] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting SendWorker{code}
> The only way I can seem to get them to rejoin the quorum is to restart the
> leader.
> However, if I remove servers 4 and 5 from the configuration of server 1 or 2
> (so only servers 1, 2, and 3 remain in the configuration file), then they can
> rejoin the quorum fine. Is this expected, or am I doing something wrong? Any
> help or explanation would be greatly appreciated. Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)