Tifenn LE GOFF created ZOOKEEPER-4371: -----------------------------------------
Summary: False ID conflict when ZK try to connect to cluster Key: ZOOKEEPER-4371 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4371 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.6.2 Environment: We are on Zookeeper 3.6.2 on Docker with official image. It's can be reproduce with something like {code:java} docker run --add-host=zk_fqdn:zk_ip --ulimit nofile=64000:64000 -p ip_zk:2181:2181 -p ip_zk:2888:2888 -p ip_zk:3888:3888 -p ip_zk:7000:7000 -p ip_zk:8080:8080 -v /data/zookeeper/data:/data -v /data/zookeeper/log:/datalog --hostname "zk_fqdn" --env-file "/data/zookeeper/conf/zk.env"--name zookeeper zookeeper:3.6.2 {code} with {code:java} ZOO_MY_ID=zk_id ZOO_INIT_LIMIT=10 ZOO_SYNC_LIMIT=5 ZOO_MAX_CLIENT_CNXNS=0 ZOO_4LW_COMMANDS_WHITELIST=stat,mntr,conf,ruok ZOO_STANDALONE_ENABLED=False ZOO_CFG_EXTRA=metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider metricsProvider.httpPort=7000 metricsProvider.exportJvmInfo=true ZOO_SERVERS=server.1=zk1_fqdn:2888:3888;2181 server.2=zk2_fqdn:2888:3888;2181 server.3=zk3_fqdn:2888:3888;2181 {code} Reporter: Tifenn LE GOFF Some ZK cannot join cluster after moment {code:java} echo stat|nc $HOSTNAME 2181 This ZooKeeper instance is not currently serving requests {code} We have 3 ZK, zk1 with id1, zk2 with id2 and zk3 with id3. ZK2 and ZK3 are already running. When ZK1 connect to ZK, we have {code:java} 2021-09-07 13:33:09,585 [myid:1] - INFO [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@979] - Notification time out: 60000 2021-09-07 13:33:09,586 [myid:1] - INFO [QuorumConnectionThread-[myid=1]-55:QuorumCnxManager@513] - Have smaller server identifier, so dropping the connection: (myId:1 --> sid:2) 2021-09-07 13:33:09,586 [myid:1] - INFO [QuorumConnectionThread-[myid=1]-56:QuorumCnxManager@513] - Have smaller server identifier, so dropping the connection: (myId:1 --> sid:3) 2021-09-07 13:33:30,269 [myid:1] - WARN [NIOWorkerThread-1:NIOServerCnxn@373] - Close of session 0x0 java.io.IOException: ZooKeeperServer not running at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544) at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332) at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522) at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) 2021-09-07 13:33:30,941 [myid:1] - WARN [NIOWorkerThread-2:NIOServerCnxn@373] - Close of session 0x0 java.io.IOException: ZooKeeperServer not running at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544) at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332) at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522) at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$W{code} and on zk2 (actual leader), we have {code:java} 2021-09-07 13:33:09,587 [myid:2] - INFO [ListenerHandler-fqdn-zk2/172.17.0.2:3888:QuorumCnxManager$Listener$ListenerHandler@1070] - Received connection request from /ip-zk1:53102 2021-09-07 13:33:09,588 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@1281] - Interrupted while waiting for message on queue java.lang.InterruptedException at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown Source) at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source) at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105) at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1446) at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:98) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1270) 2021-09-07 13:33:09,588 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@1293] - Send worker leaving thread id 1 my id = 2 2021-09-07 13:33:09,588 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@1395] - Connection broken for id 1, my id = 2 java.net.SocketException: Socket closed at java.base/java.net.SocketInputStream.socketRead0(Native Method) at java.base/java.net.SocketInputStream.socketRead(Unknown Source) at java.base/java.net.SocketInputStream.read(Unknown Source) at java.base/java.net.SocketInputStream.read(Unknown Source) at java.base/java.io.BufferedInputStream.fill(Unknown Source) at java.base/java.io.BufferedInputStream.read(Unknown Source) at java.base/java.io.DataInputStream.readInt(Unknown Source) at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1383) 2021-09-07 13:33:09,589 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@1401] - Interrupting SendWorker thread from RecvWorker. sid: 1. myId: 2 2021-09-07 13:33:09,589 [myid:2] - INFO [ListenerHandler-fqdn-zk2/172.17.0.2:3888:QuorumCnxManager$Listener$ListenerHandler@1070] - Received connection request from /172.17.0.2:35380 2021-09-07 13:33:09,590 [myid:2] - WARN [ListenerHandler-fqdn-zk2/172.17.0.2:3888:QuorumCnxManager@662] - We got a connection request from a server with our own ID. This should be either a configuration error, or a bug. {code} If we restart leader, it works. This issue happen very often since we have migrate our ZK services on docker instances. -- This message was sent by Atlassian Jira (v8.3.4#803005)