[
https://issues.apache.org/jira/browse/ZOOKEEPER-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864631#comment-17864631
]
luoxin commented on ZOOKEEPER-4724:
-----------------------------------
The inconsistency between the two nodes' server lists might be causing the
problem. When server-1 becomes the leader, it synchronizes its current server
list to server-2:
server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
Upon receiving the updated server list, server-2 identifies server-1 as the
leader. Consequently, server-2 restarts the election and attempts to connect to
the leader (server-1) at the new address, 0.0.0.0:2888. That address does not
route to server-1, so the connection is refused.
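
To illustrate the point, here is a minimal sketch (not ZooKeeper code) that
scans server entries like the ones above and flags any whose host is the
wildcard address. The class and method names are hypothetical, and the parsing
assumes the simple IPv4 "server.N=host:port:port..." form shown in this issue.
A wildcard address is only meaningful for binding on the local node, so an
entry like this is unusable once it has been synchronized to a different peer.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;

// Hypothetical helper for illustration only; not part of ZooKeeper.
class WildcardLeaderAddressCheck {

    // Returns the host part of a "server.N=host:quorumPort:electionPort..." entry.
    // Assumption: the host is a hostname or an IPv4 literal (no bracketed IPv6).
    static String hostOf(String serverLine) {
        String value = serverLine.substring(serverLine.indexOf('=') + 1);
        return value.substring(0, value.indexOf(':'));
    }

    // True if the host is the IPv4 wildcard address (0.0.0.0).
    static boolean isWildcardHost(String host) {
        // Only parse strings that look like IPv4 literals, so hostnames
        // never trigger a DNS lookup here.
        if (!host.matches("[0-9.]+")) {
            return false;
        }
        try {
            return InetAddress.getByName(host).isAnyLocalAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The server list that server-1 synchronizes to server-2 (from this issue).
        List<String> synced = List.of(
            "server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181",
            "server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181");
        for (String line : synced) {
            String host = hostOf(line);
            if (isWildcardHost(host)) {
                System.out.println("not reachable from a remote peer: " + line);
            }
        }
    }
}
```

Run against the list above, only the server.1 entry is flagged, which is the
address server-2 then tries to use for the leader.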
> follower can't connect to the right leader and quorum failed to form
> --------------------------------------------------------------------
>
> Key: ZOOKEEPER-4724
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4724
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.6.4
> Reporter: Luke Chen
> Priority: Major
>
> When entering the "following - discovery" state, the follower connects to the
> leader node to reach a quorum. Recently, however, a user hit an issue where
> the follower could not connect to the right leader and the quorum failed to
> form. From the log, I can see the follower trying to connect to itself
> (0.0.0.0:2888) instead of the leader. After 5 retries, a new election
> starts and the same thing happens again: the node becomes a follower and
> tries to connect to itself, over and over.
>
> The log is like this:
> {code:java}
> 2023-07-25 06:47:54,982 INFO FOLLOWING - LEADER ELECTION TOOK - 9802 MS
> (org.apache.zookeeper.server.quorum.Learner)
> [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
> 2023-07-25 06:47:54,983 INFO Peer state changed: following - discovery
> (org.apache.zookeeper.server.quorum.QuorumPeer)
> [QuorumPeer[myid=1](plain=127.0.0.1:12181)(secure=[0:0:0:0:0:0:0:0]:2181)]
> 2023-07-25 06:47:54,984 WARN Unexpected exception, tries=0, remaining init
> limit=10000, connecting to /0.0.0.0:2888
> (org.apache.zookeeper.server.quorum.Learner) [LeaderConnector-/0.0.0.0:2888]
> java.net.ConnectException: Connection refused
> at java.base/sun.nio.ch.Net.pollConnect(Native Method)
> at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
> at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
> at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
> at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
> at java.base/java.net.Socket.connect(Socket.java:633)
> at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:304)
> at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:292)
> at org.apache.zookeeper.server.quorum.Learner$LeaderConnector.connectToLeader(Learner.java:408)
> at org.apache.zookeeper.server.quorum.Learner$LeaderConnector.run(Learner.java:366){code}
>
> One thing I found is that this issue happened after "Restarting leader
> election" on the follower node. I am not sure whether it is related.
>
> I wondered whether it is a race condition between the "restarting leader
> election" step (which resets the vote to itself) and the vote update. But as
> mentioned above, the issue keeps happening after the next round of leader
> election.
>
> *The configuration and setup:*
> # 2 ZooKeeper nodes
> # On each ZooKeeper node, we set its own address to 0.0.0.0 to work around
> the slow DNS issue in k8s (i.e. ZOOKEEPER-4708). That is,
> For node 1, we have:
> {code:java}
> server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
> server.2=dev-dev2-zookeeper-1.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181{code}
> For node 2, we have:
> {code:java}
> server.1=dev-dev2-zookeeper-0.dev-dev2-zookeeper-nodes.kafka.svc:2888:3888:participant;127.0.0.1:12181
> server.2=0.0.0.0:2888:3888:participant;127.0.0.1:12181 {code}
> Logs:
> [zookeeper-custom-image-rep1.txt|https://github.com/strimzi/strimzi-kafka-operator/files/12158038/zookeeper-custom-image-rep1.txt]
> [zookeeper-custom-image-rep2.txt|https://github.com/strimzi/strimzi-kafka-operator/files/12158039/zookeeper-custom-image-rep2.txt]