Hi Enrico, we are working on the Strimzi project (deploying Apache Kafka on Kubernetes, so together with ZooKeeper). It has been working fine until Apache Kafka was using ZooKeeper 3.6.3 (or any other previous version). With 3.6.4 we are facing the issue I described.
Thanks, Paolo On Mon, 19 Jun 2023 at 23:29, Enrico Olivelli <eolive...@gmail.com> wrote: > Paolo, > > Il Lun 19 Giu 2023, 16:43 Paolo Patierno <paolo.patie...@gmail.com> ha > scritto: > > > Hi all, > > We were able to overcome the binding issue by setting > > quorumListenOnAllIPs=true but from there we are getting a new issue that > is > > preventing leader election completion on first start-up. > > > > Getting the log of the current ZooKeeper leader (ID=3) we see the > > following. > > (Starting with ** you can see some additional logs added to > > org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader in order to > > get more information.) > > > > 2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, 3],[1, > > 3]]; starting up and setting last processed zxid: 0x100000000 > > (org.apache.zookeeper.server.quorum.Leader) > > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > 2023-06-19 12:32:51,990 INFO ** > > > > > newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr > > = my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/ > > 172.17.0.6:2888 (org.apache.zookeeper.server.quorum.Leader) > > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > 2023-06-19 12:32:51,990 INFO ** self.getQuorumAddress() = > > > > > my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:2888 > > (org.apache.zookeeper.server.quorum.Leader) > > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > 2023-06-19 12:32:51,992 INFO ** qs.addr > > my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/ > > 172.17.0.6:2888, qs.electionAddr > > my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/ > > 172.17.0.6:3888, qs.clientAddr/127.0.0.1:12181 > > (org.apache.zookeeper.server.quorum.QuorumPeer) > > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > 2023-06-19 12:32:51,992 DEBUG zookeeper > > (org.apache.zookeeper.common.PathTrie) > > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > 2023-06-19 12:32:51,993 WARN Restarting Leader Election > > (org.apache.zookeeper.server.quorum.QuorumPeer) > > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > > > So the leader is ZooKeeper with ID=3 and it was ACKed by the ZooKeeper > node > > ID=1. > > As you can see we are in the Leader#startZkServer method, and because of > > the reconfiguration enabled, the designatedLeader is processed. The > problem > > is that the Leader#getDesignatedLeader is not returning “self” as leader > > but another one (ID=1), because of the difference in the quorum address. > > From the above log, it’s not an actual difference in terms of addresses > but > > the self.getQuorumAddress() is returning an <unresolved> one (even if > it’s > > still the same hostname related to ZooKeeper-2 instance). This difference > > causes the allowedToCommit=false, meanwhile the ZooKeeper-2 is still > > reported as leader but it’s not able to commit, so prevents any requests > > and the ZooKeeper ensemble gets stuck. > > > > 2023-06-19 12:32:51,996 WARN Suggested leader: 1 > > (org.apache.zookeeper.server.quorum.QuorumPeer) > > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > 2023-06-19 12:32:51,996 WARN This leader is not the designated leader, it > > will be initialized with allowedToCommit = false > > (org.apache.zookeeper.server.quorum.Leader) > > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > > > The overall issue could be related to DNS problems, with DNS records not > > registered yet during pod initialization (where ZooKeeper is running on > > Kubernetes). But we don’t understand why it’s not able to recover > somehow. > > > > Instead of using the quorumListenOnAllIPs=true we also tried a different > > approach by using the 0.0.0.0 address for the binding, so something like: > > > > # Zookeeper nodes configuration > > server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181 > > > > > server.2=my-cluster-zookeeper-1.my-cluster-zookeeper-nodes.default.svc:2888:3888:participant; > > 127.0.0.1:12181 > > > > > server.3=my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc:2888:3888:participant; > > 127.0.0.1:12181 > > > > This way, the self.getQuorumAddress() is not suffering the same problem, > it > > doesn’t return an <unresolved> address but always an actual one. No new > > leader election is needed and everything works fine. > > > > This is the release notes page for 3.6.4. > https://zookeeper.apache.org/doc/r3.6.4/releasenotes.html > > As you are running on k8s, I guess you are using a statefulset, maybe with > a service with ClusterIP? > > Is the readyness probe failing? In that case the dns name should not be > available > > What are you using to perform the probe? Are you using the four letters > words API ? (ruok) > > > Enrico > > > > On 2023/06/14 06:19:24 Szalay-Bekő Máté wrote: > > > Interesting... > > > > > > I am not familiar with strimzi.io. > > > Quickly checking the release notes, I don't see anything suspicious: > > > https://zookeeper.apache.org/doc/r3.6.4/releasenotes.html > > > Also, QuorumCnxManager was not changed for 2+ years on branch 3.6. > > > > > > Are you use the same java version and zookeeper config for 3.6.3 and > > 3.6.4? > > > Can you share the zookeeper config? > > > > > > Also: zookeeper 3.6 is deprecated since december 2022. Can you > reproduce > > > the issue on newer ZooKeeper versions? > > > > > > best regards, > > > Máté > > > > > > On Tue, Jun 13, 2023 at 10:16 AM Luke Chen <sh...@gmail.com> wrote: > > > > > > > Hi all, > > > > > > > > We're running zookeeper under minikube using strimzi < > > https://strimzi.io/ > > > > >. > > > > The zookeeper works well while running with ZK v3.6.3. But when we > > upgraded > > > > to v3.6.4, we encountered hostname unresolved issue. I'm wondering if > > this > > > > is a regression that some changes between v3.6.3 and v3.6.4 cause > this > > > > issue? > > > > > > > > Logs: > > > > ==== > > > > 2023-06-12 12:25:38,149 INFO binding to port /127.0.0.1:12181 > > > > (org.apache.zookeeper.server.NettyServerCnxnFactory) [main] > > > > 2023-06-12 12:25:38,194 INFO bound to port 12181 > > > > (org.apache.zookeeper.server.NettyServerCnxnFactory) [main] > > > > 2023-06-12 12:25:38,194 INFO binding to port 0.0.0.0/0.0.0.0:2181 > > > > (org.apache.zookeeper.server.NettyServerCnxnFactory) [main] > > > > 2023-06-12 12:25:38,195 INFO bound to port 2181 > > > > (org.apache.zookeeper.server.NettyServerCnxnFactory) [main] > > > > 2023-06-12 12:25:38,195 INFO Using 4000ms as the quorum cnxn socket > > timeout > > > > (org.apache.zookeeper.server.quorum.QuorumPeer) [main] > > > > 2023-06-12 12:25:38,199 INFO Election port bind maximum retries is > > infinite > > > > (org.apache.zookeeper.server.quorum.QuorumCnxManager) [main] > > > > 2023-06-12 12:25:38,201 INFO Creating TLS-only quorum server socket > > > > (org.apache.zookeeper.server.quorum.QuorumCnxManager) > > > > > > > > > > > > > [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:3888] > > > > 2023-06-12 12:25:38,202 INFO ZooKeeper audit is disabled. > > > > (org.apache.zookeeper.audit.ZKAuditProvider) [main] > > > > 2023-06-12 12:25:38,202 ERROR Exception while listening > > > > (org.apache.zookeeper.server.quorum.QuorumCnxManager) > > > > > > > > > > > > > [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:3888] > > > > java.net.SocketException: Unresolved address > > > > at java.base/java.net.ServerSocket.bind(ServerSocket.java:380) > > > > at java.base/java.net.ServerSocket.bind(ServerSocket.java:342) > > > > at > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.createNewServerSocket(QuorumCnxManager.java:1135) > > > > at > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.acceptConnections(QuorumCnxManager.java:1064) > > > > at > > > > > > > > > > > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.run(QuorumCnxManager.java:1033) > > > > at > > > > > > > > > > > > > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) > > > > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > > > > at > > > > > > > > > > > > > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > > > > at > > > > > > > > > > > > > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > > > > at java.base/java.lang.Thread.run(Thread.java:833) > > > > > > > > ==== > > > > > > > > Any thoughts or suggestions are welcomed. > > > > > > > > Thank you. > > > > Luke > > > > > > > > > > -- Paolo Patierno *Senior Principal Software Engineer @ Red Hat**Microsoft MVP on **Azure* Twitter : @ppatierno <http://twitter.com/ppatierno> Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno> GitHub : ppatierno <https://github.com/ppatierno>