Thanks Luke! I have applied your PRs on a ZooKeeper 3.6.4 codebase and they both fix the issues I was facing with Kubernetes-based environments (which again weren't happening until ZooKeeper 3.6.3), I approved your them but, of course, my approvals are not "binding". It would be great to have some ZooKeeper maintainers/committers jumping into this discussion, reviewing the PRs and provide feedback.
Thanks, Paolo. On Wed, 2 Aug 2023 at 10:03, Luke Chen <[email protected]> wrote: > Hi all, > > We've identified some issues related to the problem, and opened 2 PRs: > https://github.com/apache/zookeeper/pull/2040 > https://github.com/apache/zookeeper/pull/2041 > > We'd like to get your feedback on these PRs. > Please take a look when available. > > Thank you. > Luke > > > On Tue, Jun 27, 2023 at 1:34 PM Paolo Patierno <[email protected]> > wrote: > > > Hi all, > > we are still facing this issue and I opened > > https://issues.apache.org/jira/browse/ZOOKEEPER-4708 > > I went through git bisect to try understanding where the "problem" > started > > to happen between 3.6.3 and 3.6.4. > > To avoid repeating myself you can find details on the JIRA ticket. > > Any help is really appreciated :-) > > > > Thanks, > > Paolo > > > > On Tue, 20 Jun 2023 at 08:46, Paolo Patierno <[email protected]> > > wrote: > > > > > Hi Enrico, > > > we are working on the Strimzi project (deploying Apache Kafka on > > > Kubernetes, so together with ZooKeeper). > > > It has been working fine until Apache Kafka was using ZooKeeper 3.6.3 > (or > > > any other previous version). > > > With 3.6.4 we are facing the issue I described. > > > > > > Thanks, > > > Paolo > > > > > > On Mon, 19 Jun 2023 at 23:29, Enrico Olivelli <[email protected]> > > wrote: > > > > > >> Paolo, > > >> > > >> Il Lun 19 Giu 2023, 16:43 Paolo Patierno <[email protected]> > ha > > >> scritto: > > >> > > >> > Hi all, > > >> > We were able to overcome the binding issue by setting > > >> > quorumListenOnAllIPs=true but from there we are getting a new issue > > >> that is > > >> > preventing leader election completion on first start-up. > > >> > > > >> > Getting the log of the current ZooKeeper leader (ID=3) we see the > > >> > following. > > >> > (Starting with ** you can see some additional logs added to > > >> > org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader in > order > > >> to > > >> > get more information.) > > >> > > > >> > 2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, > > >> 3],[1, > > >> > 3]]; starting up and setting last processed zxid: 0x100000000 > > >> > (org.apache.zookeeper.server.quorum.Leader) > > >> > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > >> > 2023-06-19 12:32:51,990 INFO ** > > >> > > > >> > > > >> > > > newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr > > >> > = my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/ > > >> > 172.17.0.6:2888 (org.apache.zookeeper.server.quorum.Leader) > > >> > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > >> > 2023-06-19 12:32:51,990 INFO ** self.getQuorumAddress() = > > >> > > > >> > > > >> > > > my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:2888 > > >> > (org.apache.zookeeper.server.quorum.Leader) > > >> > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > >> > 2023-06-19 12:32:51,992 INFO ** qs.addr > > >> > my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/ > > >> > 172.17.0.6:2888, qs.electionAddr > > >> > my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/ > > >> > 172.17.0.6:3888, qs.clientAddr/127.0.0.1:12181 > > >> > (org.apache.zookeeper.server.quorum.QuorumPeer) > > >> > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > >> > 2023-06-19 12:32:51,992 DEBUG zookeeper > > >> > (org.apache.zookeeper.common.PathTrie) > > >> > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > >> > 2023-06-19 12:32:51,993 WARN Restarting Leader Election > > >> > (org.apache.zookeeper.server.quorum.QuorumPeer) > > >> > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > >> > > > >> > So the leader is ZooKeeper with ID=3 and it was ACKed by the > ZooKeeper > > >> node > > >> > ID=1. > > >> > As you can see we are in the Leader#startZkServer method, and > because > > of > > >> > the reconfiguration enabled, the designatedLeader is processed. The > > >> problem > > >> > is that the Leader#getDesignatedLeader is not returning “self” as > > leader > > >> > but another one (ID=1), because of the difference in the quorum > > address. > > >> > From the above log, it’s not an actual difference in terms of > > addresses > > >> but > > >> > the self.getQuorumAddress() is returning an <unresolved> one (even > if > > >> it’s > > >> > still the same hostname related to ZooKeeper-2 instance). This > > >> difference > > >> > causes the allowedToCommit=false, meanwhile the ZooKeeper-2 is still > > >> > reported as leader but it’s not able to commit, so prevents any > > requests > > >> > and the ZooKeeper ensemble gets stuck. > > >> > > > >> > 2023-06-19 12:32:51,996 WARN Suggested leader: 1 > > >> > (org.apache.zookeeper.server.quorum.QuorumPeer) > > >> > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > >> > 2023-06-19 12:32:51,996 WARN This leader is not the designated > leader, > > >> it > > >> > will be initialized with allowedToCommit = false > > >> > (org.apache.zookeeper.server.quorum.Leader) > > >> > [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)] > > >> > > > >> > The overall issue could be related to DNS problems, with DNS records > > not > > >> > registered yet during pod initialization (where ZooKeeper is running > > on > > >> > Kubernetes). But we don’t understand why it’s not able to recover > > >> somehow. > > >> > > > >> > Instead of using the quorumListenOnAllIPs=true we also tried a > > different > > >> > approach by using the 0.0.0.0 address for the binding, so something > > >> like: > > >> > > > >> > # Zookeeper nodes configuration > > >> > server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181 > > >> > > > >> > > > >> > > > server.2=my-cluster-zookeeper-1.my-cluster-zookeeper-nodes.default.svc:2888:3888:participant; > > >> > 127.0.0.1:12181 > > >> > > > >> > > > >> > > > server.3=my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc:2888:3888:participant; > > >> > 127.0.0.1:12181 > > >> > > > >> > This way, the self.getQuorumAddress() is not suffering the same > > >> problem, it > > >> > doesn’t return an <unresolved> address but always an actual one. No > > new > > >> > leader election is needed and everything works fine. > > >> > > > >> > > >> This is the release notes page for 3.6.4. > > >> https://zookeeper.apache.org/doc/r3.6.4/releasenotes.html > > >> > > >> As you are running on k8s, I guess you are using a statefulset, maybe > > with > > >> a service with ClusterIP? > > >> > > >> Is the readyness probe failing? In that case the dns name should not > be > > >> available > > >> > > >> What are you using to perform the probe? Are you using the four > letters > > >> words API ? (ruok) > > >> > > >> > > >> Enrico > > >> > > >> > > >> > On 2023/06/14 06:19:24 Szalay-Bekő Máté wrote: > > >> > > Interesting... > > >> > > > > >> > > I am not familiar with strimzi.io. > > >> > > Quickly checking the release notes, I don't see anything > suspicious: > > >> > > https://zookeeper.apache.org/doc/r3.6.4/releasenotes.html > > >> > > Also, QuorumCnxManager was not changed for 2+ years on branch 3.6. > > >> > > > > >> > > Are you use the same java version and zookeeper config for 3.6.3 > and > > >> > 3.6.4? > > >> > > Can you share the zookeeper config? > > >> > > > > >> > > Also: zookeeper 3.6 is deprecated since december 2022. Can you > > >> reproduce > > >> > > the issue on newer ZooKeeper versions? > > >> > > > > >> > > best regards, > > >> > > Máté > > >> > > > > >> > > On Tue, Jun 13, 2023 at 10:16 AM Luke Chen <[email protected]> > wrote: > > >> > > > > >> > > > Hi all, > > >> > > > > > >> > > > We're running zookeeper under minikube using strimzi < > > >> > https://strimzi.io/ > > >> > > > >. > > >> > > > The zookeeper works well while running with ZK v3.6.3. But when > we > > >> > upgraded > > >> > > > to v3.6.4, we encountered hostname unresolved issue. I'm > wondering > > >> if > > >> > this > > >> > > > is a regression that some changes between v3.6.3 and v3.6.4 > cause > > >> this > > >> > > > issue? > > >> > > > > > >> > > > Logs: > > >> > > > ==== > > >> > > > 2023-06-12 12:25:38,149 INFO binding to port /127.0.0.1:12181 > > >> > > > (org.apache.zookeeper.server.NettyServerCnxnFactory) [main] > > >> > > > 2023-06-12 12:25:38,194 INFO bound to port 12181 > > >> > > > (org.apache.zookeeper.server.NettyServerCnxnFactory) [main] > > >> > > > 2023-06-12 12:25:38,194 INFO binding to port > 0.0.0.0/0.0.0.0:2181 > > >> > > > (org.apache.zookeeper.server.NettyServerCnxnFactory) [main] > > >> > > > 2023-06-12 12:25:38,195 INFO bound to port 2181 > > >> > > > (org.apache.zookeeper.server.NettyServerCnxnFactory) [main] > > >> > > > 2023-06-12 12:25:38,195 INFO Using 4000ms as the quorum cnxn > > socket > > >> > timeout > > >> > > > (org.apache.zookeeper.server.quorum.QuorumPeer) [main] > > >> > > > 2023-06-12 12:25:38,199 INFO Election port bind maximum retries > is > > >> > infinite > > >> > > > (org.apache.zookeeper.server.quorum.QuorumCnxManager) [main] > > >> > > > 2023-06-12 12:25:38,201 INFO Creating TLS-only quorum server > > socket > > >> > > > (org.apache.zookeeper.server.quorum.QuorumCnxManager) > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:3888] > > >> > > > 2023-06-12 12:25:38,202 INFO ZooKeeper audit is disabled. > > >> > > > (org.apache.zookeeper.audit.ZKAuditProvider) [main] > > >> > > > 2023-06-12 12:25:38,202 ERROR Exception while listening > > >> > > > (org.apache.zookeeper.server.quorum.QuorumCnxManager) > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:3888] > > >> > > > java.net.SocketException: Unresolved address > > >> > > > at java.base/java.net.ServerSocket.bind(ServerSocket.java:380) > > >> > > > at java.base/java.net.ServerSocket.bind(ServerSocket.java:342) > > >> > > > at > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.createNewServerSocket(QuorumCnxManager.java:1135) > > >> > > > at > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.acceptConnections(QuorumCnxManager.java:1064) > > >> > > > at > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.run(QuorumCnxManager.java:1033) > > >> > > > at > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) > > >> > > > at > > >> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > > >> > > > at > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > > >> > > > at > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > > >> > > > at java.base/java.lang.Thread.run(Thread.java:833) > > >> > > > > > >> > > > ==== > > >> > > > > > >> > > > Any thoughts or suggestions are welcomed. > > >> > > > > > >> > > > Thank you. > > >> > > > Luke > > >> > > > > > >> > > > > >> > > > >> > > > > > > > > > -- > > > Paolo Patierno > > > > > > *Senior Principal Software Engineer @ Red Hat**Microsoft MVP on > **Azure* > > > > > > Twitter : @ppatierno <http://twitter.com/ppatierno> > > > Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno> > > > GitHub : ppatierno <https://github.com/ppatierno> > > > > > > > > > -- > > Paolo Patierno > > > > *Senior Principal Software Engineer @ Red Hat**Microsoft MVP on **Azure* > > > > Twitter : @ppatierno <http://twitter.com/ppatierno> > > Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno> > > GitHub : ppatierno <https://github.com/ppatierno> > > > -- Paolo Patierno *Senior Principal Software Engineer @ Red Hat**Microsoft MVP on **Azure* Twitter : @ppatierno <http://twitter.com/ppatierno> Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno> GitHub : ppatierno <https://github.com/ppatierno>
