Thanks Luke!

I have applied your PRs on a ZooKeeper 3.6.4 codebase and they both fix the
issues I was facing in Kubernetes-based environments (which, again, were not
happening up to ZooKeeper 3.6.3).
I approved them but, of course, my approvals are not "binding".
It would be great to have some ZooKeeper maintainers/committers jump
into this discussion, review the PRs and provide feedback.

Thanks,
Paolo.

On Wed, 2 Aug 2023 at 10:03, Luke Chen <[email protected]> wrote:

> Hi all,
>
> We've identified some issues related to the problem, and opened 2 PRs:
> https://github.com/apache/zookeeper/pull/2040
> https://github.com/apache/zookeeper/pull/2041
>
> We'd like to get your feedback on these PRs.
> Please take a look when available.
>
> Thank you.
> Luke
>
>
> On Tue, Jun 27, 2023 at 1:34 PM Paolo Patierno <[email protected]>
> wrote:
>
> > Hi all,
> > we are still facing this issue and I opened
> > https://issues.apache.org/jira/browse/ZOOKEEPER-4708
> > I went through git bisect to try to understand where the "problem" started
> > to happen between 3.6.3 and 3.6.4.
> > To avoid repeating myself, you can find the details in the JIRA ticket.
> > Any help is really appreciated :-)
> >
> > Thanks,
> > Paolo
> >
> > On Tue, 20 Jun 2023 at 08:46, Paolo Patierno <[email protected]>
> > wrote:
> >
> > > Hi Enrico,
> > > we are working on the Strimzi project (deploying Apache Kafka on
> > > Kubernetes, so together with ZooKeeper).
> > > It was working fine as long as Apache Kafka was using ZooKeeper 3.6.3 (or
> > > any earlier version).
> > > With 3.6.4 we are facing the issue I described.
> > >
> > > Thanks,
> > > Paolo
> > >
> > > On Mon, 19 Jun 2023 at 23:29, Enrico Olivelli <[email protected]>
> > > wrote:
> > >
> > >> Paolo,
> > >>
> > >> On Mon, 19 Jun 2023, 16:43 Paolo Patierno <[email protected]>
> > >> wrote:
> > >>
> > >> > Hi all,
> > >> > We were able to overcome the binding issue by setting
> > >> > quorumListenOnAllIPs=true, but from there we are getting a new issue
> > >> > that prevents leader election from completing on first start-up.
> > >> >
> > >> > Looking at the log of the current ZooKeeper leader (ID=3), we see the
> > >> > following.
> > >> > (The lines starting with ** are additional logs we added to
> > >> > org.apache.zookeeper.server.quorum.Leader#getDesignatedLeader in order
> > >> > to get more information.)
> > >> >
> > >> > 2023-06-19 12:32:51,990 INFO Have quorum of supporters, sids: [[1, 3],[1, 3]]; starting up and setting last processed zxid: 0x100000000 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> > >> > 2023-06-19 12:32:51,990 INFO ** newQVAcksetPair.getQuorumVerifier().getVotingMembers().get(self.getId()).addr = my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> > >> > 2023-06-19 12:32:51,990 INFO ** self.getQuorumAddress() = my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:2888 (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> > >> > 2023-06-19 12:32:51,992 INFO ** qs.addr my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:2888, qs.electionAddr my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/172.17.0.6:3888, qs.clientAddr/127.0.0.1:12181 (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> > >> > 2023-06-19 12:32:51,992 DEBUG zookeeper (org.apache.zookeeper.common.PathTrie) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> > >> > 2023-06-19 12:32:51,993 WARN Restarting Leader Election (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> > >> >
> > >> > So the leader is the ZooKeeper node with ID=3 and it was ACKed by the
> > >> > ZooKeeper node with ID=1.
> > >> > As you can see, we are in the Leader#startZkServer method and, because
> > >> > reconfiguration is enabled, the designatedLeader is processed. The
> > >> > problem is that Leader#getDesignatedLeader is not returning "self" as
> > >> > the leader but another node (ID=1), because of the difference in the
> > >> > quorum address.
> > >> > From the log above, the addresses do not actually differ, but
> > >> > self.getQuorumAddress() returns an <unresolved> one (even though it is
> > >> > still the same hostname, the one of the ZooKeeper-2 instance). This
> > >> > difference causes allowedToCommit=false; meanwhile, ZooKeeper-2 is
> > >> > still reported as the leader but is not able to commit, so it blocks
> > >> > any request and the ZooKeeper ensemble gets stuck.
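The mismatch described above can be reproduced with plain java.net classes, outside ZooKeeper. This is only a minimal sketch: it assumes the comparison ends up relying on InetSocketAddress equality, which the thread does not confirm. The class name QuorumAddressMismatch and its helper methods are hypothetical; the host name and IP are copied from the log lines above.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class QuorumAddressMismatch {

    // Host name and IP taken from the log lines in this thread.
    static final String HOST = "my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc";
    static final byte[] IP = {(byte) 172, 17, 0, 6};

    /** Address as it looks once DNS has answered: host name plus IP. */
    public static InetSocketAddress resolved(int port) {
        try {
            // getByAddress attaches the name to the IP without doing a DNS lookup.
            return new InetSocketAddress(InetAddress.getByAddress(HOST, IP), port);
        } catch (UnknownHostException e) {
            // Only thrown when the byte array has an illegal length.
            throw new IllegalStateException(e);
        }
    }

    /** Address as it looks when the lookup failed at creation time. */
    public static InetSocketAddress unresolved(int port) {
        return InetSocketAddress.createUnresolved(HOST, port);
    }

    public static void main(String[] args) {
        InetSocketAddress fromConfig = resolved(2888);
        InetSocketAddress fromSelf = unresolved(2888);

        // Both carry the same host name and port...
        System.out.println(fromConfig.getHostString().equals(fromSelf.getHostString()));
        // ...but a resolved and an unresolved address never compare as equal.
        System.out.println(fromConfig.equals(fromSelf));
    }
}
```

If the designated-leader check compares addresses this way, a node whose own address is still unresolved would not recognise itself in the voting members, which matches the symptom in the log.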
> > >> >
> > >> > 2023-06-19 12:32:51,996 WARN Suggested leader: 1 (org.apache.zookeeper.server.quorum.QuorumPeer) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> > >> > 2023-06-19 12:32:51,996 WARN This leader is not the designated leader, it will be initialized with allowedToCommit = false (org.apache.zookeeper.server.quorum.Leader) [QuorumPeer[myid=3](plain=127.0.0.1:12181)(secure=0.0.0.0:2181)]
> > >> >
> > >> > The overall issue could be related to DNS problems, with DNS records
> > >> > not registered yet during pod initialization (ZooKeeper is running on
> > >> > Kubernetes). But we don't understand why it is not able to recover
> > >> > somehow.
> > >> >
> > >> > Instead of using quorumListenOnAllIPs=true, we also tried a different
> > >> > approach, using the 0.0.0.0 address for the binding, so something
> > >> > like:
> > >> >
> > >> > # Zookeeper nodes configuration
> > >> > server.1=0.0.0.0:2888:3888:participant;127.0.0.1:12181
> > >> > server.2=my-cluster-zookeeper-1.my-cluster-zookeeper-nodes.default.svc:2888:3888:participant;127.0.0.1:12181
> > >> > server.3=my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc:2888:3888:participant;127.0.0.1:12181
> > >> >
> > >> > This way, self.getQuorumAddress() does not suffer from the same
> > >> > problem: it never returns an <unresolved> address but always an actual
> > >> > one. No new leader election is needed and everything works fine.
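A plausible reason why the 0.0.0.0 variant never shows <unresolved> (not confirmed in the thread) is that an IP literal needs no DNS lookup, so the address is resolved from the moment it is created. A minimal sketch with plain java.net classes; WildcardAddress and parse are hypothetical names, and how ZooKeeper actually parses the server.N lines is not shown here.

```java
import java.net.InetSocketAddress;

public class WildcardAddress {

    /** Builds a socket address from a host (name or IP literal) and a port. */
    public static InetSocketAddress parse(String host, int port) {
        // For an IP literal only the format is validated; no DNS query is made.
        return new InetSocketAddress(host, port);
    }

    public static void main(String[] args) {
        InetSocketAddress quorum = parse("0.0.0.0", 2888);

        // The literal is resolved immediately, independent of DNS state.
        System.out.println(quorum.isUnresolved());
        // 0.0.0.0 is the wildcard address, i.e. "listen on all interfaces".
        System.out.println(quorum.getAddress().isAnyLocalAddress());
    }
}
```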
> > >> >
> > >>
> > >> This is the release notes page for 3.6.4.
> > >> https://zookeeper.apache.org/doc/r3.6.4/releasenotes.html
> > >>
> > >> As you are running on k8s, I guess you are using a StatefulSet, maybe
> > >> with a service with ClusterIP?
> > >>
> > >> Is the readiness probe failing? In that case the DNS name should not be
> > >> available.
> > >>
> > >> What are you using to perform the probe? Are you using the four-letter
> > >> words API (ruok)?
> > >>
> > >>
> > >> Enrico
> > >>
> > >>
> > >> > On 2023/06/14 06:19:24 Szalay-Bekő Máté wrote:
> > >> > > Interesting...
> > >> > >
> > >> > > I am not familiar with strimzi.io.
> > >> > > Quickly checking the release notes, I don't see anything suspicious:
> > >> > > https://zookeeper.apache.org/doc/r3.6.4/releasenotes.html
> > >> > > Also, QuorumCnxManager has not been changed for 2+ years on branch 3.6.
> > >> > >
> > >> > > Are you using the same Java version and ZooKeeper config for 3.6.3
> > >> > > and 3.6.4?
> > >> > > Can you share the ZooKeeper config?
> > >> > >
> > >> > > Also: ZooKeeper 3.6 has been deprecated since December 2022. Can you
> > >> > > reproduce the issue on newer ZooKeeper versions?
> > >> > >
> > >> > > best regards,
> > >> > > Máté
> > >> > >
> > >> > > On Tue, Jun 13, 2023 at 10:16 AM Luke Chen <[email protected]>
> > >> > > wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > > We're running ZooKeeper under minikube using Strimzi
> > >> > > > <https://strimzi.io/>.
> > >> > > > ZooKeeper works well with ZK v3.6.3. But when we upgraded to
> > >> > > > v3.6.4, we encountered an unresolved hostname issue. I'm wondering
> > >> > > > whether this is a regression caused by some change between v3.6.3
> > >> > > > and v3.6.4.
> > >> > > >
> > >> > > > Logs:
> > >> > > > ====
> > >> > > > 2023-06-12 12:25:38,149 INFO binding to port /127.0.0.1:12181 (org.apache.zookeeper.server.NettyServerCnxnFactory) [main]
> > >> > > > 2023-06-12 12:25:38,194 INFO bound to port 12181 (org.apache.zookeeper.server.NettyServerCnxnFactory) [main]
> > >> > > > 2023-06-12 12:25:38,194 INFO binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NettyServerCnxnFactory) [main]
> > >> > > > 2023-06-12 12:25:38,195 INFO bound to port 2181 (org.apache.zookeeper.server.NettyServerCnxnFactory) [main]
> > >> > > > 2023-06-12 12:25:38,195 INFO Using 4000ms as the quorum cnxn socket timeout (org.apache.zookeeper.server.quorum.QuorumPeer) [main]
> > >> > > > 2023-06-12 12:25:38,199 INFO Election port bind maximum retries is infinite (org.apache.zookeeper.server.quorum.QuorumCnxManager) [main]
> > >> > > > 2023-06-12 12:25:38,201 INFO Creating TLS-only quorum server socket (org.apache.zookeeper.server.quorum.QuorumCnxManager) [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:3888]
> > >> > > > 2023-06-12 12:25:38,202 INFO ZooKeeper audit is disabled. (org.apache.zookeeper.audit.ZKAuditProvider) [main]
> > >> > > > 2023-06-12 12:25:38,202 ERROR Exception while listening (org.apache.zookeeper.server.quorum.QuorumCnxManager) [ListenerHandler-my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc/<unresolved>:3888]
> > >> > > > java.net.SocketException: Unresolved address
> > >> > > > at java.base/java.net.ServerSocket.bind(ServerSocket.java:380)
> > >> > > > at java.base/java.net.ServerSocket.bind(ServerSocket.java:342)
> > >> > > > at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.createNewServerSocket(QuorumCnxManager.java:1135)
> > >> > > > at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.acceptConnections(QuorumCnxManager.java:1064)
> > >> > > > at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.run(QuorumCnxManager.java:1033)
> > >> > > > at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
> > >> > > > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> > >> > > > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> > >> > > > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> > >> > > > at java.base/java.lang.Thread.run(Thread.java:833)
> > >> > > >
> > >> > > > ====
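The exception at the top of the stack trace comes from java.net.ServerSocket.bind itself, which rejects any address that has no IP attached. A minimal sketch that triggers the same failure without ZooKeeper; UnresolvedBind and tryBind are hypothetical names, and the host name is the one from the log.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class UnresolvedBind {

    /** Tries to bind a server socket; returns the failure message, or null on success. */
    public static String tryBind(InetSocketAddress addr) {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(addr);
            return null;
        } catch (IOException e) {
            // SocketException is a subclass of IOException.
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Simulates a host name whose DNS record was not available at creation time.
        InetSocketAddress election = InetSocketAddress.createUnresolved(
                "my-cluster-zookeeper-2.my-cluster-zookeeper-nodes.default.svc", 3888);
        System.out.println(tryBind(election));
    }
}
```

The check happens before any system call, so the bind fails even if the DNS record has become available in the meantime; the address object has to be created again to pick it up.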
> > >> > > >
> > >> > > > Any thoughts or suggestions are welcomed.
> > >> > > >
> > >> > > > Thank you.
> > >> > > > Luke
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> > > --
> > > Paolo Patierno
> > >
> > > > *Senior Principal Software Engineer @ Red Hat**Microsoft MVP on **Azure*
> > >
> > > Twitter : @ppatierno <http://twitter.com/ppatierno>
> > > Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno>
> > > GitHub : ppatierno <https://github.com/ppatierno>
> > >
> >
> >
> >
>


-- 
Paolo Patierno

*Senior Principal Software Engineer @ Red Hat**Microsoft MVP on **Azure*

Twitter : @ppatierno <http://twitter.com/ppatierno>
Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno>
GitHub : ppatierno <https://github.com/ppatierno>
