Re: 3-node Zookeeper ensemble unable to recover if leader fails

João Silva Wed, 24 May 2023 01:24:55 -0700

Hi Enrico,

No, I'm not on a managed environment.
Both Zookeeper and Kafka are installed directly on the machines in AWS.
And I can ssh into the nodes and nc among the machines.
They can communicate among them on the Zookeeper ports.
That's what makes this so weird.


I'm not sure if this helps, but on the logs; while the failed machine is
down, I keep seeing this:
(Machine-a has id 2. The failing machine is machine-b with id 3. The other
healthy machine is machine-c with id 4.)

May 03 14:18:40 machine-a zookeeper[27670]: [2023-05-03 14:18:40,060] WARN
> Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
> May 03 14:18:40 machine-a zookeeper[27670]: java.io.IOException:
> ZooKeeperServer not running
> May 03 14:18:40 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
> May 03 14:18:40 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
> May 03 14:18:40 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
> May 03 14:18:40 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
> May 03 14:18:40 machine-a zookeeper[27670]:         at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> May 03 14:18:40 machine-a zookeeper[27670]:         at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> May 03 14:18:40 machine-a zookeeper[27670]:         at
> java.base/java.lang.Thread.run(Thread.java:829)
> May 03 14:18:47 machine-a zookeeper[27670]: [2023-05-03 14:18:47,988] WARN
> Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
> May 03 14:18:47 machine-a zookeeper[27670]: java.io.IOException:
> ZooKeeperServer not running
> May 03 14:18:47 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
> May 03 14:18:47 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
> May 03 14:18:47 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
> May 03 14:18:47 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
> May 03 14:18:47 machine-a zookeeper[27670]:         at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> May 03 14:18:47 machine-a zookeeper[27670]:         at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> May 03 14:18:47 machine-a zookeeper[27670]:         at
> java.base/java.lang.Thread.run(Thread.java:829)
> May 03 14:18:49 machine-a zookeeper[27670]: [2023-05-03 14:18:49,397] WARN
> Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
> May 03 14:18:49 machine-a zookeeper[27670]: java.io.IOException:
> ZooKeeperServer not running
> May 03 14:18:49 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
> May 03 14:18:49 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
> May 03 14:18:49 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
> May 03 14:18:49 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
> May 03 14:18:49 machine-a zookeeper[27670]:         at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> May 03 14:18:49 machine-a zookeeper[27670]:         at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> May 03 14:18:49 machine-a zookeeper[27670]:         at
> java.base/java.lang.Thread.run(Thread.java:829)
> May 03 14:18:52 machine-a zookeeper[27670]: [2023-05-03 14:18:52,395] INFO
> Notification time out: 60000
> (org.apache.zookeeper.server.quorum.FastLeaderElection)
> May 03 14:18:52 machine-a zookeeper[27670]: [2023-05-03 14:18:52,398] INFO
> Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:2,
> n.round:0x7, n.peerEpoch:0x5, n.zxid:0x5000005a0, message format
> version:0x2, n.config version:0x0
> (org.apache.zookeeper.server.quorum.FastLeaderElection)
> May 03 14:18:56 machine-a zookeeper[27670]: [2023-05-03 14:18:56,725] WARN
> Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
> May 03 14:18:56 machine-a zookeeper[27670]: java.io.IOException:
> ZooKeeperServer not running
> May 03 14:18:56 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
> May 03 14:18:56 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
> May 03 14:18:56 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
> May 03 14:18:56 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
> May 03 14:18:56 machine-a zookeeper[27670]:         at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> May 03 14:18:56 machine-a zookeeper[27670]:         at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> May 03 14:18:56 machine-a zookeeper[27670]:         at
> java.base/java.lang.Thread.run(Thread.java:829)
> May 03 14:18:57 machine-a zookeeper[27670]: [2023-05-03 14:18:57,403] WARN
> Cannot open channel to 3 at election address
> ip-XXX-XXX-XXX-XXX.eu-west-X.compute.internal/XXX.XXX.XXX.XXX:3888
> (org.apache.zookeeper.server.quorum.QuorumCnxManager)
> May 03 14:18:57 machine-a zookeeper[27670]:
> java.net.SocketTimeoutException: connect timed out
> May 03 14:18:57 machine-a zookeeper[27670]:         at
> java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
> May 03 14:18:57 machine-a zookeeper[27670]:         at
> java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
> May 03 14:18:57 machine-a zookeeper[27670]:         at
> java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
> May 03 14:18:57 machine-a zookeeper[27670]:         at
> java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
> May 03 14:18:57 machine-a zookeeper[27670]:         at
> java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> May 03 14:18:57 machine-a zookeeper[27670]:         at
> java.base/java.net.Socket.connect(Socket.java:609)
> May 03 14:18:57 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:383)
> May 03 14:18:57 machine-a zookeeper[27670]:         at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:457)
> May 03 14:18:57 machine-a zookeeper[27670]:         at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> May 03 14:18:57 machine-a zookeeper[27670]:         at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> May 03 14:18:57 machine-a zookeeper[27670]:         at
> java.base/java.lang.Thread.run(Thread.java:829)



Thanks,
João


On Tue, May 23, 2023 at 7:06 PM Enrico Olivelli <[email protected]> wrote:

> Joao,
>
> Il Mar 23 Mag 2023, 16:57 João Silva <[email protected]> ha scritto:
>
> > Hi all,
> >
> > I've configured a 3-node Kafka (2.13) cluster with Zookeeper (3.6.3),
> with
> > each Zookeeper instance living in the same machine as each Kafka broker
> > (Java 11.0.18). Everything worked fine for a long long time.
> >
> > However, the first time a machine failed (so, both an instance of
> Zookeeper
> > and a Kafka broker), the other 2 were unable to continue working (in this
> > case, the leader failed). The 2 Zookeeper instances seemed like they
> > couldn't communicate with each other, and were unable to elect a new
> > leader. But that doesn't make sense, because they were communicating with
> > each other before the failure. Only when the failing machine was booted
> up
> > again, the other 2 machines were able to elect a new leader.
> >
> > From the logs, I don't get much more information than what I explained
> > above. The 2 living machines act like they don't "see" each other, and
> are
> > unable to elect a leader. When the failing machine goes up again, they
> > manage to elect a new leader.
> >
>
> Are you on some managed environment like k8s?
> Are you able to ssh into the nodes and use nc or any other tools to try to
> connect to the other machines?
>
>
> Enrico
>
>
>
> > Does anyone can help me shed some light on this problem?
> >
> > Is there some configuration property I'm missing?
> >
> > From my internet crawl, I got these 2 articles with problems similar to
> > mine, but they don't give a clear answer to why this happened and how to
> > fix it:
> >
> >
> >
> https://stackoverflow.com/questions/54005488/zookeeper-issue-taking-15-minutes-to-recover-if-leader-is-killed
> >
> >
> >
> https://servicesunavailable.wordpress.com/2014/11/11/zookeeper-leader-election-and-timeouts/
> >
> > Thanks in advance,
> > Joao
> >
>

Re: 3-node Zookeeper ensemble unable to recover if leader fails

Reply via email to