Hi,

Some additional information regarding this problem.

The issue is intermittent. It first appeared when a machine failed on AWS,
after the cluster had been running for a very long time (months). I then
tried to reproduce it by stopping a machine, and the problem didn't
manifest. Some days later I tried again and it did happen, but not on the
tries made immediately afterwards.

So this shouldn't be a configuration issue. It leads me to think that it's
related to some problem in the ZooKeeper state: the failure only occurs
after some time has passed, and it does not happen if I try to reproduce it
right after the machine fails.
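
As a sanity check on my reasoning: with a 3-node ensemble, ZooKeeper only
needs a strict majority to elect a leader, so the two surviving machines
should have been enough on paper. A minimal sketch of the quorum arithmetic
(plain Python, nothing ZooKeeper-specific):

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of servers that forms a ZooKeeper quorum
    (a strict majority of the ensemble)."""
    return ensemble_size // 2 + 1

# 3-node ensemble: quorum is 2, so losing one machine should
# still leave the remaining two able to elect a leader.
print(majority(3))  # -> 2

# Failures tolerated for common ensemble sizes.
for n in (3, 5, 7):
    print(n, "servers tolerate", n - majority(n), "failure(s)")
```

So the two survivors do form a quorum in theory; the question is why they
behave as if they can't see each other.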

Thanks,
João



On Wed, May 24, 2023 at 9:24 AM João Silva <[email protected]> wrote:

> Hi Enrico,
>
> No, I'm not on a managed environment.
> Both Zookeeper and Kafka are installed directly on the machines in AWS.
> I can ssh into the nodes and use nc between the machines, and they can
> communicate with each other on the ZooKeeper ports.
> That's what makes this so weird.
>
> I'm not sure if this helps, but in the logs, while the failed machine is
> down, I keep seeing this:
> (Machine-a has id 2. The failing machine is machine-b with id 3. The other
> healthy machine is machine-c with id 4.)
>
> May 03 14:18:40 machine-a zookeeper[27670]: [2023-05-03 14:18:40,060] WARN
>> Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
>> May 03 14:18:40 machine-a zookeeper[27670]: java.io.IOException:
>> ZooKeeperServer not running
>> May 03 14:18:40 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>> May 03 14:18:40 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>> May 03 14:18:40 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>> May 03 14:18:40 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>> May 03 14:18:40 machine-a zookeeper[27670]:         at
>> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>> May 03 14:18:40 machine-a zookeeper[27670]:         at
>> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>> May 03 14:18:40 machine-a zookeeper[27670]:         at
>> java.base/java.lang.Thread.run(Thread.java:829)
>> May 03 14:18:47 machine-a zookeeper[27670]: [2023-05-03 14:18:47,988]
>> WARN Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
>> May 03 14:18:47 machine-a zookeeper[27670]: java.io.IOException:
>> ZooKeeperServer not running
>> May 03 14:18:47 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>> May 03 14:18:47 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>> May 03 14:18:47 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>> May 03 14:18:47 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>> May 03 14:18:47 machine-a zookeeper[27670]:         at
>> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>> May 03 14:18:47 machine-a zookeeper[27670]:         at
>> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>> May 03 14:18:47 machine-a zookeeper[27670]:         at
>> java.base/java.lang.Thread.run(Thread.java:829)
>> May 03 14:18:49 machine-a zookeeper[27670]: [2023-05-03 14:18:49,397]
>> WARN Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
>> May 03 14:18:49 machine-a zookeeper[27670]: java.io.IOException:
>> ZooKeeperServer not running
>> May 03 14:18:49 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>> May 03 14:18:49 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>> May 03 14:18:49 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>> May 03 14:18:49 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>> May 03 14:18:49 machine-a zookeeper[27670]:         at
>> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>> May 03 14:18:49 machine-a zookeeper[27670]:         at
>> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>> May 03 14:18:49 machine-a zookeeper[27670]:         at
>> java.base/java.lang.Thread.run(Thread.java:829)
>> May 03 14:18:52 machine-a zookeeper[27670]: [2023-05-03 14:18:52,395]
>> INFO Notification time out: 60000
>> (org.apache.zookeeper.server.quorum.FastLeaderElection)
>> May 03 14:18:52 machine-a zookeeper[27670]: [2023-05-03 14:18:52,398]
>> INFO Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:2,
>> n.round:0x7, n.peerEpoch:0x5, n.zxid:0x5000005a0, message format
>> version:0x2, n.config version:0x0
>> (org.apache.zookeeper.server.quorum.FastLeaderElection)
>> May 03 14:18:56 machine-a zookeeper[27670]: [2023-05-03 14:18:56,725]
>> WARN Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
>> May 03 14:18:56 machine-a zookeeper[27670]: java.io.IOException:
>> ZooKeeperServer not running
>> May 03 14:18:56 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>> May 03 14:18:56 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>> May 03 14:18:56 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>> May 03 14:18:56 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>> May 03 14:18:56 machine-a zookeeper[27670]:         at
>> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>> May 03 14:18:56 machine-a zookeeper[27670]:         at
>> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>> May 03 14:18:56 machine-a zookeeper[27670]:         at
>> java.base/java.lang.Thread.run(Thread.java:829)
>> May 03 14:18:57 machine-a zookeeper[27670]: [2023-05-03 14:18:57,403]
>> WARN Cannot open channel to 3 at election address
>> ip-XXX-XXX-XXX-XXX.eu-west-X.compute.internal/XXX.XXX.XXX.XXX:3888
>> (org.apache.zookeeper.server.quorum.QuorumCnxManager)
>> May 03 14:18:57 machine-a zookeeper[27670]:
>> java.net.SocketTimeoutException: connect timed out
>> May 03 14:18:57 machine-a zookeeper[27670]:         at
>> java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
>> May 03 14:18:57 machine-a zookeeper[27670]:         at
>> java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
>> May 03 14:18:57 machine-a zookeeper[27670]:         at
>> java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
>> May 03 14:18:57 machine-a zookeeper[27670]:         at
>> java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
>> May 03 14:18:57 machine-a zookeeper[27670]:         at
>> java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>> May 03 14:18:57 machine-a zookeeper[27670]:         at
>> java.base/java.net.Socket.connect(Socket.java:609)
>> May 03 14:18:57 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:383)
>> May 03 14:18:57 machine-a zookeeper[27670]:         at
>> org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:457)
>> May 03 14:18:57 machine-a zookeeper[27670]:         at
>> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>> May 03 14:18:57 machine-a zookeeper[27670]:         at
>> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>> May 03 14:18:57 machine-a zookeeper[27670]:         at
>> java.base/java.lang.Thread.run(Thread.java:829)
>
>
>
> Thanks,
> João
>
>
> On Tue, May 23, 2023 at 7:06 PM Enrico Olivelli <[email protected]>
> wrote:
>
>> Joao,
>>
>> On Tue, May 23, 2023 at 16:57, João Silva <[email protected]> wrote:
>>
>> > Hi all,
>> >
>> > I've configured a 3-node Kafka (2.13) cluster with ZooKeeper (3.6.3),
>> > with each ZooKeeper instance living on the same machine as its Kafka
>> > broker (Java 11.0.18). Everything worked fine for a very long time.
>> >
>> > However, the first time a machine failed (taking down both a ZooKeeper
>> > instance and a Kafka broker; in this case, it was the leader that
>> > failed), the other two were unable to continue working. The two
>> > surviving ZooKeeper instances seemed unable to communicate with each
>> > other and could not elect a new leader. That doesn't make sense,
>> > because they were communicating with each other before the failure.
>> > Only when the failed machine was booted up again were the other two
>> > machines able to elect a new leader.
>> >
>> > From the logs, I don't get much more information than what I explained
>> > above. The two living machines act as if they don't "see" each other
>> > and are unable to elect a leader. When the failed machine comes up
>> > again, they manage to elect a new leader.
>> >
>>
>> Are you on some managed environment like k8s?
>> Are you able to ssh into the nodes and use nc or any other tools to try to
>> connect to the other machines?
>>
>>
>> Enrico
>>
>>
>>
>> > Can anyone help me shed some light on this problem?
>> >
>> > Is there some configuration property I'm missing?
>> >
>> > From searching the internet, I found these two articles describing
>> > problems similar to mine, but neither gives a clear answer as to why
>> > this happened or how to fix it:
>> >
>> >
>> >
>> https://stackoverflow.com/questions/54005488/zookeeper-issue-taking-15-minutes-to-recover-if-leader-is-killed
>> >
>> >
>> >
>> https://servicesunavailable.wordpress.com/2014/11/11/zookeeper-leader-election-and-timeouts/
>> >
>> > Thanks in advance,
>> > Joao
>> >
>>
>
