Hi all,

Can anyone help me understand this behaviour? I was not able to get any help from my searches...
Thanks,
Joao

On Wed, May 24, 2023 at 10:41 AM João Silva <[email protected]> wrote:

> Hi,
>
> Some additional information regarding this problem.
>
> It only happens sometimes (not always). It happened the first time when a
> machine failed on AWS (after working fine for a very long time; months).
> I then tried to replicate the issue by stopping a machine, but the problem
> didn't manifest. After some days, I tried to replicate it again and it
> happened (but not on the subsequent tries right after). So this shouldn't
> be a configuration issue. It leads me to think this is related to some
> problem in the Zookeeper state... The failure only occurs after some time
> has passed; it does not happen if I try to replicate it right after the
> machine fails.
>
> Thanks,
> João
>
> On Wed, May 24, 2023 at 9:24 AM João Silva <[email protected]> wrote:
>
>> Hi Enrico,
>>
>> No, I'm not on a managed environment. Both Zookeeper and Kafka are
>> installed directly on the machines in AWS. I can ssh into the nodes and
>> use nc among the machines; they can communicate with each other on the
>> Zookeeper ports. That's what makes this so weird.
>>
>> I'm not sure if this helps, but while the failed machine is down, I keep
>> seeing this in the logs (machine-a has id 2; the failing machine is
>> machine-b with id 3; the other healthy machine is machine-c with id 4):
>>
>>> May 03 14:18:40 machine-a zookeeper[27670]: [2023-05-03 14:18:40,060] WARN Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
>>> May 03 14:18:40 machine-a zookeeper[27670]: java.io.IOException: ZooKeeperServer not running
>>> May 03 14:18:40 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>>> May 03 14:18:40 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>>> May 03 14:18:40 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>>> May 03 14:18:40 machine-a zookeeper[27670]: at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>>> May 03 14:18:40 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>> May 03 14:18:40 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>> May 03 14:18:40 machine-a zookeeper[27670]: at java.base/java.lang.Thread.run(Thread.java:829)
>>> May 03 14:18:47 machine-a zookeeper[27670]: [2023-05-03 14:18:47,988] WARN Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
>>> May 03 14:18:47 machine-a zookeeper[27670]: java.io.IOException: ZooKeeperServer not running
>>> May 03 14:18:47 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>>> May 03 14:18:47 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>>> May 03 14:18:47 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>>> May 03 14:18:47 machine-a zookeeper[27670]: at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>>> May 03 14:18:47 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>> May 03 14:18:47 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>> May 03 14:18:47 machine-a zookeeper[27670]: at java.base/java.lang.Thread.run(Thread.java:829)
>>> May 03 14:18:49 machine-a zookeeper[27670]: [2023-05-03 14:18:49,397] WARN Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
>>> May 03 14:18:49 machine-a zookeeper[27670]: java.io.IOException: ZooKeeperServer not running
>>> May 03 14:18:49 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>>> May 03 14:18:49 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>>> May 03 14:18:49 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>>> May 03 14:18:49 machine-a zookeeper[27670]: at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>>> May 03 14:18:49 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>> May 03 14:18:49 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>> May 03 14:18:49 machine-a zookeeper[27670]: at java.base/java.lang.Thread.run(Thread.java:829)
>>> May 03 14:18:52 machine-a zookeeper[27670]: [2023-05-03 14:18:52,395] INFO Notification time out: 60000 (org.apache.zookeeper.server.quorum.FastLeaderElection)
>>> May 03 14:18:52 machine-a zookeeper[27670]: [2023-05-03 14:18:52,398] INFO Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:2, n.round:0x7, n.peerEpoch:0x5, n.zxid:0x5000005a0, message format version:0x2, n.config version:0x0 (org.apache.zookeeper.server.quorum.FastLeaderElection)
>>> May 03 14:18:56 machine-a zookeeper[27670]: [2023-05-03 14:18:56,725] WARN Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
>>> May 03 14:18:56 machine-a zookeeper[27670]: java.io.IOException: ZooKeeperServer not running
>>> May 03 14:18:56 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>>> May 03 14:18:56 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>>> May 03 14:18:56 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>>> May 03 14:18:56 machine-a zookeeper[27670]: at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>>> May 03 14:18:56 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>> May 03 14:18:56 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>> May 03 14:18:56 machine-a zookeeper[27670]: at java.base/java.lang.Thread.run(Thread.java:829)
>>> May 03 14:18:57 machine-a zookeeper[27670]: [2023-05-03 14:18:57,403] WARN Cannot open channel to 3 at election address ip-XXX-XXX-XXX-XXX.eu-west-X.compute.internal/XXX.XXX.XXX.XXX:3888 (org.apache.zookeeper.server.quorum.QuorumCnxManager)
>>> May 03 14:18:57 machine-a zookeeper[27670]: java.net.SocketTimeoutException: connect timed out
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.net.Socket.connect(Socket.java:609)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:383)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:457)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.lang.Thread.run(Thread.java:829)
>>
>> Thanks,
>> João
>>
>> On Tue, May 23, 2023 at 7:06 PM Enrico Olivelli <[email protected]> wrote:
>>
>>> Joao,
>>>
>>> On Tue, 23 May 2023 at 16:57 João Silva <[email protected]> wrote:
>>>
>>> > Hi all,
>>> >
>>> > I've configured a 3-node Kafka (2.13) cluster with Zookeeper (3.6.3),
>>> > with each Zookeeper instance running on the same machine as a Kafka
>>> > broker (Java 11.0.18). Everything worked fine for a long, long time.
>>> >
>>> > However, the first time a machine failed (so, both an instance of
>>> > Zookeeper and a Kafka broker), the other 2 were unable to continue
>>> > working (in this case, the leader failed).
>>> > The 2 Zookeeper instances seemed like they couldn't communicate with
>>> > each other, and were unable to elect a new leader. But that doesn't
>>> > make sense, because they were communicating with each other before
>>> > the failure. Only when the failing machine was booted up again were
>>> > the other 2 machines able to elect a new leader.
>>> >
>>> > From the logs, I don't get much more information than what I
>>> > explained above. The 2 living machines act like they don't "see" each
>>> > other, and are unable to elect a leader. When the failing machine
>>> > comes up again, they manage to elect a new leader.
>>>
>>> Are you on some managed environment like k8s?
>>> Are you able to ssh into the nodes and use nc or any other tools to
>>> try to connect to the other machines?
>>>
>>> Enrico
>>>
>>> > Can anyone help me shed some light on this problem?
>>> >
>>> > Is there some configuration property I'm missing?
>>> >
>>> > From my internet searches, I found these 2 articles describing
>>> > problems similar to mine, but they don't give a clear answer to why
>>> > this happened or how to fix it:
>>> >
>>> > https://stackoverflow.com/questions/54005488/zookeeper-issue-taking-15-minutes-to-recover-if-leader-is-killed
>>> >
>>> > https://servicesunavailable.wordpress.com/2014/11/11/zookeeper-leader-election-and-timeouts/
>>> >
>>> > Thanks in advance,
>>> > Joao
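[Editor's note: for context on why the behaviour described above is surprising, ZooKeeper only needs a strict majority of the configured ensemble to elect a leader, so with 3 servers the 2 survivors should in principle suffice. A minimal sketch of that majority arithmetic (illustrative only, not ZooKeeper's actual implementation):]

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble: floor(n/2) + 1."""
    return ensemble_size // 2 + 1


def can_elect_leader(ensemble_size: int, alive: int) -> bool:
    """Leader election can only succeed while a strict majority
    of the configured servers can reach each other."""
    return alive >= quorum_size(ensemble_size)


# A 3-node ensemble tolerates one machine failure:
print(quorum_size(3))          # 2
print(can_elect_leader(3, 2))  # True
# Losing two machines does break the quorum:
print(can_elect_leader(3, 1))  # False
```

Since 2 out of 3 already satisfies the majority, the arithmetic alone does not explain why no leader could be elected until the third machine returned, which is what makes the reported behaviour look like a connectivity or election-channel problem between the two survivors.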
