3-node Zookeeper ensemble unable to recover if leader fails

João Silva Tue, 23 May 2023 07:57:51 -0700

Hi all,

I've configured a 3-node Kafka (2.13) cluster with Zookeeper (3.6.3), with
each Zookeeper instance running on the same machine as a Kafka broker
(Java 11.0.18). Everything worked fine for a long time.


However, the first time a machine failed (taking down both a Zookeeper
instance and a Kafka broker), the other 2 were unable to continue working
(in this case, the failed machine hosted the leader). The 2 remaining
Zookeeper instances seemed unable to communicate with each other and could
not elect a new leader. But that doesn't make sense, because they were
communicating with each other before the failure. Only when the failed
machine was booted up again were the other 2 machines able to elect a new
leader.
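
The expectation that the 2 surviving instances should recover rests on the
default majority-quorum rule: an ensemble can elect a leader as long as a
strict majority of its configured servers is reachable. A minimal sketch of
that arithmetic follows; the function names are hypothetical and purely
illustrative, not part of any Zookeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the configured ensemble (illustrative helper)."""
    return ensemble_size // 2 + 1


def can_elect_leader(ensemble_size: int, alive: int) -> bool:
    """True if the number of reachable servers meets the majority quorum."""
    return alive >= quorum_size(ensemble_size)


print(quorum_size(3))          # 2
print(can_elect_leader(3, 2))  # True
print(can_elect_leader(3, 1))  # False
```

Under this rule, 2 of 3 servers is a quorum, so the behaviour described
above suggests the 2 survivors could not reach each other rather than a
shortage of servers.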

From the logs, I don't get much more information than what I explained
above. The 2 surviving machines act like they don't "see" each other, and
are unable to elect a leader. When the failed machine comes back up, they
manage to elect a new leader.

Can anyone help me shed some light on this problem?

Is there some configuration property I'm missing?

From searching online, I found these 2 articles describing problems similar
to mine, but they don't give a clear answer as to why this happened or how
to fix it:

https://stackoverflow.com/questions/54005488/zookeeper-issue-taking-15-minutes-to-recover-if-leader-is-killed

https://servicesunavailable.wordpress.com/2014/11/11/zookeeper-leader-election-and-timeouts/

Thanks in advance,
Joao
