Joao,

On Tue, 23 May 2023, 16:57 João Silva <[email protected]> wrote:
> Hi all,
>
> I've configured a 3-node Kafka (2.13) cluster with Zookeeper (3.6.3), with
> each Zookeeper instance running on the same machine as each Kafka broker
> (Java 11.0.18). Everything worked fine for a long time.
>
> However, the first time a machine failed (so, both an instance of Zookeeper
> and a Kafka broker), the other 2 were unable to continue working (in this
> case, the leader failed). The 2 Zookeeper instances seemed like they
> couldn't communicate with each other, and were unable to elect a new
> leader. But that doesn't make sense, because they were communicating with
> each other before the failure. Only when the failing machine was booted up
> again were the other 2 machines able to elect a new leader.
>
> From the logs, I don't get much more information than what I explained
> above. The 2 living machines act like they don't "see" each other, and are
> unable to elect a leader. When the failing machine comes up again, they
> manage to elect a new leader.
>

Are you on some managed environment like k8s?
Are you able to ssh into the nodes and use nc or any other tools to try to
connect to the other machines?

Enrico

> Can anyone help me shed some light on this problem?
>
> Is there some configuration property I'm missing?
>
> From my internet crawl, I got these 2 articles with problems similar to
> mine, but they don't give a clear answer to why this happened and how to
> fix it:
>
> https://stackoverflow.com/questions/54005488/zookeeper-issue-taking-15-minutes-to-recover-if-leader-is-killed
>
> https://servicesunavailable.wordpress.com/2014/11/11/zookeeper-leader-election-and-timeouts/
>
> Thanks in advance,
> Joao
>
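If nc is not installed on the nodes, the connectivity check suggested above can be approximated with a few lines of Python. This is only a rough sketch: the function names are made up for illustration, and the ports are assumptions (2181 for clients and 3888 for leader election are common zoo.cfg values, so use whatever your `clientPort` and `server.N` lines actually say). Run it from each surviving node against the other one while the third machine is down.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_peers(peers, ports=(2181, 3888), timeout: float = 3.0) -> dict:
    """Probe every (host, port) pair and return a {(host, port): bool} map.

    The default ports are assumptions, not values taken from this cluster.
    """
    results = {}
    for host in peers:
        for port in ports:
            results[(host, port)] = can_connect(host, port, timeout)
    return results
```

For example, `check_peers(["zk2.example.internal"])` (a hypothetical hostname) returns one entry per port. If the election port is unreachable between the two surviving nodes, that would be consistent with them not "seeing" each other; if it is reachable, the problem is more likely elsewhere.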
