Hi all,

Can anyone help me understand this behaviour? I was not able to get any help from my searches...
Thanks,
Joao

On Wed, May 24, 2023 at 10:41 AM João Silva <[email protected]> wrote:

> Hi,
>
> Some additional information regarding this problem.
>
> It only happens sometimes (not always). It happened the first time when a
> machine failed on AWS (after working fine for a very long time; months).
> I then tried to replicate the issue by stopping a machine, but the problem
> didn't manifest. After some days, I tried to replicate it again and it
> happened (but not on the subsequent tries right after). So this shouldn't
> be a configuration issue. It leads me to think this is related to some
> problem in the Zookeeper state... The failure only occurs after some time
> has passed; it does not happen if I try to replicate it right after the
> machine fails.
>
> Thanks,
> João
>
> On Wed, May 24, 2023 at 9:24 AM João Silva <[email protected]> wrote:
>
>> Hi Enrico,
>>
>> No, I'm not on a managed environment. Both Zookeeper and Kafka are
>> installed directly on the machines in AWS. I can ssh into the nodes and
>> use nc among the machines; they can communicate with each other on the
>> Zookeeper ports. That's what makes this so weird.
>>
>> I'm not sure if this helps, but while the failed machine is down, I keep
>> seeing this in the logs (machine-a has id 2; the failing machine is
>> machine-b with id 3; the other healthy machine is machine-c with id 4):
>>
>>> May 03 14:18:40 machine-a zookeeper[27670]: [2023-05-03 14:18:40,060] WARN Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
>>> May 03 14:18:40 machine-a zookeeper[27670]: java.io.IOException: ZooKeeperServer not running
>>> May 03 14:18:40 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>>> May 03 14:18:40 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>>> May 03 14:18:40 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>>> May 03 14:18:40 machine-a zookeeper[27670]: at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>>> May 03 14:18:40 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>> May 03 14:18:40 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>> May 03 14:18:40 machine-a zookeeper[27670]: at java.base/java.lang.Thread.run(Thread.java:829)
>>> May 03 14:18:47 machine-a zookeeper[27670]: [2023-05-03 14:18:47,988] WARN Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
>>> May 03 14:18:47 machine-a zookeeper[27670]: java.io.IOException: ZooKeeperServer not running
>>> May 03 14:18:47 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>>> May 03 14:18:47 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>>> May 03 14:18:47 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>>> May 03 14:18:47 machine-a zookeeper[27670]: at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>>> May 03 14:18:47 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>> May 03 14:18:47 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>> May 03 14:18:47 machine-a zookeeper[27670]: at java.base/java.lang.Thread.run(Thread.java:829)
>>> May 03 14:18:49 machine-a zookeeper[27670]: [2023-05-03 14:18:49,397] WARN Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
>>> May 03 14:18:49 machine-a zookeeper[27670]: java.io.IOException: ZooKeeperServer not running
>>> May 03 14:18:49 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>>> May 03 14:18:49 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>>> May 03 14:18:49 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>>> May 03 14:18:49 machine-a zookeeper[27670]: at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>>> May 03 14:18:49 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>> May 03 14:18:49 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>> May 03 14:18:49 machine-a zookeeper[27670]: at java.base/java.lang.Thread.run(Thread.java:829)
>>> May 03 14:18:52 machine-a zookeeper[27670]: [2023-05-03 14:18:52,395] INFO Notification time out: 60000 (org.apache.zookeeper.server.quorum.FastLeaderElection)
>>> May 03 14:18:52 machine-a zookeeper[27670]: [2023-05-03 14:18:52,398] INFO Notification: my state:LOOKING; n.sid:2, n.state:LOOKING, n.leader:2, n.round:0x7, n.peerEpoch:0x5, n.zxid:0x5000005a0, message format version:0x2, n.config version:0x0 (org.apache.zookeeper.server.quorum.FastLeaderElection)
>>> May 03 14:18:56 machine-a zookeeper[27670]: [2023-05-03 14:18:56,725] WARN Close of session 0x0 (org.apache.zookeeper.server.NIOServerCnxn)
>>> May 03 14:18:56 machine-a zookeeper[27670]: java.io.IOException: ZooKeeperServer not running
>>> May 03 14:18:56 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
>>> May 03 14:18:56 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
>>> May 03 14:18:56 machine-a zookeeper[27670]: at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>>> May 03 14:18:56 machine-a zookeeper[27670]: at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>>> May 03 14:18:56 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>> May 03 14:18:56 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>> May 03 14:18:56 machine-a zookeeper[27670]: at java.base/java.lang.Thread.run(Thread.java:829)
>>> May 03 14:18:57 machine-a zookeeper[27670]: [2023-05-03 14:18:57,403] WARN Cannot open channel to 3 at election address ip-XXX-XXX-XXX-XXX.eu-west-X.compute.internal/XXX.XXX.XXX.XXX:3888 (org.apache.zookeeper.server.quorum.QuorumCnxManager)
>>> May 03 14:18:57 machine-a zookeeper[27670]: java.net.SocketTimeoutException: connect timed out
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.net.Socket.connect(Socket.java:609)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:383)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:457)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>>> May 03 14:18:57 machine-a zookeeper[27670]: at java.base/java.lang.Thread.run(Thread.java:829)
>>
>> Thanks,
>> João
>>
>> On Tue, May 23, 2023 at 7:06 PM Enrico Olivelli <[email protected]> wrote:
>>
>>> Joao,
>>>
>>> On Tue, 23 May 2023 at 16:57 João Silva <[email protected]> wrote:
>>>
>>> > Hi all,
>>> >
>>> > I've configured a 3-node Kafka (2.13) cluster with Zookeeper (3.6.3),
>>> > with each Zookeeper instance running on the same machine as a Kafka
>>> > broker (Java 11.0.18). Everything worked fine for a long, long time.
>>> >
>>> > However, the first time a machine failed (so, both an instance of
>>> > Zookeeper and a Kafka broker), the other 2 were unable to continue
>>> > working (in this case, the leader failed).
>>> > The 2 Zookeeper instances seemed like they couldn't communicate with
>>> > each other, and were unable to elect a new leader. But that doesn't
>>> > make sense, because they were communicating with each other before
>>> > the failure. Only when the failing machine was booted up again were
>>> > the other 2 machines able to elect a new leader.
>>> >
>>> > From the logs, I don't get much more information than what I
>>> > explained above. The 2 living machines act like they don't "see" each
>>> > other, and are unable to elect a leader. When the failing machine
>>> > comes up again, they manage to elect a new leader.
>>>
>>> Are you on some managed environment like k8s?
>>> Are you able to ssh into the nodes and use nc or any other tools to
>>> try to connect to the other machines?
>>>
>>> Enrico
>>>
>>> > Can anyone help me shed some light on this problem?
>>> >
>>> > Is there some configuration property I'm missing?
>>> >
>>> > From my internet searches, I found these 2 articles describing
>>> > problems similar to mine, but they don't give a clear answer to why
>>> > this happened or how to fix it:
>>> >
>>> > https://stackoverflow.com/questions/54005488/zookeeper-issue-taking-15-minutes-to-recover-if-leader-is-killed
>>> >
>>> > https://servicesunavailable.wordpress.com/2014/11/11/zookeeper-leader-election-and-timeouts/
>>> >
>>> > Thanks in advance,
>>> > Joao
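[Editor's note: for context on why the behaviour described above is surprising, ZooKeeper only needs a strict majority of the configured ensemble to elect a leader, so with 3 servers the 2 survivors should in principle suffice. A minimal sketch of that majority arithmetic (illustrative only, not ZooKeeper's actual implementation):]

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble: floor(n/2) + 1."""
    return ensemble_size // 2 + 1


def can_elect_leader(ensemble_size: int, alive: int) -> bool:
    """Leader election can only succeed while a strict majority
    of the configured servers can reach each other."""
    return alive >= quorum_size(ensemble_size)


# A 3-node ensemble tolerates one machine failure:
print(quorum_size(3))          # 2
print(can_elect_leader(3, 2))  # True
# Losing two machines does break the quorum:
print(can_elect_leader(3, 1))  # False
```

Since 2 out of 3 already satisfies the majority, the arithmetic alone does not explain why no leader could be elected until the third machine returned, which is what makes the reported behaviour look like a connectivity or election-channel problem between the two survivors.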
