Joe Ammann created KAFKA-8151:
---------------------------------

             Summary: Broker hangs and lockups after Zookeeper outages
                 Key: KAFKA-8151
                 URL: https://issues.apache.org/jira/browse/KAFKA-8151
             Project: Kafka
          Issue Type: Bug
          Components: controller, core, zkclient
    Affects Versions: 2.1.1
            Reporter: Joe Ammann

We're running several clusters (mostly with 3 brokers) on 2.1.1, where we see at least 3 different symptoms, all resulting in broker/controller lockups.

We are pretty sure that the triggering cause for all these symptoms is temporary outages (normally 3-5 minutes) of the Zookeeper cluster. The Linux VMs that the ZK nodes run on regularly get stalled for a couple of minutes. The ZK nodes always reunite very quickly and form a quorum after the situation clears, but the Kafka brokers (which run on the same Linux VMs) quite often show problems afterwards.

I've seen 3 different kinds of problems (this is why I put "reproduce" in quotes, I can never predict what will happen):
# The brokers get their ZK sessions expired (obviously), and sometimes only 2 of 3 re-register under /brokers/ids. The 3rd broker doesn't re-register for some reason (that's the problem I originally described).
# The brokers all re-register and elect a new controller, but that new controller does not fully work. For example, it doesn't process partition reassignment requests and/or does not transfer partition leadership after I kill a broker.
# The previous controller gets "dead-locked" (it has 3-4 of the important controller threads in a lock) and hence does not perform any of its controller duties. But it still regards itself as the valid controller and is accepted by the other brokers.

I'll try to describe each of the problems in more detail below, and hope to be able to clearly separate them.
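
As a rough illustration of how the first symptom could be spotted from the outside, here is a minimal Python sketch. It is not taken from the affected clusters. It assumes that the child names of /brokers/ids have already been fetched with some ZooKeeper client, and that the /controller znode carries a JSON payload with a "brokerid" field. The function names `missing_brokers` and `controller_id` and the sample values are hypothetical.

```python
import json


def missing_brokers(expected_ids, registered_children):
    """Return the expected broker IDs that have no znode under /brokers/ids.

    registered_children is the list of child names as a ZooKeeper client
    returns them, i.e. strings such as "1" or "2".
    """
    registered = {int(name) for name in registered_children}
    return sorted(set(expected_ids) - registered)


def controller_id(controller_znode_data):
    """Extract the broker ID from the payload of the /controller znode.

    Assumption: the payload is JSON containing a "brokerid" field.
    """
    return json.loads(controller_znode_data)["brokerid"]


if __name__ == "__main__":
    # Hypothetical snapshot taken after a ZK outage: broker 3 did not re-register.
    children = ["1", "2"]
    print(missing_brokers([1, 2, 3], children))  # [3]

    # Hypothetical /controller payload naming broker 2 as the controller.
    payload = '{"version":1,"brokerid":2,"timestamp":"0"}'
    print(controller_id(payload))  # 2
```

A check like this only covers the registration state visible in ZK. It cannot tell whether the registered controller is actually doing its work, which is what the second and third symptoms are about.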