[jira] [Created] (KAFKA-8151) Broker hangs and lockups after Zookeeper outages

Joe Ammann (JIRA) Sat, 23 Mar 2019 02:31:15 -0700

Joe Ammann created KAFKA-8151:
---------------------------------

             Summary: Broker hangs and lockups after Zookeeper outages
                 Key: KAFKA-8151
                 URL: https://issues.apache.org/jira/browse/KAFKA-8151
             Project: Kafka
          Issue Type: Bug
          Components: controller, core, zkclient
    Affects Versions: 2.1.1
            Reporter: Joe Ammann



We're running several clusters (mostly with 3 brokers) on 2.1.1, where we see 
at least 3 different symptoms, all resulting in broker/controller lockups.

We are pretty sure that the triggering cause for all these symptoms is 
temporary outages (normally 3-5 minutes) of the Zookeeper cluster. The Linux 
VMs on which the ZK nodes run regularly get stalled for a couple of minutes. 
The ZK nodes always reunite very quickly and form a quorum after the situation 
clears, but the Kafka brokers (which run on the same Linux VMs) quite often 
show problems after this procedure.

I've seen 3 different kinds of problems (this is why I put "reproduce" in 
quotes; I can never predict what will happen):

# the brokers get their ZK sessions expired (obviously) and sometimes only 2 of 
3 re-register under /brokers/ids. The 3rd broker doesn't re-register for some 
reason (that's the problem I originally described)
# the brokers all re-register and elect a new controller. But that new 
controller does not fully work. For example, it doesn't process partition 
reassignment requests, or it does not transfer partition leadership after I 
kill a broker
# the previous controller gets "dead-locked" (it has 3-4 of the important 
controller threads in a lock) and hence does not perform any of its controller 
duties. But it still regards itself as the valid controller and is accepted by 
the other brokers
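
For the first symptom, a minimal sketch of how a missing registration could be 
detected: compare the broker ids you expect against the children of 
/brokers/ids. This is only an illustration, not something from our setup. The 
names missing_brokers, list_children and fake_list_children are hypothetical, 
and the children list is stubbed out here; in practice it would come from 
whatever ZooKeeper client you use.

```python
from typing import Callable, Iterable, List, Set


def missing_brokers(expected_ids: Iterable[int],
                    list_children: Callable[[str], List[str]],
                    path: str = "/brokers/ids") -> Set[int]:
    """Return the expected broker ids that have no znode under `path`."""
    # Child znode names under /brokers/ids are the broker ids as strings.
    registered = {int(child) for child in list_children(path)}
    return set(expected_ids) - registered


# Stand-in for a real ZooKeeper "get children" call (hypothetical data:
# only brokers 1 and 2 have re-registered after the session expiry).
def fake_list_children(path: str) -> List[str]:
    return ["1", "2"] if path == "/brokers/ids" else []


print(sorted(missing_brokers([1, 2, 3], fake_list_children)))  # prints [3]
```

Passing the listing function in as a parameter keeps the check independent of 
any particular client library, so it can be run against a stub as above.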

I'll try to describe each of the problems in more detail below, and hope to 
be able to clearly separate them. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
