[ https://issues.apache.org/jira/browse/KAFKA-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802264#comment-16802264 ]
Joe Ammann edited comment on KAFKA-8151 at 3/26/19 11:14 PM:
-------------------------------------------------------------

I think symptom 2 was/is actually bug KAFKA-1120

Checking the logs I have of those occurrences again, my description was wrong. It was not that the controller did not work at all, it was really just ignoring one broker. The problem could be resolved either by restarting the controller or by restarting the broker that was being ignored by the controller.


was (Author: jammann):
I think scenario 2 was/is actually bug KAFKA-1120

Checking the logs I have of those occurrences again, my description was wrong. It was not that the controller did not work at all, it was really just ignoring one broker. The problem could be resolved either by restarting the controller or by restarting the broker that was being ignored by the controller.


> Broker hangs and lockups after Zookeeper outages
> ------------------------------------------------
>
>                 Key: KAFKA-8151
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8151
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, core, zkclient
>    Affects Versions: 2.1.1
>            Reporter: Joe Ammann
>            Priority: Major
>         Attachments: symptom3_lxgurten_kafka_dump1.txt, symptom3_lxgurten_kafka_dump2.txt, symptom3_lxgurten_kafka_dump3.txt
>
> We're running several clusters (mostly with 3 brokers) with 2.1.1, where we see at least 3 different symptoms, all resulting in broker/controller lockups.
> We are pretty sure that the triggering cause for all these symptoms is temporary outages (normally 3-5 minutes) of the Zookeeper cluster. The Linux VMs where the ZK nodes run regularly get stalled for a couple of minutes. The ZK nodes always very quickly reunite and build a quorum after the situation clears, but the Kafka brokers (which run on the same Linux VMs) quite often show problems after this procedure.
> I've seen 3 different kinds of problems (this is why I put "reproduce" in quotes, I can never predict what will happen)
> # the brokers get their ZK sessions expired (obviously) and sometimes only 2 of 3 re-register under /brokers/ids. The 3rd broker doesn't re-register for some reason (that's the problem I originally described)
> # the brokers all re-register and re-elect a new controller, but that new controller does not fully work. For example, it doesn't process partition reassignment requests and/or does not transfer partition leadership after I kill a broker
> # the previous controller gets "dead-locked" (it has 3-4 of the important controller threads in a lock) and hence does not perform any of its controller duties, but it still regards itself as the valid controller and is accepted by the other brokers
> I'll try to describe each of the problems in more detail below, and hope to be able to clearly separate them.
> I'm able to provoke these problems in our DEV environment quite regularly using the following procedure (a minimal sketch of it follows below)
> * make sure all ZK nodes and Kafka brokers are stable and reacting normally
> * freeze 2 out of 3 ZK nodes with {{kill -STOP}} for some minutes
> * leave the Kafka brokers running; of course they will start complaining about being unable to reach ZK
> * thaw the ZK processes with {{kill -CONT}}
> * now all Kafka brokers get notified that their ZK session has expired, and they start to reorganize the cluster
> In about 20% of the tests, I'm able to produce one of the symptoms above. I cannot predict which one though.
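> For illustration, a minimal sketch of this freeze/thaw procedure. The host names (zk-node-2, zk-node-3) and the {{pgrep}} pattern are placeholders for our environment, not standard tooling; adapt them as needed.
> {code:bash}
> #!/usr/bin/env bash
> # Placeholder host names: 2 of the 3 ZooKeeper nodes in the cluster.
> ZK_HOSTS="zk-node-2 zk-node-3"
>
> # Freeze the ZK JVMs; the Kafka brokers keep running throughout.
> for host in $ZK_HOSTS; do
>   ssh "$host" 'kill -STOP $(pgrep -f QuorumPeerMain)'
> done
>
> # Keep ZK frozen for a few minutes, like the VM stalls we observe.
> sleep 240
>
> # Thaw the ZK processes; the brokers now learn their sessions have expired.
> for host in $ZK_HOSTS; do
>   ssh "$host" 'kill -CONT $(pgrep -f QuorumPeerMain)'
> done
> {code}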
> I'm varying this procedure sometimes by also freezing one Kafka broker (most often the controller), but until now I haven't been able to create a clear pattern or really force one specific symptom



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)