I've filed https://issues.apache.org/jira/browse/KAFKA-8151, and tried to keep the descriptions of the different symptoms apart.
I have yet to collect detailed information about an occurrence of symptom 2.
I will try the 2.2 RC later today.

CU, Joe

On 3/23/19 1:17 AM, Ismael Juma wrote:
> I'd suggest filing a single JIRA as a first step. Please test the 2.2 RC
> before filing if possible. Please include enough details for someone else to
> reproduce.
>
> Thanks!
>
> Ismael
>
> On Fri, Mar 22, 2019, 3:14 PM Joe Ammann <j...@pyx.ch> wrote:
>
> Hi Ismael
>
> I've done a few more tests, and it seems that I'm able to "reproduce"
> various kinds of problems in Kafka 2.1.1 in our DEV. I can force these by
> faking an outage of ZooKeeper. What I do for my tests is freeze (kill -STOP)
> 2 out of 3 ZK instances, let the Kafka brokers continue, then thaw the ZK
> instances (kill -CONT) and see what happens.
>
> The ZK nodes always reunite very quickly and build a quorum after thawing.
>
> But the Kafka brokers (running on the same 3 Linux VMs) quite often show
> problems after this procedure (most of the time they successfully re-register
> and continue to work). I've seen 3 different kinds of problems (this is why I
> put "reproduce" in quotes, I can never predict what will happen):
>
> - the brokers get their ZK sessions expired (obviously) and sometimes
>   only 2 of 3 re-register under /brokers/ids. The 3rd broker doesn't
>   re-register for some reason (that's the problem I originally described)
> - the brokers all re-register and elect a new controller. But that new
>   controller does not fully work. For example, it doesn't process partition
>   reassignment requests and/or does not transfer partition leadership after
>   I kill a broker
> - the previous controller gets "dead-locked" (it has 3-4 of the important
>   controller threads in a lock) and hence does not perform any of its
>   controller duties. But it still regards itself as the valid controller and
>   is accepted by the other brokers
>
> We have seen variants of these behaviours in TEST and PROD over the last
> few days. Of course they're not provoked by kill -STOP there, but rather by
> the stalled underlying Linux VMs (we're working hard on getting those
> replaced by bare metal, but it may take some time).
>
> Before I start filing JIRAs:
>
> - I feel this behaviour is so totally weird that I can hardly believe
>   these are Kafka bugs. They would have hit the community really hard and
>   been uncovered quickly. So I'm rather guessing I'm doing something
>   terribly wrong. Any clue what that might be?
> - if I really start filing JIRAs, should it be a single one, or one per
>   error scenario?
>
> On 3/21/19 4:05 PM, Ismael Juma wrote:
> > Hi Joe,
> >
> > This is not expected behaviour, please file a JIRA.
> >
> > Ismael
> >
> > On Mon, Mar 18, 2019 at 7:29 AM Joe Ammann <j...@pyx.ch> wrote:
> >
> > Hi all
> >
> > We're running several clusters (mostly with 3 brokers) with 2.1.1
> >
> > We quite regularly see the pattern that one of the 3 brokers
> > "detaches" from ZK (the broker id is not registered anymore under
> > /brokers/ids). We assume that the root cause for this is that the
> > brokers are running on VMs (due to company policy, no alternative) and
> > that the VM gets "stalled" for several minutes due to missing resources
> > on the VMware ESX host.
> >
> > This is not new behaviour with 2.1.1, we already saw it with
> > 0.10.2.1 before.
> >
>
> --
> CU, Joe

-- 
CU, Joe
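
PS: For anyone who wants to repeat the freeze/thaw test from the quoted
message (kill -STOP on 2 of 3 ZK instances, wait, then kill -CONT), here is a
minimal sketch in Python. It is only an illustration under assumptions: the
function name freeze_and_thaw, the PID list and the freeze duration are
hypothetical and not part of any Kafka or ZooKeeper tooling, and it must run
on the same host as the processes, as a user allowed to signal them.

```python
import os
import signal
import time


def freeze_and_thaw(pids, freeze_seconds):
    """Suspend the given processes, wait, then resume them.

    Equivalent to `kill -STOP <pid>` followed later by `kill -CONT <pid>`.
    `pids` and `freeze_seconds` are placeholders chosen by the caller.
    """
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)  # process stops being scheduled
    try:
        time.sleep(freeze_seconds)    # the simulated "outage" window
    finally:
        # Always thaw, even if the wait is interrupted (e.g. Ctrl-C).
        for pid in pids:
            os.kill(pid, signal.SIGCONT)
```

A call would look like freeze_and_thaw([pid_zk2, pid_zk3], 120), where the
two PIDs and the 120 seconds are made-up values. To make the brokers lose
their ZK sessions, the freeze has to last longer than their configured
session timeout.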