I'd suggest filing a single JIRA as a first step. Please test the 2.2 RC before filing if possible. Please include enough details for someone else to reproduce.
Thanks!
Ismael

On Fri, Mar 22, 2019, 3:14 PM Joe Ammann <j...@pyx.ch> wrote:

> Hi Ismael
>
> I've done a few more tests, and it seems that I'm able to "reproduce"
> various kinds of problems in Kafka 2.1.1 in our DEV. I can force these by
> faking an outage of Zookeeper. What I do for my tests is freeze
> (kill -STOP) 2 out of 3 ZK instances, let the Kafka brokers continue, then
> thaw the ZK instances (kill -CONT) and see what happens.
>
> The ZK nodes always very quickly reunite and build a Quorum after thawing.
>
> But the Kafka brokers (running on the same 3 Linux VMs) quite often show
> problems after this procedure (most of the time they successfully
> re-register and continue to work). I've seen 3 different kinds of problems
> (this is why I put "reproduce" in quotes, I can never predict what will
> happen)
>
> - the brokers get their ZK sessions expired (obviously) and sometimes only
>   2 of 3 re-register under /brokers/ids. The 3rd broker doesn't re-register
>   for some reason (that's the problem I originally described)
> - the brokers all re-register and elect a new controller. But that new
>   controller does not fully work. For example, it doesn't process partition
>   reassignment requests and/or does not transfer partition leadership after
>   I kill a broker
> - the previous controller gets "dead-locked" (it has 3-4 of the important
>   controller threads in a lock) and hence does not perform any of its
>   controller duties. But it still regards itself as the valid controller
>   and is accepted by the other brokers
>
> We have seen variants of these behaviours in TEST and PROD during the last
> days. Of course they're not provoked by kill -STOP, but rather by stalls of
> the underlying Linux VMs (we're heavily working on getting those replaced
> by bare metal, but it may take some time).
>
> Before I start filing JIRAs
>
> - I feel this behaviour is so totally weird that I can hardly believe
>   these are Kafka bugs. Such bugs would have hit the community really hard
>   and been uncovered quickly. So I'm rather guessing I'm doing something
>   terribly wrong. Any clue what that might be?
> - if I really start filing JIRAs, should it be a single one, or one per
>   error scenario?
>
> On 3/21/19 4:05 PM, Ismael Juma wrote:
> > Hi Joe,
> >
> > This is not expected behaviour, please file a JIRA.
> >
> > Ismael
> >
> > On Mon, Mar 18, 2019 at 7:29 AM Joe Ammann <j...@pyx.ch> wrote:
> >
> >     Hi all
> >
> >     We're running several clusters (mostly with 3 brokers) with 2.1.1
> >
> >     We quite regularly see the pattern that one of the 3 brokers
> >     "detaches" from ZK (the broker id is not registered anymore under
> >     /brokers/ids). We assume that the root cause for this is that the
> >     brokers are running on VMs (due to company policy, no alternative)
> >     and that the VM gets "stalled" for several minutes due to missing
> >     resources on the VMware ESX host.
> >
> >     This is not new behaviour with 2.1.1, we already saw it with
> >     0.10.2.1 before.
>
> --
> CU, Joe
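
For anyone trying to reproduce this, the freeze/thaw procedure described in the thread could be scripted roughly as follows. This is only a sketch under assumptions, not the script Joe actually used: the function names freeze_and_thaw and missing_brokers are made up for illustration, the ZooKeeper PIDs have to be supplied by the caller, and the list of registered broker ids has to be read from /brokers/ids with whatever ZooKeeper client is at hand (not shown here).

```python
import os
import signal
import time
from typing import Callable, Iterable, List


def freeze_and_thaw(
    pids: Iterable[int],
    outage_seconds: float,
    kill: Callable[[int, int], None] = os.kill,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Simulate an outage: SIGSTOP every PID, wait, then SIGCONT every PID.

    Equivalent to `kill -STOP <pid>` followed later by `kill -CONT <pid>`.
    `kill` and `sleep` are injectable so the logic can be dry-run in tests.
    """
    pids = list(pids)
    for pid in pids:
        kill(pid, signal.SIGSTOP)  # freeze the process
    try:
        sleep(outage_seconds)  # let the brokers run without a ZK quorum
    finally:
        # Always thaw, even if the wait is interrupted.
        for pid in pids:
            kill(pid, signal.SIGCONT)


def missing_brokers(expected: Iterable[int], registered: Iterable[int]) -> List[int]:
    """Return the broker ids that are expected but not registered.

    `registered` is assumed to be the child list of /brokers/ids, already
    converted to integers by the caller.
    """
    return sorted(set(expected) - set(registered))
```

A run would freeze 2 of the 3 ZK instances for longer than the brokers' ZooKeeper session timeout, thaw them, and then compare the expected broker ids against the children of /brokers/ids. A non-empty result from missing_brokers corresponds to the first failure scenario in the list above.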