Hello. So I had a weird problem this afternoon. I was deploying a Kafka
Streams application and wanted to delete the already existing internal
state data, so I ran kafka-streams-application-reset.sh to do it, as
recommended. It wasn't the first time I had run it, and it had always
worked before, both in staging and in production.
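For reference, the invocation was roughly this (the application id,
broker list and input topic are placeholders here, not the real ones,
and the exact flag spellings can vary a bit between Kafka versions):

    bin/kafka-streams-application-reset.sh \
      --application-id event-counter-app \
      --bootstrap-servers broker1:9092,broker2:9092 \
      --input-topics events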
Anyway, I ran the command and around 2-3 minutes later we realized that
a lot of services using the cluster were basically down, unable to
fetch or produce. After investigating the producer and broker logs I
saw that one broker was not responding, despite the process being up.
It kept spewing `UnknownTopicOrPartitionException` in its logs, while
the other brokers were regularly logging
`NotLeaderForPartitionException`. A ZooKeeper node logged a lot of
this:
> 2017-06-27 15:51:32,897 [myid:2] - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@649]
> - Got user-level KeeperException when processing sessionid:0x159cadf860e0089
> type:setData cxid:0x249af08 zxid:0xb06b3722e txntype:-1 reqpath:n/a
> Error Path:/brokers/topics/event-counter-per-day-store-repartition/partitions/4/state
> Error:KeeperErrorCode = BadVersion for /brokers/topics/event-counter-per-day-store-repartition/partitions/4/state
So from my point of view it looked like that one broker was "down", not
responding to client requests, yet it was still seen as up by the rest
of the cluster, and nobody could produce to or fetch from the
partitions it had previously been the leader for. Running
kafka-topics.sh --describe, I also saw the leader listed as -1 for a
bunch of partitions.
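That was with something like the command below (the ZooKeeper address
is a placeholder), and the output for the affected partitions looked
roughly like the commented line; I'm reconstructing it from memory, so
the replica ids are made up:

    bin/kafka-topics.sh --zookeeper zk1:2181 --describe \
      --topic event-counter-per-day-store-repartition

    # Topic: event-counter-per-day-store-repartition  Partition: 4  Leader: -1  Replicas: 1,2,3  Isr: 2,3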
As soon as I killed the process with `kill -9`, the cluster stabilized
and everything went back to normal within seconds: producers were
working again, as were consumers. After I restarted the broker, it
rejoined the cluster, actually carried out the topic deletions, and has
been behaving correctly since.
I'm not sure what exactly happened, but it was pretty scary. Has this
happened to anyone else? My completely uneducated guess is that running
kafka-streams-application-reset.sh on an application with 5 internal
topics triggered too many deletions at once and somehow left that
broker with an inconsistent ZooKeeper state? I have no idea whether
that's a plausible explanation.
Anyway, right now I think I'm going to stop using
kafka-streams-application-reset.sh and delete the internal topics one
by one instead.
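Probably something along these lines (the ZooKeeper address and the
application-id prefix in the grep are placeholders; the internal topics
are the ones ending in -repartition or -changelog, and deleting them
requires delete.topic.enable=true on the brokers):

    # list the application's internal topics first
    bin/kafka-topics.sh --zookeeper zk1:2181 --list | grep '^event-counter-'

    # then delete them one at a time, waiting for each one to actually
    # disappear before moving on to the next
    bin/kafka-topics.sh --zookeeper zk1:2181 --delete \
      --topic event-counter-per-day-store-repartition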
