Thanks Vincent. That's a good start for now.
If you get a chance to forward some logs that would be great.

-Bill

On Tue, Jun 27, 2017 at 6:10 PM, Vincent Rischmann <m...@vrischmann.me> wrote:
> Sure.
>
> The application reads a topic of keyless events and, based on some
> criteria of the event, it creates a new key and uses that for
> `selectKey`. Then I groupByKey and count with 3 different windows.
> Each count is then stored in a database.
>
> The three windows are tumbling windows:
> - 1 minute window, 1 day retention
> - 1 hour window, 7 days retention
> - 1 day window, 15 days retention
>
> That's basically it for the structure.
> The input topic has 64 partitions.
>
> Tomorrow I can get some logs from Kafka/Zookeeper if that would help.
>
> On Tue, Jun 27, 2017, at 11:41 PM, Bill Bejeck wrote:
> > Hi Vincent,
> >
> > Thanks for reporting this issue. Could you give us some more details
> > (number of topics, partitions per topic, and the structure of your
> > Kafka Streams application) so we can attempt to reproduce and
> > diagnose the issue?
> >
> > Thanks!
> > Bill
> >
> > On 2017-06-27 14:46 (-0400), Vincent Rischmann <m...@vrischmann.me> wrote:
> > > Hello, so I had a weird problem this afternoon. I was deploying a
> > > streams application and wanted to delete the already existing
> > > internal state data, so I ran kafka-streams-application-reset.sh to
> > > do it, as recommended. It wasn't the first time I ran it, and it had
> > > always worked before, in staging or in production.
> > >
> > > Anyway, I ran the command and around 2-3 minutes later we realized a
> > > lot of stuff using the cluster was basically down, unable to fetch
> > > or produce. After investigating logs from the producers and the
> > > brokers, I saw that one broker was not responding, despite the
> > > process being up. It kept spewing `UnknownTopicOrPartitionException`
> > > in the logs, and other brokers were spewing
> > > `NotLeaderForPartitionException` regularly. A zookeeper node logged
> > > a lot of this:
> > >
> > > > 2017-06-27 15:51:32,897 [myid:2] - INFO
> > > > [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@649] - Got
> > > > user-level KeeperException when processing
> > > > sessionid:0x159cadf860e0089 type:setData cxid:0x249af08
> > > > zxid:0xb06b3722e txntype:-1 reqpath:n/a Error
> > > > Path:/brokers/topics/event-counter-per-day-store-repartition/partitions/4/state
> > > > Error:KeeperErrorCode = BadVersion for
> > > > /brokers/topics/event-counter-per-day-store-repartition/partitions/4/state
> > >
> > > So from my point of view it looked like that one broker was "down",
> > > not responding to user requests, yet it was still seen as up by the
> > > cluster, and nobody could produce or fetch for the partitions it was
> > > previously the leader of. Running kafka-topics.sh --describe I also
> > > saw the leader being -1 for a bunch of partitions.
> > >
> > > As soon as I `kill -9`'d the process, the cluster stabilized and
> > > everything went back to normal in a matter of seconds; producers
> > > were working again, as well as consumers. After I restarted the
> > > broker, it joined the cluster, proceeded to actually do the topic
> > > deletions, and rejoined correctly too.
> > >
> > > I'm not sure what exactly happened, but that was pretty scary. Has
> > > it happened to anyone else? My completely uneducated guess is that
> > > somehow, using kafka-streams-application-reset.sh on an application
> > > with 5 internal topics caused too many deletions at once and thus
> > > caused a broker to end up with a wrong zookeeper state? I have no
> > > idea if that's a plausible explanation.
> > >
> > > Anyway, right now I think I'm going to stop using
> > > kafka-streams-application-reset.sh and delete the topics one by one.
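[Editor's sketch] For readers following along, here is a minimal sketch of the kind of topology Vincent describes earlier in the thread (keyless input, `selectKey`, `groupByKey`, three tumbling-window counts with different retentions, each count written out to a database). It uses the newer Kafka Streams 2.x DSL rather than the 0.10/0.11 API that would have been current in 2017, and the topic name `events`, the key-derivation logic, and the database write are all hypothetical placeholders; the store names merely echo the internal topic visible in the ZooKeeper log above.

    import java.time.Duration;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.KGroupedStream;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.kstream.Windowed;
    import org.apache.kafka.streams.state.WindowStore;

    public class EventCounterTopology {

        public static Topology build() {
            StreamsBuilder builder = new StreamsBuilder();

            // The input topic is keyless, so every event is re-keyed from its payload.
            // "events" and deriveKey() are placeholders for the real topic and criteria.
            KStream<String, String> events =
                builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()));

            KGroupedStream<String, String> grouped = events
                .selectKey((ignoredKey, value) -> deriveKey(value))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()));

            // Three tumbling windows, each with its own retention and state store.
            windowedCount(grouped, Duration.ofMinutes(1), Duration.ofDays(1),
                          "event-counter-per-minute-store");
            windowedCount(grouped, Duration.ofHours(1), Duration.ofDays(7),
                          "event-counter-per-hour-store");
            windowedCount(grouped, Duration.ofDays(1), Duration.ofDays(15),
                          "event-counter-per-day-store");

            return builder.build();
        }

        private static void windowedCount(KGroupedStream<String, String> grouped,
                                          Duration windowSize,
                                          Duration retention,
                                          String storeName) {
            grouped
                .windowedBy(TimeWindows.of(windowSize).grace(Duration.ZERO))
                .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as(storeName)
                           .withRetention(retention))
                .toStream()
                // Placeholder for writing each windowed count to the database.
                .foreach((windowedKey, count) -> saveCount(storeName, windowedKey, count));
        }

        private static String deriveKey(String value) {
            return value; // stand-in for the real key-derivation criteria
        }

        private static void saveCount(String store, Windowed<String> key, long count) {
            // stand-in for the database write
        }
    }

A topology like this creates a repartition topic for the re-keying step plus a changelog topic per windowed store; those internal topics are what kafka-streams-application-reset.sh deletes, which is why the thread's guess centers on many near-simultaneous topic deletions.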