I suspect the broker is just busier in this state, since it now has to lead
more partitions and handle the extra replication traffic.

It would be worth digging into the specifics of the latency - there's a
request log that you can turn on, I believe.
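
If it helps, here is a minimal sketch of turning it on via the broker's
log4j.properties (assuming the requestAppender from the stock config that
ships with the broker; your appender setup may differ):

    # Raise the request logger from WARN to TRACE to log every request
    # together with its processing-time breakdown (very verbose!)
    log4j.logger.kafka.request.logger=TRACE, requestAppender
    log4j.additivity.kafka.request.logger=false

Each completed request is then logged with its total time split into queue,
local, and remote components, which should show where the extra latency is
coming from.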

On Fri, Aug 21, 2015 at 10:18 AM, Rajasekar Elango <rela...@salesforce.com>
wrote:

> We are seeing the same behavior in a 5-broker cluster when losing one broker.
>
> In our case, we are losing the broker as well as its Kafka data dir.
>
> Jörg Wagner,
>
> Are you losing just the broker, or the Kafka data dir as well?
>
> Gwen,
>
> We have also observed that the latency of messages arriving at consumers
> goes up by 10x when we lose a broker. Is it because the broker is busy
> handling failed fetch requests and loaded with more data, and that's
> slowing down the writes? Also, if we had simply lost the broker but not
> the data dir, would the impact have been minimal?
>
> Thanks,
> Raja.
>
>
>
> On Fri, Aug 21, 2015 at 12:31 PM, Gwen Shapira <g...@confluent.io> wrote:
>
> > By default, num.replica.fetchers = 1. This means only one thread per
> > broker is fetching data from leaders, so it may take a while for the
> > recovering machine to catch up and rejoin the ISR.
> >
> > If you have bandwidth to spare, try increasing this value.
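> >
> > A minimal sketch of the change in server.properties (4 is just an
> > illustrative value - tune it to the bandwidth you actually have to spare):
> >
> >     # server.properties: more threads fetching from each leader,
> >     # so a recovering replica catches up faster
> >     num.replica.fetchers=4
> >
> > It's a static broker setting, so it takes a broker restart to apply.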
> >
> > Regarding "no data flowing into kafka" - If you have 3 replicas and only
> > one is down, I'd expect writes to continue to the new leader even if one
> > replica is not in the ISR yet. Can you see that a new leader is elected?
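> >
> > One way to check, using the stock topics tool (zkhost:2181 and yourtopic
> > below are placeholders for your ZooKeeper connect string and topic):
> >
> >     bin/kafka-topics.sh --describe --zookeeper zkhost:2181 --topic yourtopic
> >
> > The output lists Leader, Replicas and Isr for each partition; a leader of
> > -1 would mean no leader is currently elected for that partition.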
> >
> > Gwen
> >
> > On Fri, Aug 21, 2015 at 6:50 AM, Jörg Wagner <joerg.wagn...@1und1.de>
> > wrote:
> >
> > > Hey everyone,
> > >
> > > Here's my crosspost from IRC.
> > >
> > > Our setup:
> > > 3 Kafka 0.8.2 brokers with ZooKeeper, on powerful hardware (20 cores,
> > > 27 log disks each). We use a handful of topics, but only one topic is
> > > heavily utilized. It has a replication factor of 2 and 600 partitions.
> > >
> > > Our issue:
> > > If one broker was down, it takes very long (from 1 to >10 hours) until
> > > all partitions show a full ISR again. This seems to depend heavily on
> > > the amount of data in the log.dirs (I have configured 27 threads - one
> > > per dir, each on its own drive). And it all takes this long while there
> > > is NO data flowing into Kafka.
> > >
> > > We seem to be missing something critical here. Maybe some option is set
> > > wrong, or maybe our thinking is wrong and it's not actually critical to
> > > have the replicas in sync.
> > >
> > > Any pointers would be great.
> > >
> > > Cheers
> > > Jörg
> > >
> >
>
>
>
> --
> Thanks,
> Raja.
>
