By default, num.replica.fetchers = 1, so only one thread per broker is
fetching data from leaders. That means it may take a while for the
recovering machine to catch up and rejoin the ISR.

If you have bandwidth to spare, try increasing this value.
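For example, in server.properties (illustrative value - the right number
depends on your network headroom and broker count):

    # more fetcher threads per source broker speeds up replica catch-up
    num.replica.fetchers=4

This is a broker-level setting, so on 0.8.2 it needs a rolling restart to
take effect.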

Regarding "no data flowing into kafka" - If you have 3 replicas and only
one is down, I'd expect writes to continue to the new leader even if one
replica is not in the ISR yet. Can you see that a new leader is elected?
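You can check leadership and catch-up progress with the topic tool, e.g.
(substitute your own ZooKeeper host and topic name):

    bin/kafka-topics.sh --describe --zookeeper <zk-host>:2181 --topic <your-topic>

The output shows the current leader and ISR for each partition; adding
--under-replicated-partitions limits the listing to partitions that are
still catching up.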

Gwen

On Fri, Aug 21, 2015 at 6:50 AM, Jörg Wagner <joerg.wagn...@1und1.de> wrote:

> Hey everyone,
>
> here's my crosspost from irc.
>
> Our setup:
> 3 kafka 0.8.2 brokers with zookeeper, powerful hardware (20 cores, 27
> logdisks each). We use a handful of topics, but only one topic is utilized
> heavily. It features a replication of 2 and 600 partitions.
>
> Our issue:
> If one kafka broker goes down, it takes very long (from 1 to >10 hours)
> until all partitions show a full ISR again. This seems to depend heavily
> on the amount of data in the log.dirs (I have configured 27 threads - one
> for each dir, each on its own drive).
> All of this takes this long while there is NO data flowing into kafka.
>
> We seem to be missing something critical here. Either some option is set
> wrong, or we are thinking about this wrong and it's not actually critical
> to keep the replicas in sync.
>
> Any pointers would be great.
>
> Cheers
> Jörg
>