By default, num.replica.fetchers = 1, meaning only one thread per broker fetches data from the leaders. As a result, it may take a while for the recovering machine to catch up and rejoin the ISR.
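For example, you could raise it in the broker config (the value below is illustrative; tune it to the network headroom you actually have):

```properties
# server.properties (illustrative value, not a recommendation)
# Number of fetcher threads replicating messages from each source broker.
# More threads parallelize follower catch-up, at the cost of extra
# network and CPU load on both the follower and the leader.
num.replica.fetchers=4
```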
If you have bandwidth to spare, try increasing this value.

Regarding "no data flowing into Kafka": if you have 3 replicas and only one is down, I'd expect writes to continue to the new leader even if one replica is not in the ISR yet. Can you see that a new leader is elected?

Gwen

On Fri, Aug 21, 2015 at 6:50 AM, Jörg Wagner <joerg.wagn...@1und1.de> wrote:
> Hey everyone,
>
> here's my crosspost from IRC.
>
> Our setup:
> 3 Kafka 0.8.2 brokers with ZooKeeper, powerful hardware (20 cores, 27
> log disks each). We use a handful of topics, but only one topic is
> utilized heavily. It has a replication factor of 2 and 600 partitions.
>
> Our issue:
> If one Kafka broker was down, it takes very long (from 1 to >10 hours)
> until all partitions show a full ISR again. This seems to depend heavily
> on the amount of data in the log.dirs (I have configured 27 threads -
> one for each dir, each on its own drive).
> All of this takes this long while there is NO data flowing into Kafka.
>
> We seem to be missing something critical here. It might be some option
> set wrong, or are we thinking wrong and it's not critical to have the
> replicas in sync?
>
> Any pointers would be great.
>
> Cheers
> Jörg
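As a side note, leader election and ISR state can be inspected with the stock tooling (a sketch assuming the standard script location and a local ZooKeeper; "your-topic" is a placeholder):

```shell
# Show leader, replicas, and ISR for every partition of the hot topic
bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic your-topic

# List only partitions whose ISR is smaller than the full replica set
bin/kafka-topics.sh --zookeeper localhost:2181 --describe --under-replicated-partitions
```

If the --describe output shows a leader for every partition, producers should keep writing even while followers are still catching up.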