So okay, this is a little embarrassing, but the core of the issue was that max open files was not set correctly for Kafka. It was not an oversight; a combination of things meant the system configuration change did not take effect, so the default value stayed in place.

No wonder Kafka behaved strangely every time we had enough data in log.dirs and enough connections.
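
For anyone wanting to sanity-check the same thing, here is a minimal sketch of the arithmetic. It assumes a broker holds a descriptor for the .log and .index file of every segment plus one per connection; the segment and connection counts are made-up illustration values, and 400 is simply 600 partitions x 2 replicas / 3 brokers from our setup. Note that `resource.getrlimit` reports the limit of the Python process itself, not of the running broker, so treat the comparison as a rough indication only.

```python
import resource


def estimated_descriptors(partition_replicas, segments_per_partition, connections):
    """Rough count of file descriptors one broker may need.

    Assumption: every log segment keeps two files open (.log and .index)
    and every client or replica connection costs one socket descriptor.
    """
    return partition_replicas * segments_per_partition * 2 + connections


# Limit of *this* process (soft, hard). The broker's own limit can differ;
# on Linux it is visible in /proc/<broker pid>/limits.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

# Hypothetical numbers: 400 partition replicas per broker,
# 20 segments per partition, 2000 connections.
needed = estimated_descriptors(400, 20, 2000)
print("estimated descriptors needed:", needed)
print("soft limit of this process:", soft_limit)
```

If the estimate is anywhere near the soft limit, the limit is too low for that amount of data and connections.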

Anyhow, that doesn't seem to be the last problem. The brokers now get in sync with each other within the expected time frame, and everything seems fine.

After a little stress testing (around 40k requests/s), the Kafka cluster falls apart. Describing the topics shows leaders missing (e.g. of brokers 1, 2 and 3, only 1 and 3 are leading partitions, although ZooKeeper lists all three under /brokers/ids). This ultimately results in partitions being unavailable and massive "leader not local" spam in the logs.
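
To make the symptom easier to spot, here is a small sketch that summarizes topic describe output: how many partitions each broker leads, which partitions have no leader (Leader -1), and which have fewer ISR members than replicas. The sample text and the topic name `events` are hypothetical, and the line format is an assumption about what the describe output looks like; adjust the regular expression if yours differs.

```python
import re
from collections import Counter

# Hypothetical describe output, used only for illustration.
SAMPLE = """\
Topic:events\tPartitionCount:4\tReplicationFactor:2\tConfigs:
\tTopic: events\tPartition: 0\tLeader: 1\tReplicas: 1,2\tIsr: 1,2
\tTopic: events\tPartition: 1\tLeader: 3\tReplicas: 2,3\tIsr: 3
\tTopic: events\tPartition: 2\tLeader: -1\tReplicas: 2,1\tIsr:
\tTopic: events\tPartition: 3\tLeader: 1\tReplicas: 1,3\tIsr: 1,3
"""

# Assumed per-partition line layout: Partition, Leader, Replicas, Isr.
LINE = re.compile(
    r"Partition:\s*(\d+)\s+Leader:\s*(-?\d+)\s+Replicas:\s*([\d,]+)\s+Isr:[ \t]*([\d,]*)"
)


def summarize(describe_output):
    """Return (partitions led per broker, leaderless partitions, under-replicated partitions)."""
    leaders = Counter()
    leaderless = []
    under_replicated = []
    for match in LINE.finditer(describe_output):
        partition = int(match.group(1))
        leader = int(match.group(2))
        replicas = match.group(3).split(",")
        isr = [broker for broker in match.group(4).split(",") if broker]
        if leader == -1:
            leaderless.append(partition)
        else:
            leaders[leader] += 1
        if len(isr) < len(replicas):
            under_replicated.append(partition)
    return leaders, leaderless, under_replicated


leaders, leaderless, under_replicated = summarize(SAMPLE)
print("partitions led per broker:", dict(leaders))
print("partitions without leader:", leaderless)
print("under-replicated partitions:", under_replicated)
```

A broker that is registered in ZooKeeper but never shows up as a key in the first result is the situation described above.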

What are we missing?

Cheers
Jörg

On 24.08.2015 10:31, Jörg Wagner wrote:
Thank you for your answers.

@Raja
No, it also seems to happen if we stop kafka completely clean.

@Gwen
I was testing the situation with num.replica.fetchers set higher. If you say that was the right direction, I will try it again. What would be a good setting? I went with 50, which seemed reasonable given 27 separate disks.
How long should it take to get a complete ISR?

Regarding no data flowing into Kafka: I just wanted to point out that the setup is not live yet. So we can completely stop using Kafka, and it should presumably get into sync faster without a steady stream of new messages. Kafka itself works fine during this time; it is "just" missing ISR, and hence redundancy. If I then stop another broker (cleanly!), though, the expected number of partitions tends to end up with Leader -1, which I assume should not happen.

Cheers
Jörg

On 21.08.2015 19:18, Rajasekar Elango wrote:
We are seeing the same behavior in a 5-broker cluster when losing one broker.

In our case, we are losing the broker as well as the Kafka data dir.

Jörg Wagner,

Are you losing just the broker, or the Kafka data dir as well?

Gwen,

We have also observed that latency of messages arriving at consumers goes
up by 10x when we lose a broker. Is it because the broker is busy
handling failed fetch requests and loaded with more data, which slows down
the writes? Also, if we had simply lost the broker but not the data dir,
would the impact have been minimal?

Thanks,
Raja.



On Fri, Aug 21, 2015 at 12:31 PM, Gwen Shapira <g...@confluent.io> wrote:

By default, num.replica.fetchers = 1. This means only one thread per broker
is fetching data from leaders. This means it may take a while for the
recovering machine to catch up and rejoin the ISR.

If you have bandwidth to spare, try increasing this value.
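
As a back-of-the-envelope illustration of why the fetcher count matters, here is a rough sketch. It assumes catch-up throughput scales linearly with the number of fetcher threads until the network link saturates, which is a simplification; every number in it (data volume, per-thread rate, link speed) is a hypothetical placeholder, not a measurement.

```python
def catch_up_hours(bytes_to_replicate, fetchers, bytes_per_sec_per_fetcher, link_bytes_per_sec):
    """Estimate hours needed for a recovering broker to re-fetch its data.

    Assumption: throughput grows linearly with the number of fetcher
    threads and is capped by the available network bandwidth.
    """
    rate = min(fetchers * bytes_per_sec_per_fetcher, link_bytes_per_sec)
    return bytes_to_replicate / rate / 3600


# Hypothetical: 3.6 TB to re-fetch, 100 MB/s per fetcher thread, 1 GB/s link.
for fetchers in (1, 4, 50):
    hours = catch_up_hours(3_600_000_000_000, fetchers, 100_000_000, 1_000_000_000)
    print(fetchers, "fetcher(s):", round(hours, 2), "hours")
```

Under these assumptions, going beyond roughly ten fetchers buys nothing because the link is already the bottleneck, so a very high setting is not automatically better.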

Regarding "no data flowing into kafka" - If you have 3 replicas and only one is down, I'd expect writes to continue to the new leader even if one replica is not in the ISR yet. Can you see that a new leader is elected?

Gwen

On Fri, Aug 21, 2015 at 6:50 AM, Jörg Wagner <joerg.wagn...@1und1.de>
wrote:

Hey everyone,

here's my crosspost from irc.

Our setup:
3 Kafka 0.8.2 brokers with ZooKeeper, powerful hardware (20 cores, 27 log disks each). We use a handful of topics, but only one topic is utilized heavily. It has a replication factor of 2 and 600 partitions.

Our issue:
If one Kafka broker was down, it takes very long (from 1 to >10 hours) until all partitions show their full ISR again. This seems to depend heavily on the amount of data in the log.dirs (I have configured 27 threads, one for each dir, each on its own drive).
This all takes this long while there is NO data flowing into Kafka.

We seem to be missing something critical here. It might be some option set wrong, or are we thinking about this wrong and it's not critical to have the replicas in sync?

Any pointers would be great.

Cheers
Jörg




