So okay, this is a little embarrassing, but the core of the issue was that
max open files was not set correctly for kafka. It was not an oversight;
a few things together meant the system configuration was not
changed correctly, so kafka ran with the default value.
No wonder kafka behaved strangely every time we had enough data in
log.dirs and enough connections.
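For anyone who wants to rule out the same problem: the limit that counts is the one the running broker process actually has, which can differ from what a login shell reports. Below is a minimal sketch, assuming Linux and its /proc/<pid>/limits file; the function names are made up for illustration, and the broker PID has to be supplied by you.

```python
import os


def max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line[len("Max open files"):].split()[:2]
            # Values are integers, or the literal string "unlimited".
            return tuple(int(v) if v.isdigit() else v for v in (soft, hard))
    return None


def limits_for_pid(pid):
    """Read the effective open-file limits of a running process (Linux only)."""
    with open("/proc/%d/limits" % pid) as f:
        return max_open_files(f.read())


if __name__ == "__main__" and os.path.exists("/proc/self/limits"):
    # Demo on this Python process; pass the broker's PID instead to check kafka.
    print(limits_for_pid(os.getpid()))
```

If the soft limit printed for the broker PID is still the distribution default, the configuration change has not reached the process.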
Anyhow, that doesn't seem to be the last problem. The brokers get in
sync with each other (within an expected time frame) and everything seems fine.
After a little stress testing (around 40k requests/s), the kafka cluster
falls apart. Describing the topics shows leaders missing (e.g. of brokers
1, 2, 3 only 1 and 3 are leading partitions, although zookeeper lists all
three under /brokers/ids). This ultimately results in partitions being
unavailable and massive "leader not local" spam in the logs.
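To make "leaders missing" easier to quantify, the describe output can be summarized per broker. This is only a sketch: it assumes the per-partition lines of `kafka-topics.sh --describe` look like `Topic: x  Partition: 0  Leader: 1  Replicas: 1,2  Isr: 1,2`, and the topic name and values in SAMPLE are invented for illustration, not taken from the cluster in this thread.

```python
import re
from collections import Counter

# Invented sample in the assumed output format (not real cluster data).
SAMPLE = (
    "Topic:events\tPartitionCount:3\tReplicationFactor:2\tConfigs:\n"
    "\tTopic: events\tPartition: 0\tLeader: 1\tReplicas: 1,2\tIsr: 1\n"
    "\tTopic: events\tPartition: 1\tLeader: -1\tReplicas: 2,3\tIsr: 3\n"
    "\tTopic: events\tPartition: 2\tLeader: 3\tReplicas: 3,1\tIsr: 3,1\n"
)


def parse_describe(output):
    """Turn each per-partition line into a dict of its 'Key: value' fields."""
    rows = []
    for line in output.splitlines():
        if "Partition:" not in line:
            continue  # skips the per-topic summary line and blank lines
        rows.append(dict(re.findall(r"(\w+):\s*(\S*)", line)))
    return rows


def broker_ids(value):
    """Split a comma-separated broker list, tolerating an empty list."""
    return [b for b in value.split(",") if b]


def summarize(rows):
    """Count led partitions per broker; list leaderless and under-replicated ones."""
    leaders = Counter(r["Leader"] for r in rows if r["Leader"] != "-1")
    leaderless = [int(r["Partition"]) for r in rows if r["Leader"] == "-1"]
    under = [int(r["Partition"]) for r in rows
             if len(broker_ids(r["Isr"])) < len(broker_ids(r["Replicas"]))]
    return {"leaders": dict(leaders), "leaderless": leaderless,
            "under_replicated": under}


if __name__ == "__main__":
    print(summarize(parse_describe(SAMPLE)))
```

A broker that is registered in zookeeper but never appears as a key under "leaders" is the situation described above.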
What are we missing?
Cheers
Jörg
On 24.08.2015 10:31, Jörg Wagner wrote:
Thank you for your answers.
@Raja
No, it also seems to happen if we stop kafka completely cleanly.
@Gwen
I was testing the situation with num.replica.fetchers set higher. If
you say that was the right direction, I will try it again. What would
be a good setting? I went with 50 which seemed reasonable (having 27
single disks).
How long should it take to get complete ISR?
Regarding no Data flowing into kafka: I just wanted to point out that
the setup is not yet live. So we can completely stop the usage of
kafka, and it should possibly get into sync faster without a steady
stream of new messages.
On the other hand, kafka itself is working fine during this; it is "just"
missing ISR, and hence redundancy. If I stop another broker (cleanly!),
though, the expected number of partitions tend to end up with
Leader -1, which I assume should not happen.
Cheers
Jörg
On 21.08.2015 19:18, Rajasekar Elango wrote:
We are seeing the same behavior in a 5-broker cluster when losing one broker.
In our case, we are losing the broker as well as the kafka data dir.
Jörg Wagner,
Are you losing just the broker, or the kafka data dir as well?
Gwen,
We have also observed that the latency of messages arriving at consumers
goes up by 10x when we lose a broker. Is that because the broker is busy
handling failed fetch requests and loaded with more data, which slows down
the writes? Also, if we had simply lost the broker and not the data dir,
would the impact have been minimal?
Thanks,
Raja.
On Fri, Aug 21, 2015 at 12:31 PM, Gwen Shapira <g...@confluent.io> wrote:
By default, num.replica.fetchers = 1. This means only one thread per
broker is fetching data from leaders, so it may take a while for the
recovering machine to catch up and rejoin the ISR.
If you have bandwidth to spare, try increasing this value.
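As a rough illustration of what this setting changes, the arithmetic below estimates how many partitions each fetcher thread on a recovering broker has to serve. It is a back-of-the-envelope sketch, not a model of kafka's internals: it assumes replicas are spread evenly over the brokers, that the recovering broker is a follower for every replica it hosts, and that num.replica.fetchers threads are started per source broker. The function name is hypothetical; the inputs (600 partitions, replication 2, 3 brokers, and the values 1 and 50) are the ones mentioned in this thread.

```python
def partitions_per_fetcher(partitions, replication, brokers, num_replica_fetchers):
    """Estimate partitions served by each replica fetcher thread on a recovering broker.

    Assumes an even replica distribution and one set of fetcher threads
    per source broker (both are simplifying assumptions).
    """
    replicas_on_broker = partitions * replication / brokers
    source_brokers = brokers - 1
    fetcher_threads = source_brokers * num_replica_fetchers
    return replicas_on_broker / fetcher_threads


if __name__ == "__main__":
    for fetchers in (1, 50):
        print(fetchers, partitions_per_fetcher(600, 2, 3, fetchers))
```

Under these assumptions the default leaves about 200 partitions per thread, while 50 fetchers brings it down to about 4; whether that helps depends on disk and network headroom.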
Regarding "no data flowing into kafka": if you have 3 replicas and only
one is down, I'd expect writes to continue to the new leader even if one
replica is not in the ISR yet. Can you see that a new leader is elected?
Gwen
On Fri, Aug 21, 2015 at 6:50 AM, Jörg Wagner <joerg.wagn...@1und1.de> wrote:
Hey everyone,
here's my crosspost from irc.
Our setup:
3 kafka 0.8.2 brokers with zookeeper, powerful hardware (20 cores, 27
log disks each). We use a handful of topics, but only one topic is
utilized heavily. It has a replication factor of 2 and 600 partitions.
Our issue:
After one kafka broker has been down, it takes very long (from 1 to >10
hours) until all partitions show their full ISR again. This seems to
depend heavily on the amount of data in the log.dirs (I have configured
27 threads, one for each dir, each on its own drive).
This all takes this long while there is NO data flowing into kafka.
We seem to be missing something critical here. It might be some option
set wrong, or are we thinking about this wrong and it's not critical to
have the replicas in sync?
Any pointers would be great.
Cheers
Jörg
--
Kind regards
Jörg Wagner