Re: Painfully slow kafka recovery

Jörg Wagner Tue, 25 Aug 2015 06:19:03 -0700

So okay, this is a little embarrassing, but the core of the issue was that max open files was not set correctly for kafka. It was not an oversight; a few things together meant that the system configuration was not changed correctly, leaving the default value in place.

No wonder that kafka behaved strangely every time we had enough data in log.dirs and enough connections.
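
A minimal sketch of how such a limit could be sanity-checked, assuming the broker keeps roughly two files open per log segment (a .log and an .index file) plus one socket per connection. The function names, segment count and connection count are hypothetical and only illustrate the arithmetic. Note that `resource.getrlimit` reports the limit of the process running the script, so it would have to run as the same user and in the same environment as the broker; on Linux the limits of an already running broker can be read from /proc/<pid>/limits instead.

```python
import resource


def estimate_needed_fds(partitions, segments_per_partition,
                        files_per_segment=2, connections=0):
    """Rough lower bound on file descriptors a broker may hold open.

    Assumes every log segment keeps files_per_segment files open and
    every client or broker connection uses one socket descriptor.
    """
    return partitions * segments_per_partition * files_per_segment + connections


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def soft_limit_sufficient(needed, soft):
    """True if the soft limit covers the estimate (an unlimited value always does)."""
    return soft == resource.RLIM_INFINITY or soft >= needed


if __name__ == "__main__":
    # Hypothetical numbers: 600 partitions hosted, 20 segments each, 2000 connections.
    needed = estimate_needed_fds(600, 20, connections=2000)
    soft, hard = nofile_limits()
    print(f"estimated need: {needed}, soft limit: {soft}, hard limit: {hard}")
    if not soft_limit_sufficient(needed, soft):
        print("soft limit is below the estimate; raise it for the kafka user")
```

The estimate grows with the amount of retained data (more segments) and with the number of connections, which would match the symptom of trouble appearing only once both were large enough.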

Anyhow, that doesn't seem to be the last problem. The brokers get in sync with each other (within an expected time frame), and everything seems fine.

After a little stress testing (around 40k requests/s), the kafka cluster falls apart. Using topics describe we can see leaders missing (e.g. from 1,2,3 only 1 and 3 are leading partitions, although zookeeper lists all under /brokers/ids). This ultimately results in partitions being unavailable and massive "leader not local" spam in the logs.
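
A rough sketch of how the describe output could be summarised to spot this, assuming the tab-separated "Key: value" layout that `kafka-topics.sh --describe` prints. The sample text, the topic name `events` and the helper names are made up for illustration; real output should be checked against this layout before relying on the parser.

```python
from collections import Counter


def parse_describe(text):
    """Parse the per-partition lines of `kafka-topics.sh --describe` output.

    Assumes tab-separated 'Key: value' fields; the topic summary line
    (which has no 'Partition' field) is skipped.
    """
    partitions = []
    for line in text.splitlines():
        fields = {}
        for part in line.strip().split("\t"):
            key, sep, value = part.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "Partition" not in fields:
            continue
        partitions.append({
            "topic": fields.get("Topic", ""),
            "partition": int(fields["Partition"]),
            "leader": int(fields["Leader"]),
            "replicas": [int(b) for b in fields["Replicas"].split(",") if b],
            "isr": [int(b) for b in fields["Isr"].split(",") if b],
        })
    return partitions


def leader_report(partitions, broker_ids):
    """Count leaders per broker and list leaderless / under-replicated partitions."""
    leaders = Counter(p["leader"] for p in partitions if p["leader"] >= 0)
    return {
        "leaders_per_broker": {b: leaders.get(b, 0) for b in broker_ids},
        "leaderless": [p["partition"] for p in partitions if p["leader"] < 0],
        "under_replicated": [p["partition"] for p in partitions
                             if len(p["isr"]) < len(p["replicas"])],
    }


# Hypothetical sample: broker 2 is registered but leads nothing.
SAMPLE = (
    "Topic:events\tPartitionCount:3\tReplicationFactor:2\tConfigs:\n"
    "\tTopic: events\tPartition: 0\tLeader: 1\tReplicas: 1,2\tIsr: 1\n"
    "\tTopic: events\tPartition: 1\tLeader: 3\tReplicas: 2,3\tIsr: 3\n"
    "\tTopic: events\tPartition: 2\tLeader: -1\tReplicas: 2,1\tIsr: \n"
)

if __name__ == "__main__":
    print(leader_report(parse_describe(SAMPLE), broker_ids=[1, 2, 3]))
```

A broker that appears under /brokers/ids but shows a leader count of 0 here, together with partitions reported as leaderless, would correspond to the situation described above.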


What are we missing?

Cheers
Jörg

On 24.08.2015 10:31, Jörg Wagner wrote:

Thank you for your answers.

@Raja
No, it also seems to happen if we stop kafka completely clean.

@Gwen
I was testing the situation with num.replica.fetchers set higher. If you say that was the right direction, I will try it again. What would be a good setting? I went with 50, which seemed reasonable (having 27 single disks).
How long should it take to get complete ISR?
Regarding no data flowing into kafka: I just wanted to point out that the setup is not yet live. So we can completely stop the usage of kafka, and it should presumably get into sync faster without a steady stream of new messages. Kafka itself, on the other hand, is working fine during this, "just" missing ISR and hence redundancy. If I stop another broker (clean!) though, it tends to happen that the expected number of partitions have Leader -1, which I assume should not happen.
Cheers
Jörg

On 21.08.2015 19:18, Rajasekar Elango wrote:
We are seeing the same behavior in a 5-broker cluster when losing one broker.

In our case, we are losing the broker as well as the kafka data dir.

Jörg Wagner,

Are you losing just the broker, or the kafka data dir as well?

Gwen,
We have also observed that the latency of messages arriving at consumers goes
up by 10x when we lose a broker. Is it because the broker is busy handling
failed fetch requests and loaded with more data, and that's slowing down
the writes? Also, if we had simply lost the broker but not the data dir,
would the impact have been minimal?

Thanks,
Raja.
On Fri, Aug 21, 2015 at 12:31 PM, Gwen Shapira <g...@confluent.io> wrote:
By default, num.replica.fetchers = 1. This means only one thread per broker
is fetching data from leaders, so it may take a while for the
recovering machine to catch up and rejoin the ISR.

If you have bandwidth to spare, try increasing this value.
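
As a back-of-the-envelope illustration of why the fetcher count matters, here is a crude model that assumes replication throughput scales linearly with the number of fetcher threads (one set of threads per source broker) until the network link is saturated. Real throughput also depends on disks, fetch sizes and broker load, so this is only an assumption, and all numbers and names below are hypothetical.

```python
def estimate_catch_up_seconds(bytes_to_replicate, per_fetcher_bytes_per_s,
                              num_replica_fetchers, source_brokers,
                              link_bytes_per_s):
    """Estimate how long a recovering broker needs to re-fetch its data.

    Crude model: each fetcher thread sustains per_fetcher_bytes_per_s,
    there are num_replica_fetchers threads per source broker, and the
    total is capped by the network link of the recovering broker.
    """
    uncapped = per_fetcher_bytes_per_s * num_replica_fetchers * source_brokers
    throughput = min(uncapped, link_bytes_per_s)
    return bytes_to_replicate / throughput


if __name__ == "__main__":
    # Hypothetical: 2 TB to re-replicate, 30 MB/s per fetcher thread,
    # 2 source brokers, 10 Gbit/s link (about 1.25 GB/s).
    for fetchers in (1, 4, 50):
        seconds = estimate_catch_up_seconds(2e12, 30e6, fetchers, 2, 1.25e9)
        print(f"num.replica.fetchers={fetchers}: ~{seconds / 3600:.1f} h")
```

Under these assumptions a single fetcher per source broker lands in the range of many hours, and very high values stop helping once the link (or the disks) become the bottleneck.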
Regarding "no data flowing into kafka" - If you have 3 replicas and only one is down, I'd expect writes to continue to the new leader even if one replica is not in the ISR yet. Can you see that a new leader is elected?
Gwen

On Fri, Aug 21, 2015 at 6:50 AM, Jörg Wagner <joerg.wagn...@1und1.de> wrote:
Hey everyone,

here's my crosspost from irc.

Our setup:
3 kafka 0.8.2 brokers with zookeeper, powerful hardware (20 cores, 27
logdisks each). We use a handful of topics, but only one topic is
utilized heavily. It features a replication factor of 2 and 600 partitions.

Our issue:
If one kafka was down, it takes very long (from 1 to >10 hours) to show
that all partitions have all isr again. This seems to heavily depend on
the amount of data which is in the log.dirs (I have configured 27 threads,
one for each dir, each dir on its own drive).
This all takes this long while there is NO data flowing into kafka.

We seem to be missing something critical here. It might be some option
set wrong, or are we thinking wrong and it's not critical to have the
replicas in sync?

Any pointers would be great.

Cheers
Jörg


--
Kind regards

Jörg Wagner

Mobile & Services


1&1 Internet AG | Sapporobogen 6-8 | 80637 München | Germany
Phone: +49 89 14339 324
E-Mail: joerg.wagn...@1und1.de | Web: www.1und1.de

