Hi list,
I have a custom kafka consumer app dumping data from various topics to
HDFS. My kafka cluster consists of 5 physical nodes (56 CPU threads,
384G RAM, RAID5s). The consumer is a 28-instance app, 20 consumer
threads in each instance (all using the same consumer group)
This app reads lots of different topics (about 100 of them), some of
these topics receive lots of messages while others are less busy.
I've started hitting an issue as some of the topics started receiving
messages at an even higher rate - according to the metrics, the
consumption of the busiest topic starts to lag behind and the lag keeps
increasing during the traffic peak hours.
At this moment I was able to reduce the lag and eventually catch up by
setting fetch.min.bytes from 40960 back to default 1 and restarting all
my app instances but I'm pretty sure the problem will come back soon.
I don't think adding more partitions to the topic would solve the
problem as the topic already has lots of them - there are more consumer
threads available than the amount of this lagging topic's partitions.
I think I need a way to optimize the consumer somehow, to fetch more
data at once in each pass, if it's available.
So far I was fiddling with various options such as fetch.min.bytes,
max.poll.records, max.poll.interval.ms, and
partition.assignment.strategy but every time there's an actual issue,
it's kind of a guess-work and trial&error sort of thing.
Do you have any tips for me from an actual production environment with
huge traffic?
I stumbled upon this nice blogpost:
https://blog.devcaffeine.com/2016/11/kafka-partition-lag/
In the "One More ‘Why?’" section it talks about the situation I was
trying to describe:
> [assuming we have a consumer reading multiple partitions] when
partition X doesn’t have any messages, it blocks. To avoid heavy,
constant network chatter, the consumer will block for 10 seconds,
waiting for the server to accumulate enough messages to fill a chunk.
If, after the timeout, there are no messages, the consumer shrugs, and
moves on to the next partition.
However, from the article I am not sure what are the properties I am
supposed to change. What are the 'timeout' and 'chunk' size properties,
if I use the terminology from the blogpost?
Thanks in advance for any pointers.
Cheers,
--
David Watzke