Avoiding data shuffling when reading pre-partitioned data from Kafka

Tommy May Fri, 03 Mar 2023 14:57:19 -0800

Hello,

My team has a Flink streaming job that does a stateful join across two high
throughput kafka topics. This results in a large amount of data ser/de and
shuffling (about 1gb/s for context). We're running into a bottleneck on
this shuffling step. We've attempted to optimize our flink configuration,
join logic, scale out the kafka topics & flink job, and speed up state
access. What we see is that the join step causes backpressure on the kafka
sources and lag slowly starts to accumulate.


One idea we had to optimize this is to pre-partition the data in kafka on
the same key that the join is happening on. This'll effectively reduce data
shuffling to 0 and remove the bottleneck that we're seeing. I've done some
research into the topic and from what I understand this is not
straightforward to take advantage of in Flink. It looks to be a fairly
commonly requested feature based on the many StackOverflow posts and slack
questions, and I noticed there is FLIP-186 which attempts to address this
topic as well.

Are there any upcoming plans to add this feature to a future Flink release?
I believe it'd be super impactful for similar large scale jobs out there.
I'd be interested in helping as well, but admittedly I'm relatively new to
Flink.  I poked around the code a bit, and it certainly did not seem like a
straightforward addition, so it may be best handled by someone with more
internal knowledge.

Thanks,
Tommy

Avoiding data shuffling when reading pre-partitioned data from Kafka

Reply via email to