Re: Avoiding data shuffling when reading pre-partitioned data from Kafka

2023-03-07 Thread David Morávek
k-serialization-tuning-vol.-1-choosing-your-serializer-if-you-can/ > > > > > > > > *From:* Tommy May > *Sent:* Tuesday, March 7, 2023 3:25 AM > *To:* David Morávek > *Cc:* Ken Krugler ; Flink User List < > user@flink.apache.org> > *Subject:* Re: Avoid

RE: Avoiding data shuffling when reading pre-partitioned data from Kafka

2023-03-07 Thread Schwalbe Matthias
shuffling when reading pre-partitioned data from Kafka ⚠EXTERNAL MESSAGE – CAUTION: Think Before You Click ⚠ Hi Ken & David, Thanks for following up. I've responded to your questions below. If the number of unique keys isn’t huge, I could think of yet another helicopter stunt that you

Re: Avoiding data shuffling when reading pre-partitioned data from Kafka

2023-03-06 Thread Tommy May
Hi Ken & David, Thanks for following up. I've responded to your questions below. If the number of unique keys isn’t huge, I could think of yet another > helicopter stunt that you could try :) Unfortunately the number of keys in our case is huge, they're unique per handful of events. If your d

Re: Avoiding data shuffling when reading pre-partitioned data from Kafka

2023-03-06 Thread David Morávek
Using an operator state for a stateful join isn't great because it's meant to hold only a minimal state related to the operator (e.g., partition tracking). If your data are already pre-partitioned and the partitioning matches (hash partitioning on the JAVA representation of the key yielded by the

Re: Avoiding data shuffling when reading pre-partitioned data from Kafka

2023-03-04 Thread Ken Krugler
Hi Tommy, To use stateful timers, you need to have a keyed stream, which gets tricky when you’re trying to avoid network traffic caused by the keyBy() If the number of unique keys isn’t huge, I could think of yet another helicopter stunt that you could try :) It’s possible to calculate a compo

Re: Avoiding data shuffling when reading pre-partitioned data from Kafka

2023-03-04 Thread Tommy May
Hello Ken, Thanks for the quick response! That is an interesting workaround. In our case though we are using a CoProcessFunction with stateful timers. Is there a similar workaround path available in that case? The one possible way I could find required partitioning data in kafka in a very specific

Re: Avoiding data shuffling when reading pre-partitioned data from Kafka

2023-03-04 Thread Ken Krugler
Hi Tommy, I believe there is a way to make this work currently, but with lots of caveats and constraints. This assumes you want to avoid any network shuffle. 1. Both topics have names that return the same value for ((topicName.hashCode() * 31) & 0x7) % parallelism. 2. Both topics have the

Avoiding data shuffling when reading pre-partitioned data from Kafka

2023-03-03 Thread Tommy May
Hello, My team has a Flink streaming job that does a stateful join across two high throughput kafka topics. This results in a large amount of data ser/de and shuffling (about 1gb/s for context). We're running into a bottleneck on this shuffling step. We've attempted to optimize our flink configura