Hi Dan,

1) If the key doesn't change in the downstream operators and you want to avoid 
shuffling, DataStreamUtils#reinterpretAsKeyedStream might be helpful.
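For reference, a minimal sketch of what that could look like. This is not a runnable job as-is: `Event`, `getUserId()`, and the source stream are placeholders standing in for your actual types, and the code assumes the Flink DataStream API is on the classpath.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

// events: a stream whose records are already partitioned by user id
// upstream (e.g. in Kafka), so the key does not change here.
DataStream<Event> events = /* your Kafka source */ null;

// Tell Flink the stream is already keyed by user id. Unlike keyBy(),
// this inserts no network shuffle -- records stay where they are.
// Caution: if the data is NOT actually partitioned this way, state
// lookups will silently go wrong, so the upstream partitioning must
// really match Flink's key-group assignment.
KeyedStream<Event, String> keyed =
        DataStreamUtils.reinterpretAsKeyedStream(events, Event::getUserId);
```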

2) If I understand correctly, your data are already partitioned in Kafka and 
you want to avoid the shuffle that keyBy() would introduce in Flink. One 
solution is to partition your data in Kafka the same way Flink would partition 
it with keyBy(), i.e. write each key to the Kafka partition that corresponds to 
the Flink subtask the key would be assigned to. After that, feel free to use 
DataStreamUtils#reinterpretAsKeyedStream!
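To make the partitioning match, the Kafka producer has to mimic Flink's key-to-subtask mapping: Flink hashes the key into a key group (`hash % maxParallelism`, default maxParallelism 128) and then maps the key group to a subtask via `keyGroup * parallelism / maxParallelism` (see KeyGroupRangeAssignment). A rough self-contained sketch of that mapping, with one caveat: the real Flink code applies a murmur hash on top of `key.hashCode()`, while the placeholder below only masks the sign bit, so the concrete numbers will differ from a real job.

```java
// Sketch of Flink's key-to-subtask assignment, modeled on
// org.apache.flink.runtime.state.KeyGroupRangeAssignment.
// NOTE: hashKey() is a simplified placeholder; Flink actually uses
// MathUtils.murmurHash(key.hashCode()), so exact results differ.
public class KeyToSubtask {

    // Placeholder for Flink's murmur hash of key.hashCode().
    static int hashKey(Object key) {
        return key.hashCode() & 0x7fffffff; // force non-negative
    }

    // keyGroup = hash % maxParallelism (Flink default maxParallelism: 128).
    static int assignToKeyGroup(Object key, int maxParallelism) {
        return hashKey(key) % maxParallelism;
    }

    // subtask = keyGroup * parallelism / maxParallelism.
    // A Kafka producer partitioner that routes each user id to
    // partition subtaskFor(userId, ...) lines the data up with Flink,
    // so reinterpretAsKeyedStream can skip the shuffle.
    static int subtaskFor(Object key, int maxParallelism, int parallelism) {
        int keyGroup = assignToKeyGroup(key, maxParallelism);
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128, parallelism = 4;
        // Same key always lands on the same subtask.
        System.out.println(subtaskFor("user-42", maxParallelism, parallelism));
        System.out.println(subtaskFor("user-42", maxParallelism, parallelism));
    }
}
```

Note that this only stays consistent if the Kafka partition count, the Flink parallelism, and maxParallelism are kept in sync between the producer and the job.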

If your use case is not what I described above, maybe you can provide us with 
more information.

Best,
Senhong

On Jul 22, 2021, 7:33 AM +0800, Dan Hill <quietgol...@gmail.com>, wrote:
> Hi.
>
> 1) If I use the same key in downstream operators (my key is a user id), will 
> the rows stay on the same TaskManager machine?  I join in more info based on 
> the user id as the key.  I'd like for these to stay on the same machine 
> rather than shuffle a bunch of user-specific info to multiple task manager 
> machines.
>
> 2) What are best practices to reduce the number of shuffles when having 
> multiple kafka topics with similar keys (user id)?  E.g. should I make sure 
> the same key writes to the same partition number and then manually assign 
> which flink tasks get which kafka partitions?
