Hi,

I was wondering if it would be safe for me to make use of
reinterpretAsKeyedStream on a Kafka source in order to have an
"embarrassingly parallel" job without any .keyBy().

My Kafka topic is partitioned by the same id that I later key the session
window operator on. Therefore, in theory, no data needs to be transferred
between subtasks (between the Kafka source and the windowing operator). Can
I avoid that network shuffle by using reinterpretAsKeyedStream on the
source?

I'm worried about the warning from the docs saying:

WARNING: The re-interpreted data stream MUST already be pre-partitioned in
EXACTLY the same way Flink's keyBy would partition the data in a shuffle
w.r.t. key-group assignment.

Of course, the partitioning in Kafka will not be *exactly* the same as
Flink's key-group assignment... What problems might this cause? I tried it
on a very small subset of data and didn't notice any issues.
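To make the mismatch concrete, here is a small self-contained sketch of why the two placements disagree. The flinkSubtask() routing follows the formulas used by Flink's KeyGroupRangeAssignment/MathUtils (key -> murmur-mixed hash -> key group -> subtask); the Kafka side is a simplified stand-in (Kafka's default partitioner actually murmur2-hashes the serialized key bytes, not hashCode()), and the id names and parallelism numbers are made up for illustration:

```java
public class KeyGroupMismatch {

    // Murmur-style bit mix over an int, as in Flink's MathUtils.murmurHash.
    static int murmurHash(int code) {
        code *= 0xcc9e2d51;
        code = Integer.rotateLeft(code, 15);
        code *= 0x1b873593;
        code = Integer.rotateLeft(code, 13);
        code = code * 5 + 0xe6546b64;
        code ^= 4; // length in bytes of an int
        code ^= code >>> 16;
        code *= 0x85ebca6b;
        code ^= code >>> 13;
        code *= 0xc2b2ae35;
        code ^= code >>> 16;
        if (code >= 0) return code;
        if (code != Integer.MIN_VALUE) return -code;
        return 0;
    }

    // key -> key group -> operator subtask, the way keyBy would route it.
    static int flinkSubtask(Object key, int maxParallelism, int parallelism) {
        int keyGroup = murmurHash(key.hashCode()) % maxParallelism;
        return keyGroup * parallelism / maxParallelism;
    }

    // Simplified Kafka-side placement: positive hash modulo partition count.
    static int kafkaPartition(Object key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    // Count how many of n sample ids land on a different subtask than
    // their Kafka partition, assuming source subtask i reads partition i.
    static int countMismatches(int n, int parallelism, int maxParallelism) {
        int mismatches = 0;
        for (int i = 0; i < n; i++) {
            String id = "id-" + i;
            if (kafkaPartition(id, parallelism)
                    != flinkSubtask(id, maxParallelism, parallelism)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // 4 Kafka partitions read by 4 source subtasks, Flink's default
        // maxParallelism of 128.
        System.out.println(countMismatches(100, 4, 128)
                + " of 100 ids are on the wrong subtask");
    }
}
```

The point being: even if both sides hashed the exact same key object, a record on Kafka partition p would only accidentally belong to the key groups assigned to subtask p, which is presumably what the docs' warning is about.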

-- 
Piotr Domagalski
