+1 – you can definitely make it work by making sure you are using the same
partitioner (including the same number of partitions).
For most operations, like reduceByKey and updateStateByKey, simply specifying it
is enough.
There are some gotchas for other operations (see the sketch below):
* mapValues and flatMapValues preserve the existing partitioner, but a plain
map does not, so a map in between will force another shuffle.
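As a minimal core-Spark sketch of that idea (the sample data and the partition
count of 8 are just placeholders): partition the intermediate RDD once, keep the
partitioner through mapValues, and hand the same partitioner to reduceByKey.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SamePartitionerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("same-partitioner").setMaster("local[*]"))

    val partitioner = new HashPartitioner(8)   // pick the partition count once

    // Partition the intermediate data up front.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      .partitionBy(partitioner)

    // mapValues keeps the partitioner (a plain map would drop it).
    val scaled = pairs.mapValues(_ * 10)

    // Same partitioner again, so this reduceByKey needs no extra shuffle.
    val sums = scaled.reduceByKey(partitioner, _ + _)
    sums.collect().foreach(println)

    sc.stop()
  }
}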
Well, reduceByKey needs to shuffle if your intermediate data is not
already partitioned in the same way as reduceByKey's partitioning.
reduceByKey() has other signatures that take a partitioner, or simply a
number of partitions, so you can pass in the same partitioner you used in your
previous stage.
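A rough Spark Streaming sketch of those signatures, assuming a socket source on
localhost:9999, a 5-second batch interval, 8 partitions, and /tmp/checkpoint as
the checkpoint directory (all placeholders): the same HashPartitioner is passed
to both reduceByKey and updateStateByKey so the state stays co-partitioned with
each batch's counts.

import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingPartitionerSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-partitioner").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/checkpoint")           // updateStateByKey needs a checkpoint dir

    val partitioner = new HashPartitioner(8)

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _, partitioner)          // signature that takes a Partitioner

    // Reusing the same partitioner keeps the running totals co-partitioned
    // with the per-batch counts.
    val totals = counts.updateStateByKey(
      (values: Seq[Int], state: Option[Int]) => Some(values.sum + state.getOrElse(0)),
      partitioner)

    totals.print()
    ssc.start()
    ssc.awaitTermination()
  }
}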
Hi,
Currently I have a job that spills to disk and memory because reduceByKey
produces a lot of intermediate data that gets shuffled.
How do I use a custom partitioner in Spark Streaming for an intermediate stage
so that the next stage, which uses reduceByKey, does not have to shuffle?