I have a use case where my RDD is laid out as follows:

Partition 0:
K1 -> [V1, V2]
K2 -> [V2]

Partition 1:
K3 -> [V1]
K4 -> [V3]

I want to invert this RDD, but only within each partition, so that the
operation does not require a shuffle.  It doesn't matter if the inverted RDD
ends up with the same key in more than one partition, for example:

Partition 0:
V1 -> [K1]
V2 -> [K1, K2]

Partition 1:
V1 -> [K3]
V3 -> [K4]

Is there a way to do only a per-partition groupBy, instead of shuffling the
entire dataset?
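
To make the desired behaviour concrete, here is a rough sketch of the kind of
per-partition inversion I have in mind.  It assumes PySpark and that something
like RDD.mapPartitions is the right hook, since it runs a function over each
partition's iterator without moving data between partitions.  The name
invert_partition is my own (hypothetical) helper, and the two lists below only
simulate the partitions from the example so the snippet runs without a cluster.

```python
from collections import defaultdict


def invert_partition(records):
    """Invert (key, [values]) pairs from ONE partition into (value, [keys])."""
    inverted = defaultdict(list)
    for key, values in records:
        for value in values:
            inverted[value].append(key)
    # Note: the inverted map for the whole partition is held in memory.
    return iter(inverted.items())


# Simulate the two partitions from the example above (no Spark needed here).
partition0 = [("K1", ["V1", "V2"]), ("K2", ["V2"])]
partition1 = [("K3", ["V1"]), ("K4", ["V3"])]

result0 = dict(invert_partition(iter(partition0)))
result1 = dict(invert_partition(iter(partition1)))

# With a real RDD of (key, list) pairs this would presumably be:
#   inverted = rdd.mapPartitions(invert_partition)
# (assumption: `rdd` already has the partition layout shown above)
```

Since each call only sees one partition's records, V1 can legitimately appear
as a key in both results, which is fine for my purposes.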
