Patrick,

If I understand this correctly, I won't be able to do this in the closure provided to mapPartitions(), because that's going to be stateless, in the sense that a hash map that I create within the closure would only be useful for one call of MapPartitionsRDD.compute(). I guess I would need to override mapPartitions() directly within my RDD. Right?
On Tue, Sep 16, 2014 at 4:57 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> If each partition can fit in memory, you can do this using
> mapPartitions and then building an inverse mapping within each
> partition. You'd need to construct a hash map within each partition
> yourself.
>
> On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya <aara...@gmail.com> wrote:
> > I have a use case where my RDD is set up such:
> >
> > Partition 0:
> > K1 -> [V1, V2]
> > K2 -> [V2]
> >
> > Partition 1:
> > K3 -> [V1]
> > K4 -> [V3]
> >
> > I want to invert this RDD, but only within a partition, so that the
> > operation does not require a shuffle. It doesn't matter if the
> > partitions of the inverted RDD have non-unique keys across the
> > partitions, for example:
> >
> > Partition 0:
> > V1 -> [K1]
> > V2 -> [K1, K2]
> >
> > Partition 1:
> > V1 -> [K3]
> > V3 -> [K4]
> >
> > Is there a way to do only a per-partition groupBy, instead of
> > shuffling the entire data?
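For illustration, here is a minimal sketch of the per-partition inversion Patrick describes, written in plain Python so it runs without a Spark cluster. The function `invert_partition` and the sample data are hypothetical names taken from the example in the thread; in Spark, a closure like this would be handed to `rdd.mapPartitions`, and each invocation would process exactly one partition's iterator with its own fresh hash map.

```python
from collections import defaultdict

def invert_partition(pairs):
    # Build the inverse mapping for ONE partition. A fresh dict is
    # created on every invocation, so no state is shared across
    # partitions -- each partition is inverted independently.
    inverted = defaultdict(list)
    for key, values in pairs:
        for value in values:
            inverted[value].append(key)
    # Yield (value, [keys]) pairs for this partition only.
    return iter(inverted.items())

# Hypothetical per-partition data matching the example in the thread:
partition0 = [("K1", ["V1", "V2"]), ("K2", ["V2"])]
partition1 = [("K3", ["V1"]), ("K4", ["V3"])]

# In Spark this would be rdd.mapPartitions(invert_partition); here we
# apply the closure to each partition's iterator directly.
print(dict(invert_partition(iter(partition0))))  # {'V1': ['K1'], 'V2': ['K1', 'K2']}
print(dict(invert_partition(iter(partition1))))  # {'V1': ['K3'], 'V3': ['K4']}
```

Note that keys may repeat across the inverted partitions (e.g. `V1` appears in both), which is acceptable per the original question since no shuffle is performed.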