i guess i could create a single priorityque per partition, then shuffle to a new rdd with 1 partition, and then reduce?
On Sat, Jul 5, 2014 at 1:16 PM, Koert Kuipers <ko...@tresata.com> wrote: > my initial approach to taking top k values of a rdd was using a > priority-queue monoid. along these lines: > > rdd.mapPartitions({ items => Iterator.single(new PriorityQueue(...)) }, > false).reduce(monoid.plus) > > this works fine, but looking at the code for reduce it first reduces > within a partition (which doesnt help me) and then sends the results to the > driver where these again get reduced. this means that for every partition > the (potentially very bulky) priorityqueue gets shipped to the driver. > > my driver is client side, not inside cluster, and i cannot change this, so > this shipping to driver of all these queues can be expensive. > > is there a better way to do this? should i try to a shuffle first to > reduce the partitions to the minimal amount (since number of queues shipped > is equal to number of partitions)? > > is was a way to reduce to a single item RDD, so the queues stay inside > cluster and i can retrieve the final result with RDD.first? >