my initial approach to taking top k values of a rdd was using a
priority-queue monoid. along these lines:

rdd.mapPartitions({ items => Iterator.single(new PriorityQueue(...)) },
false).reduce(monoid.plus)

this works fine, but looking at the code for reduce it first reduces within
a partition (which doesnt help me) and then sends the results to the driver
where these again get reduced. this means that for every partition the
(potentially very bulky) priorityqueue gets shipped to the driver.

my driver is client side, not inside cluster, and i cannot change this, so
this shipping to driver of all these queues can be expensive.

is there a better way to do this? should i try to a shuffle first to reduce
the partitions to the minimal amount (since number of queues shipped is
equal to number of partitions)?

is was a way to reduce to a single item RDD, so the queues stay inside
cluster and i can retrieve the final result with RDD.first?

Reply via email to