Re: taking top k values of rdd

Koert Kuipers Sat, 05 Jul 2014 10:21:25 -0700

i guess i could create a single priorityque per partition, then shuffle to
a new rdd with 1 partition, and then reduce?



On Sat, Jul 5, 2014 at 1:16 PM, Koert Kuipers <ko...@tresata.com> wrote:

> my initial approach to taking top k values of a rdd was using a
> priority-queue monoid. along these lines:
>
> rdd.mapPartitions({ items => Iterator.single(new PriorityQueue(...)) },
> false).reduce(monoid.plus)
>
> this works fine, but looking at the code for reduce it first reduces
> within a partition (which doesnt help me) and then sends the results to the
> driver where these again get reduced. this means that for every partition
> the (potentially very bulky) priorityqueue gets shipped to the driver.
>
> my driver is client side, not inside cluster, and i cannot change this, so
> this shipping to driver of all these queues can be expensive.
>
> is there a better way to do this? should i try to a shuffle first to
> reduce the partitions to the minimal amount (since number of queues shipped
> is equal to number of partitions)?
>
> is was a way to reduce to a single item RDD, so the queues stay inside
> cluster and i can retrieve the final result with RDD.first?
>

Re: taking top k values of rdd

Reply via email to