Actually, in the latest versions (0.8.1 and 0.9.0), take first launches one task from the driver that reads only the first partition; if that partition doesn't satisfy the limit, it launches further tasks over progressively more partitions until it finds enough elements.
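That scale-up behavior can be modeled on a single machine. A minimal sketch in Python, with lists standing in for partitions (the `take_scaled` name and the scale factor here are illustrative assumptions, not Spark's actual code):

```python
def take_scaled(partitions, n, scale_factor=4):
    """Model of take reading partitions in waves, as described above.

    partitions: list of lists, standing in for RDD partitions.
    The first wave reads 1 partition; each later wave reads
    scale_factor times as many, until n elements are collected.
    """
    results = []
    num_read = 0
    wave = 1
    while num_read < len(partitions) and len(results) < n:
        # Read the next `wave` partitions (one "job" per wave).
        for part in partitions[num_read:num_read + wave]:
            results.extend(part)
        num_read += wave
        wave *= scale_factor
    return results[:n]
```

When the limit is met by the first partition, only that partition is ever read; when the filter is very selective, the waves grow quickly so the whole RDD is covered in a few jobs rather than one task at a time.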
On Tue, Dec 17, 2013 at 11:04 AM, John Salvatier <[email protected]> wrote:

> I have something like:
>
> rdd
>   .filter(...)
>   .take(n)
>
> If rdd is large and filter reduces the size of the rdd by a lot, and
> especially if it's smaller than n, then the take takes a long time to
> execute. I think this is because it all takes place on the driver, so the
> driver has to iterate through all of the data. Is there some way to make a
> distributed version of take that doesn't execute locally?
>
> I had in mind something like
>
> rdd
>   .filter(...)
>   .zipWithIndex()
>   .filter { case (i, value) => i < n }
>   .map(_._2)
>
> However, there's no zipWithIndex, and I haven't seen a simple way to
> emulate it. Any ideas?
>
> --
> You received this message because you are subscribed to the Google Groups
> "Spark Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
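For what it's worth, the zipWithIndex quoted above can be emulated in two passes: first count the elements in each partition, then give each partition a starting offset equal to the total count of all earlier partitions. A sketch in Python with lists standing in for partitions (in Spark this would be a mapPartitionsWithIndex plus a collect of the counts; the names here are illustrative, not a Spark API):

```python
from itertools import accumulate

def zip_with_index(partitions):
    """Two-pass emulation of zipWithIndex over partitioned data.

    Pass 1: count elements per partition (a small job in Spark).
    Pass 2: each partition numbers its own elements, starting from
    the cumulative count of all earlier partitions.
    """
    counts = [len(p) for p in partitions]
    # offsets[pid] = total number of elements in partitions before pid
    offsets = [0] + list(accumulate(counts))[:-1]
    return [
        [(offsets[pid] + i, x) for i, x in enumerate(part)]
        for pid, part in enumerate(partitions)
    ]
```

The counting pass is cheap because only per-partition totals travel to the driver, and the numbering pass is fully parallel; a distributed take then falls out by keeping pairs whose index is below n, as in the quoted snippet.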
