Excellent! Thank you.
On Tue, Dec 17, 2013 at 11:48 AM, Reynold Xin <[email protected]> wrote:

> Actually, in the latest version (0.8.1 or 0.9.0), take first launches one
> task on the driver, and if the limit is not satisfied by the first
> partition, it launches multiple tasks to find the remaining elements.
>
>
> On Tue, Dec 17, 2013 at 11:04 AM, John Salvatier <[email protected]> wrote:
>
>> I have something like:
>>
>>     rdd
>>       .filter(...)
>>       .take(n)
>>
>> If rdd is large and the filter reduces the size of the rdd by a lot, and
>> especially if the result is smaller than n, then the take takes a long
>> time to execute. I think this is because it all takes place on the
>> driver, so the driver has to iterate through all of the data. Is there
>> some way to make a distributed version of take that doesn't execute
>> locally?
>>
>> I had in mind something like:
>>
>>     rdd
>>       .filter(...)
>>       .zipWithIndex()
>>       .filter { case (i, value) => i < n }
>>       .map(_._2)
>>
>> However, there's no zipWithIndex, and I haven't seen a simple way to
>> emulate it. Any ideas?
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Spark Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
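The zipWithIndex emulation John asks about can be built in two passes over the partitions: count the elements in each partition, compute each partition's starting offset as a prefix sum of those counts, then assign global indices within each partition. On an RDD this would use mapPartitionsWithIndex; the sketch below (an illustration only, not from the thread) models partitions as plain local Seqs so the offset arithmetic is visible without a Spark cluster:

```scala
// Sketch: two-pass zipWithIndex emulation, with an RDD's partitions
// modeled as local sequences. ZipWithIndexSketch is a hypothetical
// name, not a Spark API.
object ZipWithIndexSketch {
  def zipWithIndex[A](partitions: Seq[Seq[A]]): Seq[Seq[(Long, A)]] = {
    // Pass 1: element count per partition (one small task each on Spark).
    val counts = partitions.map(_.size.toLong)
    // Prefix sums give each partition's starting global index.
    val offsets = counts.scanLeft(0L)(_ + _)
    // Pass 2: index elements locally, shifted by the partition's offset.
    partitions.zipWithIndex.map { case (part, i) =>
      part.zipWithIndex.map { case (v, j) => (offsets(i) + j, v) }
    }
  }
}
```

With (index, value) pairs available, the `.filter { case (i, _) => i < n }` step John proposes runs in parallel on each partition, so the driver never iterates the whole dataset. The cost is one extra pass over the data to compute the counts.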
