I have something like:

rdd
.filter(...)
.take(n)

If rdd is large and filter reduces its size by a lot, and especially if
the filtered result is smaller than n, then the take takes a long time to
execute. I think this is because the take happens on the driver, so the
driver has to iterate through all of the data. Is there some way to make a
distributed version of take that doesn't execute locally?

I had in mind something like

rdd
.filter(...)
.zipWithIndex()
.filter { case (value, i) => i < n }
.map(_._1)

However, there's no zipWithIndex, and I haven't seen a simple way to
emulate it. Any ideas?
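For what it's worth, here is a minimal sketch of the emulation I was imagining, in plain Scala, with `Seq`s standing in for an RDD's partitions (an assumption for illustration; in Spark the two passes below would presumably be a small counting job plus a `mapPartitionsWithIndex` pass):

```scala
// Emulate zipWithIndex in two passes:
//   1. count the elements in each partition (one small job in Spark),
//   2. turn the counts into per-partition starting offsets, then tag each
//      element with offset + its local position within the partition.
// Plain Seqs stand in for RDD partitions here.
def zipWithIndexEmulated[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Long)]] = {
  // Pass 1: size of each partition.
  val counts = partitions.map(_.size.toLong)
  // Global index of the first element of each partition.
  val offsets = counts.scanLeft(0L)(_ + _)
  // Pass 2: assign global indices locally within each partition.
  partitions.zip(offsets).map { case (part, off) =>
    part.zipWithIndex.map { case (v, i) => (v, off + i) }
  }
}
```

With the indices assigned this way, the `filter { case (value, i) => i < n }` step could then run on the workers rather than the driver.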
