I have something like:
rdd
.filter(...)
.take(n)
If rdd is large and the filter reduces its size by a lot, and
especially if the result is smaller than n, then the take takes a long
time to execute. I think this is because it all takes place on the
driver, so the driver has to iterate through all of the data. Is there
some way to make a distributed version of take that doesn't execute
locally?
I had in mind something like
rdd
.filter(...)
.zipWithIndex()
.filter{case (value, i) => i < n}
.map(_._1)
However, there's no zipWithIndex, and I haven't seen a simple way to
emulate it. Any ideas?
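One possible workaround (a sketch, not tested at scale) is to emulate zipWithIndex with two passes using mapPartitionsWithIndex: a first pass counts the elements in each partition, the driver turns those counts into per-partition offsets, and a second pass assigns each element a global index from its partition offset. The names `filtered`, `counts`, `offsets`, and `firstN` below are illustrative, not part of any Spark API:

```scala
// Assumes rdd: RDD[T] and n: Int are already defined.
val filtered = rdd.filter(...)

// Pass 1: count elements per partition. The result is one small
// (partitionIndex, count) pair per partition, collected to the driver.
val counts = filtered
  .mapPartitionsWithIndex((pi, it) => Iterator((pi, it.size)))
  .collect()
  .sortBy(_._1)
  .map(_._2)

// Cumulative sums give the starting global index of each partition.
val offsets = counts.scanLeft(0L)(_ + _)

// Pass 2: tag each element with its global index, then keep the
// first n. All of this runs on the executors, not the driver.
val firstN = filtered
  .mapPartitionsWithIndex((pi, it) =>
    it.zipWithIndex.map { case (v, i) => (v, offsets(pi) + i) })
  .filter { case (_, i) => i < n }
  .map(_._1)
```

Note the filtered RDD is evaluated twice (once per pass), so caching it first may be worthwhile. If you only need n distinct elements and don't care about preserving order, assigning unique (non-consecutive) ids per partition would avoid the counting pass entirely.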
--
You received this message because you are subscribed to the Google Groups
"Spark Users" group.