Excellent! Thank you.
On Tue, Dec 17, 2013 at 11:48 AM, Reynold Xin <[email protected]> wrote:

> Actually, in the latest version (0.8.1 or 0.9.0), take first launches one
> task on the driver, and if the limit is not satisfied by the first
> partition, it launches multiple tasks to find the remaining elements.
>
>
> On Tue, Dec 17, 2013 at 11:04 AM, John Salvatier <[email protected]> wrote:
>
>> I have something like:
>>
>>     rdd
>>       .filter(...)
>>       .take(n)
>>
>> If rdd is large and the filter reduces the size of the rdd by a lot, and
>> especially if the result is smaller than n, then the take takes a long
>> time to execute. I think this is because it all takes place on the
>> driver, so the driver has to iterate through all of the data. Is there
>> some way to make a distributed version of take that doesn't execute
>> locally?
>>
>> I had in mind something like:
>>
>>     rdd
>>       .filter(...)
>>       .zipWithIndex()
>>       .filter { case (i, value) => i < n }
>>       .map(_._2)
>>
>> However, there's no zipWithIndex, and I haven't seen a simple way to
>> emulate it. Any ideas?
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Spark Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
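The zipWithIndex emulation John asks about can be built in two passes over the partitions: count the elements in each partition, compute each partition's starting offset as a prefix sum of those counts, then assign global indices within each partition. On an RDD this would use mapPartitionsWithIndex; the sketch below (an illustration only, not from the thread) models partitions as plain local Seqs so the offset arithmetic is visible without a Spark cluster:

```scala
// Sketch: two-pass zipWithIndex emulation, with an RDD's partitions
// modeled as local sequences. ZipWithIndexSketch is a hypothetical
// name, not a Spark API.
object ZipWithIndexSketch {
  def zipWithIndex[A](partitions: Seq[Seq[A]]): Seq[Seq[(Long, A)]] = {
    // Pass 1: element count per partition (one small task each on Spark).
    val counts = partitions.map(_.size.toLong)
    // Prefix sums give each partition's starting global index.
    val offsets = counts.scanLeft(0L)(_ + _)
    // Pass 2: index elements locally, shifted by the partition's offset.
    partitions.zipWithIndex.map { case (part, i) =>
      part.zipWithIndex.map { case (v, j) => (offsets(i) + j, v) }
    }
  }
}
```

With (index, value) pairs available, the `.filter { case (i, _) => i < n }` step John proposes runs in parallel on each partition, so the driver never iterates the whole dataset. The cost is one extra pass over the data to compute the counts.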
