You can also do something like:

  rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => { while (iter.hasNext) iter.next() })

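
Here is a fuller sketch of the same idea, in case it is useful. It assumes Spark is on the classpath and runs against a local master; the object name MaterializeSketch, the materialize helper, and the sample data are made up for illustration and are not part of Spark. Draining the iterator explicitly forces each element to be computed, rather than relying on how the caching layer happens to consume the partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper object; only the runJob call comes from the suggestion above.
object MaterializeSketch {

  // Runs one job over the RDD that walks every partition and discards the
  // elements. runJob returns an Array[Unit] with one entry per partition,
  // so almost nothing is shipped back to the driver.
  def materialize[T](rdd: RDD[T]): Unit = {
    rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => {
      while (iter.hasNext) iter.next()
    })
    ()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads, just for trying this out.
    val conf = new SparkConf().setAppName("materialize-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Sample data; cache() only marks the RDD, nothing is computed yet.
      val rdd = sc.parallelize(1 to 1000000, 4).map(_ * 2).cache()

      // Forces computation of all partitions without counting them.
      materialize(rdd)

      // Later actions can now read the cached blocks, memory permitting.
      println(rdd.count())
    } finally {
      sc.stop()
    }
  }
}
```

Whether this beats count() in practice is a separate question; per the timings mentioned below, caching the blocks seems to dominate either way.
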
On Sat, Jan 31, 2015 at 5:24 AM, Sean Owen <so...@cloudera.com> wrote:
> Yeah, from an unscientific test, it looks like the time to cache the
> blocks still dominates. Saving the count is probably a win, but not
> big. Well, maybe good to know.
>
> On Fri, Jan 30, 2015 at 10:47 PM, Stephen Boesch <java...@gmail.com> wrote:
> > Theoretically your approach would require less overhead - i.e. a collect on
> > the driver is not required as the last step. But maybe the difference is
> > small and that particular path may or may not have been properly optimized
> > vs the count(). Do you have a biggish data set to compare the timings?
> >
> > 2015-01-30 14:42 GMT-08:00 Sean Owen <so...@cloudera.com>:
> >>
> >> So far, the canonical way to materialize an RDD just to make sure it's
> >> cached is to call count(). That's fine but incurs the overhead of
> >> actually counting the elements.
> >>
> >> However, rdd.foreachPartition(p => None) for example also seems to
> >> cause the RDD to be materialized, and is a no-op. Is that a better way
> >> to do it or am I not thinking of why it's insufficient?
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org