Re: Cheapest way to materialize an RDD?

2015-02-02 Thread Raghavendra Pandey
You can also do something like rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => { while (iter.hasNext) iter.next() }) On Sat, Jan 31, 2015 at 5:24 AM, Sean Owen so...@cloudera.com wrote: Yeah, from an unscientific test, it looks like the time to cache the blocks still dominates. Saving
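
For readers who want to try the runJob suggestion above, here is a minimal sketch. It assumes a local Spark install on the classpath; the object name, the `materialize` helper, and the sample data are hypothetical and not from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RunJobDrainSketch {
  // Hypothetical helper: runs one job that walks every partition's iterator
  // and throws the elements away, so nothing is sent back except Unit.
  def materialize[T](rdd: RDD[T]): Unit = {
    rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => {
      while (iter.hasNext) iter.next()
    })
    ()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, just for illustration.
    val conf = new SparkConf().setAppName("runjob-drain-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 100000, 4).map(_ * 2).cache()
      materialize(rdd) // triggers computation of every partition
    } finally {
      sc.stop()
    }
  }
}
```

Draining the iterator by hand makes the intent explicit: every element is produced, none is counted or collected.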

Cheapest way to materialize an RDD?

2015-01-30 Thread Sean Owen
So far, the canonical way to materialize an RDD just to make sure it's cached is to call count(). That's fine but incurs the overhead of actually counting the elements. However, rdd.foreachPartition(p => None) for example also seems to cause the RDD to be materialized, and is a no-op. Is that a
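
A minimal sketch of the two approaches described in this message, count() versus a no-op foreachPartition, assuming a local Spark install. The object name and sample data are hypothetical. The check at the end uses SparkContext.getRDDStorageInfo (a developer API) to see how many partitions were actually cached; whether the no-op variant fully populates the cache on a given Spark version is something to verify rather than take for granted.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MaterializeSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode, small synthetic data set.
    val conf = new SparkConf().setAppName("materialize-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Option 1: the usual idiom, which also counts the elements.
      val counted = sc.parallelize(1 to 100000, 4).map(_ + 1).cache()
      counted.count()

      // Option 2: a job that touches every partition but does no work on it.
      val touched = sc.parallelize(1 to 100000, 4).map(_ + 1).cache()
      touched.foreachPartition(_ => ())

      // Report how many partitions of each cached RDD ended up in storage.
      sc.getRDDStorageInfo.foreach { info =>
        println(s"RDD ${info.id}: ${info.numCachedPartitions} of ${info.numPartitions} partitions cached")
      }
    } finally {
      sc.stop()
    }
  }
}
```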

Re: Cheapest way to materialize an RDD?

2015-01-30 Thread Stephen Boesch
Theoretically your approach would require less overhead - i.e. a collect on the driver is not required as the last step. But maybe the difference is small and that particular path may or may not have been properly optimized vs the count(). Do you have a biggish data set to compare the timings?

Re: Cheapest way to materialize an RDD?

2015-01-30 Thread Sean Owen
Yeah, from an unscientific test, it looks like the time to cache the blocks still dominates. Saving the count is probably a win, but not big. Well, maybe good to know. On Fri, Jan 30, 2015 at 10:47 PM, Stephen Boesch java...@gmail.com wrote: Theoretically your approach would require less
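
The kind of unscientific comparison mentioned here could look roughly like the sketch below. This is not the test that was actually run; the data size, the `timeMs` helper, and the object name are assumptions, and single-shot wall-clock timings on a JVM are noisy (warm-up, GC), so treat any numbers as indicative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MaterializeTimingSketch {
  // Hypothetical helper: runs the body once and reports elapsed wall-clock time.
  def timeMs(label: String)(body: => Unit): Long = {
    val start = System.nanoTime()
    body
    val ms = (System.nanoTime() - start) / 1000000
    println(s"$label: $ms ms")
    ms
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("materialize-timing-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Build a fresh cached RDD for each run so neither benefits from the other's cache.
      def build(): RDD[(Int, String)] =
        sc.parallelize(1 to 1000000, 8).map(i => (i, i.toString)).cache()

      val a = build()
      timeMs("count()") { a.count() }
      a.unpersist(blocking = true)

      val b = build()
      timeMs("foreachPartition no-op") { b.foreachPartition(_ => ()) }
      b.unpersist(blocking = true)
    } finally {
      sc.stop()
    }
  }
}
```

If caching the blocks dominates, as reported above, the two timings should come out close to each other.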