You can also do something like
rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => {
  while (iter.hasNext) iter.next()
})
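Wrapped up as a reusable helper, that trick might look like the sketch below. The name `materialize` is made up for illustration, not a Spark API; it assumes draining the partition iterators (rather than an exact element count) is all you want.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (`materialize` is an invented name, not part of
// Spark's API): run a job that drains every partition's iterator and
// discards the elements, so a cached RDD's blocks get computed without
// the cost of counting and returning a result to the driver.
def materialize[T](rdd: RDD[T]): Unit = {
  rdd.sparkContext.runJob(rdd, (iter: Iterator[T]) => {
    while (iter.hasNext) iter.next()
  })
}

// Usage sketch, assuming a live SparkContext `sc`:
//   val rdd = sc.parallelize(1 to 1000000).cache()
//   materialize(rdd)   // blocks are now cached; nothing was collected
```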
On Sat, Jan 31, 2015 at 5:24 AM, Sean Owen so...@cloudera.com wrote:
Yeah, from an unscientific test, it looks like the time to cache the
blocks still dominates. Saving the count is probably a win, but not
big. Well, maybe good to know.
So far, the canonical way to materialize an RDD just to make sure it's
cached is to call count(). That's fine but incurs the overhead of
actually counting the elements.
However, rdd.foreachPartition(p => None) for example also seems to
cause the RDD to be materialized, and is a no-op. Is that a
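One reason even a do-nothing pass can materialize a cached RDD: when an RDD is marked for caching, the block manager computes and stores the whole partition before handing the task its iterator, so the blocks end up cached even if the task function barely touches them. Independently, draining a lazy iterator is by itself enough to force every element to be computed. That second point can be sketched in plain Scala, with no Spark required:

```scala
// Lazy iterators compute elements only as they are pulled, so a
// "do nothing" drain is enough to force every element to be evaluated.
var evaluated = 0
val lazyIter = Iterator.range(0, 5).map { x => evaluated += 1; x * 2 }

assert(evaluated == 0) // nothing computed yet: the map is lazy

while (lazyIter.hasNext) lazyIter.next() // the no-op drain

assert(evaluated == 5) // every element was computed, none were kept
```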
Theoretically your approach would require less overhead - i.e. a collect on
the driver is not required as the last step. But maybe the difference is
small and that particular path may or may not have been properly optimized
vs the count(). Do you have a biggish data set to compare the timings?
Yeah, from an unscientific test, it looks like the time to cache the
blocks still dominates. Saving the count is probably a win, but not
big. Well, maybe good to know.
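An unscientific comparison like Sean's could be reproduced with a rough harness along these lines. This is only a sketch: it assumes a live SparkContext `sc`, and the usual caveats about JVM micro-benchmarks (warm-up, JIT, GC) apply, so treat any numbers as indicative at best.

```scala
// Rough timing sketch: first materialization via count() vs. via a
// no-op foreachPartition. Assumes an existing SparkContext `sc`.
def time[A](label: String)(body: => A): A = {
  val t0 = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - t0) / 1e6} ms")
  result
}

val rdd = sc.parallelize(1 to 10000000).map(_ * 2).cache()

time("materialize via count()") { rdd.count() }

// Reset the cache so both paths pay the caching cost from cold.
rdd.unpersist(blocking = true)
rdd.cache()

time("materialize via foreachPartition") {
  rdd.foreachPartition(_ => ())
}
```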
On Fri, Jan 30, 2015 at 10:47 PM, Stephen Boesch java...@gmail.com wrote:
Theoretically your approach would require less overhead - i.e. a collect on
the driver is not required as the last step. But maybe the difference is
small and that particular path may or may not have been properly optimized
vs the count(). Do you have a biggish data set to compare the timings?