In theory your approach should have less overhead, since no collect on the
driver is required as the last step. But the difference may be small, and
that particular path may or may not be as well optimized as count(). Do you
have a biggish data set to compare the timings on?
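
To compare the timings, something along these lines might do. It is only a
rough sketch: the local[*] master, the synthetic data set and its size, and
the names MaterializeTiming and timeMs are all placeholders of mine, not
anything from your job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object MaterializeTiming {
  // Run `body` once and return the elapsed wall-clock time in milliseconds.
  def timeMs(body: => Unit): Double = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1e6
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master and synthetic data; swap in a real cluster and data set.
    val conf = new SparkConf().setAppName("materialize-timing").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Build a fresh RDD marked for caching, so each action starts from an uncached state.
      def freshRdd() =
        sc.parallelize(1 to 10000000, 64)
          .map(i => (i, i.toString))
          .persist(StorageLevel.MEMORY_ONLY)

      // Variant 1: materialize with count().
      val a = freshRdd()
      val countMs = timeMs { a.count() }
      a.unpersist(blocking = true)

      // Variant 2: materialize with a no-op foreachPartition.
      val b = freshRdd()
      val foreachMs = timeMs { b.foreachPartition(_ => ()) }
      b.unpersist(blocking = true)

      println(f"count():            $countMs%.1f ms")
      println(f"foreachPartition(): $foreachMs%.1f ms")
    } finally {
      sc.stop()
    }
  }
}
```

A single run like this is noisy, and whichever variant goes first also pays
for JVM warm-up, so it is probably worth repeating each measurement several
times and alternating the order before reading anything into the numbers.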

2015-01-30 14:42 GMT-08:00 Sean Owen <so...@cloudera.com>:

> So far, the canonical way to materialize an RDD just to make sure it's
> cached is to call count(). That's fine but incurs the overhead of
> actually counting the elements.
>
> However, rdd.foreachPartition(p => None) for example also seems to
> cause the RDD to be materialized, and is a no-op. Is that a better way
> to do it or am I not thinking of why it's insufficient?
