I think the cheapest possible way to force materialization is something like

rdd.foreachPartition(i => None)
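As a sketch of the timing pattern being discussed (assumes an existing SparkContext `sc` and a Spark dependency on the classpath; variable names are illustrative):

```scala
// Hypothetical example: time how long it takes to materialize (and cache) an RDD
// by running a no-op action. Assumes `sc: org.apache.spark.SparkContext` exists.
val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
val cached = rdd.cache()

val start = System.nanoTime()
// Cheap action: forces every partition to be computed (and cached),
// without collecting any data back to the driver.
cached.foreachPartition(_ => ())
val elapsedMs = (System.nanoTime() - start) / 1e6

println(s"Materialization took $elapsedMs ms")
```

Subsequent actions on `cached` should then hit the in-memory copy, so a later `count()` is cheaper, though as noted above it is still not free.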

I get the use case, but as you can see there is a cost: you are forced
to materialize an RDD and cache it just to measure the computation
time. In principle this could take significantly more time than not
doing so, since otherwise several RDD stages might proceed without
ever having to persist intermediate results in memory.

Consider looking at the Spark UI to see how much time a stage took,
although it measures end-to-end wall-clock time, which may overlap
with other computations.

(or maybe you are disabling/enabling this logging for prod/test anyway)

On Sat, Feb 21, 2015 at 4:46 AM, pnpritchard
<nicholas.pritch...@falkonry.com> wrote:
> Is there a technique for forcing the evaluation of an RDD?
>
> I have used actions to do so but even the most basic "count" has a
> non-negligible cost (even on a cached RDD, repeated calls to count take
> time).
>
> My use case is for logging the execution time of the major components in my
> application. At the end of each component I have a statement like
> "rdd.cache().count()" and time how long it takes.
>
> Thanks in advance for any advice!
> Nick
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Force-RDD-evaluation-tp21748.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
