As far as I know you basically have two options: let partitions be
recomputed (possibly caching/persisting in memory only), or persist to
both memory and disk and pay the cost of writing to disk. The question is
which will be more expensive in your case. In my experience you're better
off letting things be recomputed to begin with, and then experimenting
with persistence.
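
For example, a minimal sketch of the two options (assuming some
JavaRDD<String> called "lines"; note an RDD's storage level can only be
set once):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;

    class PersistChoices {
        static void tryPersist(JavaRDD<String> lines) {
            // Memory only: partitions that don't fit are dropped and
            // recomputed from lineage the next time they're needed.
            lines.persist(StorageLevel.MEMORY_ONLY());

            // Alternative: spill to disk as well, avoiding recomputation
            // at the cost of serializing and writing partitions out.
            // lines.persist(StorageLevel.MEMORY_AND_DISK());

            long n = lines.count();  // the first action materializes the cache
            lines.unpersist();       // release once you're done experimenting
        }
    }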

It seems to me Spark would benefit from allowing actions to be batched up
and then having Spark execute them intelligently. That is, each partition
could be processed by multiple actions after being computed once. But I
don't believe there's any way to achieve this currently.
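
The closest approximation I know of today is to persist the RDD and run
the actions back to back, so each partition is at least only computed once
(cache permitting) rather than once per action. A rough sketch, where
"intermediate" stands in for one of your mid-pipeline RDDs:

    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;

    class InspectWithoutRecompute {
        static void inspect(JavaRDD<String> intermediate) {
            intermediate.persist(StorageLevel.MEMORY_AND_DISK());
            long size = intermediate.count();      // the size check from Steve's question
            List<String> sample = intermediate.take(5);
            System.out.println("size=" + size + ", sample=" + sample);
            // ... continue the real pipeline from the same RDD, then:
            intermediate.unpersist();
        }
    }

The actions still run as separate jobs, though; Spark doesn't pipeline
them over each partition the way I'm describing above.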

If anyone does have a way to achieve this, I'd love to hear it. :-)

On Wed, Nov 12, 2014 at 1:23 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:

>  In my problem I have a number of intermediate JavaRDDs and would like to
> be able to look at their sizes without destroying the RDD for subsequent
> processing. persist will do this, but these are big and persist seems
> expensive, and I am unsure of which StorageLevel is needed. Is there a way
> to clone a JavaRDD, or does anyone have good ideas on how to do this?
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
