As far as I know you basically have two options: let partitions be recomputed (possibly persisting at a memory-only storage level), or persist to disk (and memory) and pay the cost of writing to disk. The question is which will be more expensive in your case. In my experience you're better off letting things be recomputed to begin with and then experimenting with persisting.
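For what it's worth, checking a size via persist might look something like this (a minimal, untested sketch; `intermediate` stands in for one of your intermediate JavaRDDs):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;

    // assuming `intermediate` is one of your intermediate JavaRDDs
    intermediate.persist(StorageLevel.MEMORY_ONLY());  // or MEMORY_AND_DISK() if it won't fit
    long size = intermediate.count();  // this action computes (and caches) the partitions
    // ... subsequent transformations/actions reuse the cached partitions ...
    intermediate.unpersist();  // free the cache once you're done with it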
It seems to me Spark would benefit from allowing actions to be batched up and then executed intelligently, so that each partition could be processed by multiple actions after being computed once. But I don't believe there's any way to achieve this currently. If anyone does have a way to achieve this, I'd love to hear it. :-)

On Wed, Nov 12, 2014 at 1:23 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> In my problem I have a number of intermediate JavaRDDs and would like to
> be able to look at their sizes without destroying the RDD for subsequent
> processing. persist will do this but these are big and persist seems
> expensive and I am unsure of which StorageLevel is needed. Is there a way
> to clone a JavaRDD, or does anyone have good ideas on how to do this?

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io  W: www.velos.io