One pass compute() to produce multiple RDDs

2014-10-09 Thread Akshat Aranya
Hi, Is there a good way to materialize derivative RDDs from, say, a HadoopRDD while reading in the data only once? One way to do so would be to cache the HadoopRDD and then create derivative RDDs, but that would require enough RAM to cache the HadoopRDD, which is not an option in my case. Thanks,
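
For reference, the cache-then-derive approach described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the input path, the local master setting, and the two derived RDDs (lengths and nonEmpty) are hypothetical and not taken from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheThenDerive {
  def main(args: Array[String]): Unit = {
    // Assumption: local master, for illustration only.
    val conf = new SparkConf().setAppName("cache-then-derive").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // textFile is backed by a HadoopRDD; the path is a hypothetical placeholder.
    val source = sc.textFile("hdfs:///path/to/input")

    // cache() marks the RDD to be kept in memory once it has been computed.
    source.cache()

    // Two derivative RDDs built from the same cached parent (hypothetical examples).
    val lengths = source.map(_.length.toLong)
    val nonEmpty = source.filter(_.nonEmpty)

    // The first action reads the input and fills the cache as a side effect.
    val totalChars = lengths.fold(0L)(_ + _)
    // The second action reuses whatever partitions fit in the cache;
    // partitions that did not fit are recomputed from the input.
    val nonEmptyLines = nonEmpty.count()

    println(s"chars=$totalChars nonEmptyLines=$nonEmptyLines")

    source.unpersist()
    sc.stop()
  }
}
```

The input is read only once if every partition fits in memory, which is exactly the constraint that rules this approach out here.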

Re: One pass compute() to produce multiple RDDs

2014-10-09 Thread Sean Owen
Although caching is synonymous with persisting in memory, you can also just persist the result (partially) on disk. At least you would use as much RAM as you can. Obviously that requires re-reading the RDD (partially) from HDFS, and the point is avoiding reading data from HDFS several times. But