Although caching is synonymous with persisting in memory, you can also just persist the result (partially) on disk. That way you would at least use as much RAM as is available.
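A minimal sketch of what that could look like, assuming a text input and a made-up parse step standing in for the expensive work (the object name, `parse`, `deriveCounts`, and the two derived RDDs are all hypothetical, not anything from the original question). `StorageLevel.MEMORY_AND_DISK` keeps as many partitions in RAM as fit and spills the rest to local disk:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistOnce {
  // Hypothetical stand-in for the expensive work done after reading raw data.
  def parse(line: String): Array[String] = line.split(",")

  // Builds two derived RDDs from one persisted parent and runs an action on each.
  def deriveCounts(sc: SparkContext, path: String): (Long, Long) = {
    // Partitions that do not fit in memory are written to local disk
    // instead of being recomputed from the input.
    val parsed: RDD[Array[String]] =
      sc.textFile(path).map(parse).persist(StorageLevel.MEMORY_AND_DISK)

    // The first action reads the input and materializes the persisted partitions.
    val multiField = parsed.filter(_.length > 1).count()
    // The second action reuses the persisted partitions (from RAM or disk).
    val distinctWidths = parsed.map(_.length).distinct().count()

    parsed.unpersist()
    (multiField, distinctWidths)
  }

  def main(args: Array[String]): Unit = {
    // Falls back to local mode only when no master was supplied externally.
    val conf = new SparkConf().setAppName("persist-once").setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)
    try println(deriveCounts(sc, args(0)))
    finally sc.stop()
  }
}
```

Each derived RDD is still evaluated by its own job; persisting only ensures the parent is not rebuilt from the raw input for the second one.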
Obviously that requires re-reading the RDD (partially) from disk, and the point is to avoid reading the data from HDFS several times. But maybe there is expensive work that happens between reading the raw data and re-using the results, so it's still a win. There's no equivalent of MultipleOutputs.

On Thu, Oct 9, 2014 at 10:55 PM, Akshat Aranya <aara...@gmail.com> wrote:
> Hi,
>
> Is there a good way to materialize derivative RDDs from, say, a HadoopRDD while
> reading in the data only once? One way to do so would be to cache the
> HadoopRDD and then create derivative RDDs, but that would require enough RAM
> to cache the HadoopRDD, which is not an option in my case.
>
> Thanks,
> Akshat