Hi,
Is there a good way to materialize derivative RDDs from, say, a HadoopRDD
while reading in the data only once? One way to do so would be to cache
the HadoopRDD and then create derivative RDDs, but that would require
enough RAM to cache the HadoopRDD, which is not an option in my case.
Thanks,
Although caching usually means persisting in memory, you can also
persist the result (partially) on disk. That way you would still use
as much RAM as is available.
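
A minimal sketch of what that could look like, assuming the Scala API
and the MEMORY_AND_DISK storage level (partitions that do not fit in
memory spill to local disk). The input path, the application name and
the two derived RDDs are hypothetical placeholders, not taken from the
original question:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistOnceSketch {
  def main(args: Array[String]): Unit = {
    // Fall back to local mode only if no master was supplied (e.g. by spark-submit).
    val conf = new SparkConf()
      .setAppName("persist-once-sketch")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical input path; textFile reads through a HadoopRDD underneath.
    val base = sc.textFile("hdfs:///path/to/input")
      .persist(StorageLevel.MEMORY_AND_DISK) // keep what fits in RAM, spill the rest to disk

    // Two illustrative derivative RDDs built from the same persisted parent.
    val lineLengths = base.map(_.length)
    val nonEmptyLines = base.filter(_.nonEmpty)

    // The first action reads the input and fills the persisted partitions;
    // the second action is served from memory / local disk instead.
    println(lineLengths.count())
    println(nonEmptyLines.count())

    base.unpersist()
    sc.stop()
  }
}
```

Partitions that spill are read back from the executors' local disks
rather than recomputed from the source, as long as the persisted blocks
are still available.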
Obviously that requires re-reading the RDD (partially) from HDFS, and
the point is avoiding reading data from HDFS several times. But