One pass compute() to produce multiple RDDs

2014-10-09 Thread Akshat Aranya
Hi,

Is there a good way to materialize derivative RDDs from, say, a HadoopRDD
while reading the data only once?  One way would be to cache the HadoopRDD
and then create the derivative RDDs, but that would require enough RAM to
cache the whole HadoopRDD, which is not an option in my case.

Thanks,
Akshat


Re: One pass compute() to produce multiple RDDs

2014-10-09 Thread Sean Owen
Although caching is synonymous with persisting in memory, you can also
just persist the result (partially) on disk. At least you would use as
much RAM as you can.

Obviously that requires re-reading the RDD (partially) from HDFS, and
the point is to avoid reading data from HDFS several times. But maybe
there is expensive work that happens between reading the raw data and
re-using the results, so it's still a win.
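
A minimal PySpark sketch of the persist-to-disk idea described above, not
code from this thread. The function names, the input `path`, and the
comma-split parsing step are all hypothetical; `parse_line` stands in for
whatever expensive work happens after the raw read.

```python
def parse_line(line):
    """Hypothetical stand-in for the expensive step: split a CSV line."""
    return [field.strip() for field in line.split(",")]


def derive_rdds(sc, path):
    """Build two derived RDDs on top of one persisted parent RDD.

    `sc` is an existing SparkContext; `path` is an assumed text input.
    """
    # Imported here so parse_line stays usable without Spark installed.
    from pyspark import StorageLevel

    parsed = sc.textFile(path).map(parse_line)
    # MEMORY_AND_DISK keeps partitions in memory while they fit and
    # writes the remaining partitions to disk.
    parsed.persist(StorageLevel.MEMORY_AND_DISK)

    # Both children read from the persisted parent once it is populated.
    first_fields = parsed.map(lambda fields: fields[0])
    field_counts = parsed.map(lambda fields: len(fields))
    return parsed, first_fields, field_counts
```

Nothing is read until the first action runs on `parsed` or one of its
children; that action populates the persisted copy, and later actions
reuse it. Calling `parsed.unpersist()` afterwards releases the storage.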

There's no equivalent of Hadoop's MultipleOutputs in Spark.
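
Without a MultipleOutputs counterpart, one common workaround (an
assumption here, not something proposed in this thread) is to tag each
record with the output it belongs to and then filter once per tag. The
sketch below uses hypothetical names and a hypothetical "ERROR" rule for
tagging. Each filter is a separate pass over the parent RDD, so the input
is only read once if the tagged parent has been persisted first.

```python
def tag_record(line):
    """Hypothetical classifier: label a line with its target output."""
    tag = "error" if "ERROR" in line else "other"
    return (tag, line)


def split_by_tag(tagged, tags):
    """Return a dict of tag -> RDD holding only that tag's values.

    `tagged` is an RDD of (tag, value) pairs, for example
    lines.map(tag_record), ideally persisted by the caller.
    """
    # Bind `t` as a default argument so each lambda keeps its own tag.
    return {
        t: tagged.filter(lambda kv, t=t: kv[0] == t).values()
        for t in tags
    }
```

A possible call would be `split_by_tag(lines.map(tag_record).persist(), ["error", "other"])`,
where `lines` is an existing RDD of strings.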

