Although caching usually means persisting in memory, you can also
persist the result (partially) on disk. That way you at least use as
much RAM as you have available.

Obviously that requires re-reading part of the RDD from disk, and
the point is to avoid reading data from HDFS several times. But maybe
there is expensive work that happens between reading the raw data
and re-using the results, in which case it's still a win.
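
A minimal sketch of the idea in Scala, assuming tab-separated text input.
The paths, the parsing step and the two derived RDDs are hypothetical
placeholders, not anything from the original question. With
StorageLevel.MEMORY_AND_DISK, partitions that do not fit in memory are
written to disk instead of being recomputed from the source.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object DeriveOnce {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("derive-once")
    val sc = new SparkContext(conf)

    // Hypothetical input path; textFile reads through a HadoopRDD underneath.
    val raw = sc.textFile("hdfs:///path/to/input")

    // Stand-in for the expensive work between reading and re-using the data.
    val parsed = raw.map(_.split('\t'))

    // Keep as many partitions in RAM as fit; the rest are stored on disk.
    parsed.persist(StorageLevel.MEMORY_AND_DISK)

    // Two derived RDDs (hypothetical) that both reuse the persisted parent.
    val errors = parsed.filter(f => f.nonEmpty && f(0) == "ERROR")
    val widths = parsed.map(_.length)

    // The first action reads the input and fills the persisted copy;
    // the second action is served from that copy.
    errors.map(_.mkString("\t")).saveAsTextFile("hdfs:///path/to/errors")
    widths.saveAsTextFile("hdfs:///path/to/widths")

    parsed.unpersist()
    sc.stop()
  }
}
```

Each derived RDD still needs its own action and its own output
directory, so this is two jobs over one persisted parent rather than a
single pass with several outputs.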

There's no equivalent of MultipleOutputs.

On Thu, Oct 9, 2014 at 10:55 PM, Akshat Aranya <aara...@gmail.com> wrote:
> Hi,
>
> Is there a good way to materialize derivative RDDs from, say, a HadoopRDD while
> reading in the data only once?  One way to do so would be to cache the
> HadoopRDD and then create derivative RDDs, but that would require enough RAM
> to cache the HadoopRDD which is not an option in my case.
>
> Thanks,
> Akshat
