FYI: Here is a related discussion <http://apache-spark-user-list.1001560.n3.nabble.com/Persist-and-unpersist-td6437.html> about this.
On Thu, Jun 12, 2014 at 8:10 PM, innowireless TaeYun Kim <taeyun....@innowireless.co.kr> wrote:

Maybe it would be nice if unpersist() 'triggered' the computation of the other RDDs that depend on this RDD but are not yet computed. The pseudocode could be as follows:

    unpersist()
    {
        if (this rdd has not been persisted)
            return;
        for (each rdd that depends on this rdd but is not yet computed)
            compute_that_rdd;
        do_actual_unpersist();
    }

From: Daniel Siegmann [mailto:daniel.siegm...@velos.io]
Sent: Friday, June 13, 2014 5:38 AM
To: user@spark.apache.org
Subject: Re: Question about RDD cache, unpersist, materialization

I've run into this issue. The goal of cache()/persist() seems to be to avoid recomputing an RDD when its data will be needed multiple times. However, once the downstream RDDs have been computed, the cache is no longer needed. The current design provides no obvious way to detect when the cache is no longer needed so that it can be discarded.

In the case of caching in memory, this may be handled by partitions being dropped (in LRU order) when memory fills up. I need to do some more experimentation to see whether this really works well, or whether allowing memory to fill up causes performance issues, or possibly OOM errors if data isn't correctly freed.

In the case of persisting to disk, I'm not sure whether there is a way to limit the disk space used for caching. Does anyone know if there is such a configuration option? This is a pressing issue for me; I have had jobs fail because nodes ran out of disk space.

On Wed, Jun 11, 2014 at 2:26 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

If you want to force materialization, use .count().

Also, if you can, simply don't unpersist anything unless you really need to free the memory.

—
Sent from Mailbox
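(For illustration, a minimal sketch of that approach, assuming a JavaSparkContext sc; the input file name and the toUpperCase transform are hypothetical stand-ins. Unlike first(), count() computes every partition:

    JavaRDD<String> rdd = sc.textFile("input.txt").cache();  // hypothetical input
    JavaRDD<String> upper = rdd.map(s -> s.toUpperCase());
    upper.cache();
    upper.count();     // action: computes every partition, materializing both caches
    rdd.unpersist();   // safe now; upper is already computed and cached

After the count(), downstream actions on upper are served from its cache and no longer depend on rdd.)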
On Wed, Jun 11, 2014 at 5:13 AM, innowireless TaeYun Kim <taeyun....@innowireless.co.kr> wrote:

BTW, it is possible that rdd.first() does not compute all of the partitions. So first() cannot be used in the situation below.

-----Original Message-----
From: innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
Sent: Wednesday, June 11, 2014 11:40 AM
To: user@spark.apache.org
Subject: Question about RDD cache, unpersist, materialization

Hi,

What I (seem to) know about the RDD persistence API is as follows:
- cache() and persist() are not actions. They only mark the RDD.
- unpersist() is also not an action. It only removes the marking, but if the RDD is already in memory, it is unloaded.

And there seems to be no API to forcibly materialize an RDD without requesting its data through an action method such as first().

So I am faced with the following scenario:

    {
        JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>());  // create an empty RDD for merging
        for (int i = 0; i < 10; i++)
        {
            JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
            rdd.cache();  // Since it will be used twice, cache it.
            rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]);  // Transform and save; rdd materializes.
            rddUnion = rddUnion.union(rdd.map(...).filter(...));  // Do another transform to T and merge by union.
            rdd.unpersist();  // It seems to be no longer needed here. (But actually it still is.)
        }
        // Here rddUnion actually materializes and needs all 10 RDDs that have already been unpersisted,
        // so all 10 RDDs will be rebuilt.
        rddUnion.saveAsTextFile(mergedFileName);
    }

If rddUnion could be materialized and cache()d before the rdd.unpersist() line, the RDDs in the loop would not be needed by rddUnion.saveAsTextFile().

Now, what is the best strategy?
- Do not unpersist any of the 10 RDDs in the loop.
- Materialize rddUnion in the loop by calling a 'light' action API, like first().
- Give up and just rebuild/reload all 10 RDDs when saving rddUnion.

Is there some misunderstanding?

Thanks.

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io  W: www.velos.io
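(A minimal sketch of a variant of the "materialize before unpersist" strategy, deferring unpersist until rddUnion has been materialized by count(), which, unlike first(), computes every partition. It assumes String records and substitutes hypothetical trim/non-empty transforms for the elided map(...)/filter(...); sc, inputFileNames, outputFileNames, and mergedFileName are the placeholders from the scenario above:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    List<JavaRDD<String>> sources = new ArrayList<JavaRDD<String>>();
    JavaRDD<String> rddUnion = sc.parallelize(new ArrayList<String>());
    for (int i = 0; i < 10; i++)
    {
        JavaRDD<String> rdd = sc.textFile(inputFileNames[i]);
        rdd.cache();  // used twice below
        rdd.map(s -> s.trim()).filter(s -> !s.isEmpty())  // hypothetical transforms
           .saveAsTextFile(outputFileNames[i]);
        rddUnion = rddUnion.union(rdd.map(s -> s.trim()).filter(s -> !s.isEmpty()));
        sources.add(rdd);  // defer unpersist until the union is materialized
    }
    rddUnion.cache();
    rddUnion.count();  // action: materializes rddUnion while the sources are still cached
    for (JavaRDD<String> rdd : sources)
        rdd.unpersist();  // safe now; rddUnion no longer depends on them
    rddUnion.saveAsTextFile(mergedFileName);  // served from rddUnion's own cache

The trade-off is that all 10 source RDDs stay cached until the single count() at the end; memory pressure is higher during the loop, but no RDD is ever rebuilt.)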