If you want to force materialization use .count()

Also if you can simply don't unpersist anything, unless you really need to free 
the memory 
—
Sent from Mailbox

On Wed, Jun 11, 2014 at 5:13 AM, innowireless TaeYun Kim
<taeyun....@innowireless.co.kr> wrote:

> BTW, it is possible that rdd.first() does not compute the whole partitions.
> So, first() cannot be uses for the situation below.
> -----Original Message-----
> From: innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr] 
> Sent: Wednesday, June 11, 2014 11:40 AM
> To: user@spark.apache.org
> Subject: Question about RDD cache, unpersist, materialization
> Hi,
> What I (seems to) know about RDD persisting API is as follows:
> - cache() and persist() is not an action. It only does a marking.
> - unpersist() is also not an action. It only removes a marking. But if the
> rdd is already in memory, it is unloaded.
> And there seems no API to forcefully materialize the RDD without requiring a
> data by an action method, for example first().
> So, I am faced with the following scenario.
> {
>     JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>());  // create
> empty for merging
>     for (int i = 0; i < 10; i++)
>     {
>         JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
>         rdd.cache();  // Since it will be used twice, cache.
>         rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]);  //
> Transform and save, rdd materializes
>         rddUnion = rddUnion.union(rdd.map(...).filter(...));  // Do another
> transform to T and merge by union
>         rdd.unpersist();  // Now it seems not needed. (But needed actually)
>     }
>     // Here, rddUnion actually materializes, and needs all 10 rdds that
> already unpersisted.
>     // So, rebuilding all 10 rdds will occur.
>     rddUnion.saveAsTextFile(mergedFileName);
> }
> If rddUnion can be materialized before the rdd.unpersist() line and
> cache()d, the rdds in the loop will not be needed on
> rddUnion.saveAsTextFile().
> Now what is the best strategy?
> - Do not unpersist all 10 rdds in the loop.
> - Materialize rddUnion in the loop by calling 'light' action API, like
> first().
> - Give up and just rebuild/reload all 10 rdds when saving rddUnion.
> Is there some misunderstanding?
> Thanks.

Reply via email to