If you want to force materialization, use .count().
Also, if you can, simply don't unpersist anything unless you really need to free the memory.

On Wed, Jun 11, 2014 at 5:13 AM, innowireless TaeYun Kim <taeyun....@innowireless.co.kr> wrote:
> BTW, it is possible that rdd.first() does not compute all the partitions.
> So first() cannot be used for the situation below.
>
> -----Original Message-----
> From: innowireless TaeYun Kim [mailto:taeyun....@innowireless.co.kr]
> Sent: Wednesday, June 11, 2014 11:40 AM
> To: user@spark.apache.org
> Subject: Question about RDD cache, unpersist, materialization
>
> Hi,
>
> What I (seem to) know about the RDD persisting API is as follows:
> - cache() and persist() are not actions. They only mark the RDD.
> - unpersist() is also not an action. It only removes the mark. But if the
>   RDD is already in memory, it is unloaded.
>
> And there seems to be no API to forcefully materialize the RDD without
> requesting data through an action method, for example first().
>
> So, I am faced with the following scenario.
>
> {
>     JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>()); // create empty for merging
>     for (int i = 0; i < 10; i++)
>     {
>         JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
>         rdd.cache(); // Since it will be used twice, cache.
>         rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]); // Transform and save, rdd materializes
>         rddUnion = rddUnion.union(rdd.map(...).filter(...)); // Do another transform to T and merge by union
>         rdd.unpersist(); // Now it seems not needed. (But needed actually)
>     }
>     // Here, rddUnion actually materializes, and needs all 10 rdds that were already unpersisted.
>     // So, rebuilding all 10 rdds will occur.
>     rddUnion.saveAsTextFile(mergedFileName);
> }
>
> If rddUnion can be materialized before the rdd.unpersist() line and
> cache()d, the rdds in the loop will not be needed on
> rddUnion.saveAsTextFile().
>
> Now what is the best strategy?
> - Do not unpersist any of the 10 rdds in the loop.
> - Materialize rddUnion in the loop by calling a 'light' action API, like first().
> - Give up and just rebuild/reload all 10 rdds when saving rddUnion.
>
> Is there some misunderstanding?
>
> Thanks.
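
To make the recomputation in the quoted scenario concrete, here is a small toy model in plain Java. It is not the Spark API: LazyData, countSourceLoads and the two-element fake inputs are hypothetical names made up for illustration. It only mimics the semantics described above (cache() sets a mark, unpersist() drops the mark and any stored data, count() is an action), so treat it as a sketch of the reasoning rather than a statement about Spark internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Toy model only: LazyData is a hypothetical stand-in for an RDD, not the Spark API.
class LazyCacheDemo {

    static final class LazyData {
        private final Supplier<List<String>> compute; // recipe to (re)build the data
        private boolean marked = false;               // set by cache(), cleared by unpersist()
        private List<String> stored = null;           // kept only while marked

        LazyData(Supplier<List<String>> compute) { this.compute = compute; }

        // Marks the data for caching; nothing is computed here.
        LazyData cache() { marked = true; return this; }

        // Removes the mark and drops any stored data.
        LazyData unpersist() { marked = false; stored = null; return this; }

        // Lazy transformation: only records how to derive the result.
        LazyData map(UnaryOperator<String> f) {
            return new LazyData(() -> {
                List<String> out = new ArrayList<>();
                for (String s : evaluate()) out.add(f.apply(s));
                return out;
            });
        }

        // Lazy union: both parents are evaluated only when an action runs.
        LazyData union(LazyData other) {
            return new LazyData(() -> {
                List<String> out = new ArrayList<>(evaluate());
                out.addAll(other.evaluate());
                return out;
            });
        }

        // Action: forces evaluation of the whole lineage.
        long count() { return evaluate().size(); }

        private List<String> evaluate() {
            if (stored != null) return stored;   // cache hit
            List<String> result = compute.get(); // (re)compute from parents
            if (marked) stored = result;         // keep only if marked
            return result;
        }
    }

    // Replays the loop from the question and returns how often the inputs were loaded.
    static int countSourceLoads(boolean unpersistInLoop, int inputs) {
        int[] loads = {0};
        List<LazyData> sources = new ArrayList<>();
        LazyData union = new LazyData(ArrayList::new); // empty, for merging
        for (int i = 0; i < inputs; i++) {
            final int id = i;
            LazyData source = new LazyData(() -> {
                loads[0]++; // stands in for reading an input file
                return Arrays.asList("a" + id, "b" + id);
            }).cache();
            sources.add(source);
            source.map(String::toUpperCase).count();              // first use: an action
            union = union.union(source.map(String::toLowerCase)); // second use: still lazy
            if (unpersistInLoop) source.unpersist();
        }
        union.count();                              // the union materializes only here
        for (LazyData s : sources) s.unpersist();   // safe to release everything now
        return loads[0];
    }

    public static void main(String[] args) {
        System.out.println("unpersist inside loop: " + countSourceLoads(true, 10) + " loads");
        System.out.println("unpersist after union: " + countSourceLoads(false, 10) + " loads");
    }
}
```

In this model, unpersisting inside the loop loads every input twice (20 loads for 10 inputs), while unpersisting only after the final action loads each input once (10 loads). That matches the advice above: run the action that needs the cached data first, and unpersist afterwards only if the memory is really needed.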