Hi,

What I (seems to) know about RDD persisting API is as follows:
- cache() and persist() is not an action. It only does a marking.
- unpersist() is also not an action. It only removes a marking. But if the
rdd is already in memory, it is unloaded.

And there seems no API to forcefully materialize the RDD without requiring a
data by an action method, for example first().

So, I am faced with the following scenario.

{
    JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>());  // create
empty for merging
    for (int i = 0; i < 10; i++)
    {
        JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
        rdd.cache();  // Since it will be used twice, cache.
        rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]);  //
Transform and save, rdd materializes
        rddUnion = rddUnion.union(rdd.map(...).filter(...));  // Do another
transform to T and merge by union
        rdd.unpersist();  // Now it seems not needed. (But needed actually)
    }
    // Here, rddUnion actually materializes, and needs all 10 rdds that
already unpersisted.
    // So, rebuilding all 10 rdds will occur.
    rddUnion.saveAsTextFile(mergedFileName);
}

If rddUnion can be materialized before the rdd.unpersist() line and
cache()d, the rdds in the loop will not be needed on
rddUnion.saveAsTextFile().

Now what is the best strategy?
- Do not unpersist all 10 rdds in the loop.
- Materialize rddUnion in the loop by calling 'light' action API, like
first().
- Give up and just rebuild/reload all 10 rdds when saving rddUnion.

Is there some misunderstanding?

Thanks.


Reply via email to