RE: Question about RDD cache, unpersist, materialization

2014-06-12 Thread innowireless TaeYun Kim
(I've clarified the statement (1) of my previous mail. See below.)

 

From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr] 
Sent: Friday, June 13, 2014 10:05 AM
To: user@spark.apache.org
Subject: RE: Question about RDD cache, unpersist, materialization

 

Currently I use rdd.count() to force computation, as Nick Pentreath suggested.

 

I think it would be nice to have a method that forcibly computes an RDD, so 
that RDDs which are no longer necessary can be safely unpersist()ed.

 

Consider a case where rdd_a is the parent of both:

(1) a short-term rdd_s that depends only on rdd_a (rdd_s may be a simple 
transformation of rdd_a), and

(2) a long-term rdd_t that also depends on RDDs that may be computed later 
(perhaps for some aggregation).

Usually rdd_s is computed early, and rdd_a is computed for it. But rdd_a then 
has to remain in memory until rdd_t is eventually computed (or be recomputed 
when rdd_t is computed).

It would be nice if rdd_t could be computed at the time rdd_s is computed, so 
that rdd_a could be unpersist()ed right away, since it would not be used 
anymore.

(Currently I use rdd_t.count() for that.)

 

sc.prune(), which was suggested in the related discussion you provided, is not 
helpful for this case, since rdd_a is still 'referenced' by the 'lazy' rdd_t.

(In other words: wake rdd_t up and compute it together with rdd_s, while rdd_a 
is still cached for rdd_s.)
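
To make this concrete, here is a minimal sketch of the pattern, simplified so 
that rdd_t depends only on rdd_a (the file names and transformations are 
hypothetical, and count() stands in for the "force computation" method I wish 
existed):

JavaRDD<String> rddA = sc.textFile("a.txt").cache();   // parent of both
JavaRDD<String> rddS = rddA.filter(s -> !s.isEmpty()); // short-term child
rddS.saveAsTextFile("s-out"); // rdd_s is computed early; rdd_a materializes too

JavaRDD<String> rddT = rddA.map(s -> s.toUpperCase()).cache(); // long-term child
rddT.count();     // force rdd_t now, while rdd_a is still cached
rddA.unpersist(); // rdd_a will not be used anymore; rdd_t stays cached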

 

 

Re: Question about RDD cache, unpersist, materialization

2014-06-12 Thread Nicholas Chammas
FYI: Here is a related discussion
<http://apache-spark-user-list.1001560.n3.nabble.com/Persist-and-unpersist-td6437.html>
about this.




RE: Question about RDD cache, unpersist, materialization

2014-06-12 Thread innowireless TaeYun Kim
Maybe it would be nice if unpersist() 'triggered' the computation of other 
RDDs that depend on this RDD but have not yet been computed.
The pseudocode could be as follows:

 

unpersist()
{
    if (this rdd has not been persisted)
        return;
    for (all rdds that depend on this rdd but are not yet computed)
        compute_that_rdd;
    do_actual_unpersist();
}
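
As far as I know, Spark does not expose an RDD's not-yet-computed dependents, 
so this cannot be implemented from user code exactly as written. A caller-side 
approximation (all names hypothetical) is to pass in the dependents you know 
about and force them with count() before unpersisting:

import java.util.List;
import org.apache.spark.api.java.JavaRDD;

final class CacheUtil {
    // Force each known dependent before actually unpersisting the parent.
    static <T> void computeDependentsThenUnpersist(JavaRDD<T> parent,
                                                   List<JavaRDD<?>> dependents) {
        for (JavaRDD<?> child : dependents) {
            child.count(); // materialize while the parent is still cached
        }
        parent.unpersist();
    }
}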



 




Re: Question about RDD cache, unpersist, materialization

2014-06-12 Thread Daniel Siegmann
I've run into this issue. The goal of caching / persisting seems to be to
avoid recomputing an RDD when its data will be needed multiple times.
However, once the downstream RDDs have been computed, the cache is no longer
needed. The current design provides no obvious way to detect when the cache
is no longer needed so that it can be discarded.

In the case of caching in memory, it may be handled by partitions being
dropped (in LRU order) when memory fills up. I need to do some more
experimentation to see whether this really works well, or whether allowing
memory to fill up causes performance issues or possibly OOM errors if data
isn't correctly freed.

In the case of persisting to disk, I'm not sure if there's a way to limit
the disk space used for caching. Does anyone know if there is such a
configuration option? This is a pressing issue for me - I have had jobs
fail because nodes ran out of disk space.




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


RE: Question about RDD cache, unpersist, materialization

2014-06-10 Thread Nick Pentreath
If you want to force materialization, use .count().


Also, if you can, simply don't unpersist anything unless you really need to 
free the memory.
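
A minimal illustration of the difference between count() and a "light" action
like first() (the input path is hypothetical):

JavaRDD<String> lines = sc.textFile("input.txt");
lines.cache(); // only marks the RDD for caching
lines.count(); // visits every partition, so the whole RDD is computed and cached
// lines.first() would be cheaper, but may compute (and cache) only one partition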
—
Sent from Mailbox


RE: Question about RDD cache, unpersist, materialization

2014-06-10 Thread innowireless TaeYun Kim
BTW, it is possible that rdd.first() does not compute all of the partitions.
So first() cannot be used for the situation below.

-Original Message-
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr] 
Sent: Wednesday, June 11, 2014 11:40 AM
To: user@spark.apache.org
Subject: Question about RDD cache, unpersist, materialization

Hi,

What I seem to know about the RDD persistence API is as follows:
- cache() and persist() are not actions. They only mark the RDD.
- unpersist() is also not an action. It only removes the marking. But if the
RDD is already in memory, it is unloaded.

And there seems to be no API to forcibly materialize an RDD without requesting
its data through an action method, for example first().
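
A minimal illustration of those semantics (the input path is hypothetical):

JavaRDD<String> rdd = sc.textFile("input.txt"); // nothing is computed yet
rdd.cache();          // not an action: only marks the RDD as to-be-cached
long n = rdd.count(); // an action: computes the RDD; its partitions are now cached
rdd.unpersist();      // removes the marking and unloads blocks already in memory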

So, I am faced with the following scenario.

{
    // Create an empty RDD for merging.
    JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>());
    for (int i = 0; i < 10; i++)
    {
        JavaRDD<String> rdd = sc.textFile(inputFileNames[i]);
        rdd.cache();  // Since it will be used twice, cache it.
        // Transform and save; rdd materializes here.
        rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]);
        // Do another transform to T and merge by union.
        rddUnion = rddUnion.union(rdd.map(...).filter(...));
        rdd.unpersist();  // Now it seems no longer needed. (But it actually is.)
    }
    // Here rddUnion actually materializes and needs all 10 rdds that were
    // already unpersisted, so all 10 rdds will be rebuilt.
    rddUnion.saveAsTextFile(mergedFileName);
}

If rddUnion could be materialized and cache()d before the rdd.unpersist()
line, the rdds in the loop would not be needed for
rddUnion.saveAsTextFile().

Now what is the best strategy?
- Do not unpersist the 10 rdds in the loop (a sketch of this option follows
below).
- Materialize rddUnion in the loop by calling a 'light' action API, like
first().
- Give up and just rebuild/reload all 10 rdds when saving rddUnion.
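
Here is a sketch of the first strategy, deferring every unpersist() until
after the final save (sc, inputFileNames, outputFileNames and mergedFileName
are as above; the trim/isEmpty lambdas are hypothetical stand-ins for the
map(...).filter(...) steps):

List<JavaRDD<String>> cached = new ArrayList<JavaRDD<String>>();
JavaRDD<String> rddUnion = sc.parallelize(new ArrayList<String>());
for (int i = 0; i < 10; i++)
{
    JavaRDD<String> rdd = sc.textFile(inputFileNames[i]);
    rdd.cache();
    rdd.map(s -> s.trim()).filter(s -> !s.isEmpty())
       .saveAsTextFile(outputFileNames[i]);
    rddUnion = rddUnion.union(rdd.map(s -> s.trim()).filter(s -> !s.isEmpty()));
    cached.add(rdd); // do NOT unpersist here
}
rddUnion.saveAsTextFile(mergedFileName);
for (JavaRDD<String> r : cached)
{
    r.unpersist(); // safe now: rddUnion no longer needs the cached data
}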

Is there some misunderstanding?

Thanks.