[jira] [Updated] (SPARK-1962) Add RDD cache reference counting

2014-05-29 Thread Taeyun Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taeyun Kim updated SPARK-1962:
--

Description: 
It would be nice if the RDD cache() method incorporated reference counting.

That is,

{code}
void test()
{
    JavaRDD... rdd = ...;

    rdd.cache();    // to depth 1. Actual caching happens.
    rdd.cache();    // to depth 2. No-op as long as the storage level is the same.
                    // Otherwise, an exception is thrown.

    ...

    rdd.uncache();  // to depth 1. No-op.
    rdd.uncache();  // to depth 0. Actual unpersist happens.
}
{code}

This can be useful when writing code in a modular way.
When a function receives an RDD as an argument, it doesn't necessarily know the 
cache status of that RDD.
It may still want to cache the RDD, since it will use it multiple times.
With the current RDD API, however, the function cannot determine whether it 
should unpersist the RDD afterwards or leave it alone (so that the caller can 
continue to use it without rebuilding).
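
The counting behaviour described above could be sketched outside of Spark roughly as follows. This is only an illustration under stated assumptions: `Persistable`, `RefCountedCache`, `retain()`, `release()`, and `FakeRdd` are hypothetical names invented for the sketch, not part of the RDD API, and the fake merely records how often the real persist/unpersist operations would run.

```java
/** Hypothetical stand-in for the two RDD operations the sketch needs. */
interface Persistable {
    void persist();
    void unpersist();
}

/** Hypothetical wrapper: counts cache requests, forwarding only the first and the last. */
final class RefCountedCache {
    private final Persistable target;
    private int refCount = 0;

    RefCountedCache(Persistable target) {
        this.target = target;
    }

    /** Count 0 -> 1 triggers the actual caching; further calls only increment. */
    synchronized int retain() {
        if (refCount == 0) {
            target.persist();
        }
        return ++refCount;
    }

    /** Count 1 -> 0 triggers the actual unpersist; earlier calls only decrement. */
    synchronized int release() {
        if (refCount == 0) {
            throw new IllegalStateException("release() without a matching retain()");
        }
        if (--refCount == 0) {
            target.unpersist();
        }
        return refCount;
    }
}

/** In-memory fake that records how often the underlying operations run. */
final class FakeRdd implements Persistable {
    int persistCalls = 0;
    int unpersistCalls = 0;

    public void persist() { persistCalls++; }
    public void unpersist() { unpersistCalls++; }
}

final class RefCountDemo {
    public static void main(String[] args) {
        FakeRdd rdd = new FakeRdd();
        RefCountedCache cache = new RefCountedCache(rdd);
        cache.retain();   // to count 1, persist runs
        cache.retain();   // to count 2, no-op
        cache.release();  // to count 1, no-op
        cache.release();  // to count 0, unpersist runs
        System.out.println(rdd.persistCalls + " persist, " + rdd.unpersistCalls + " unpersist");
        // prints "1 persist, 1 unpersist"
    }
}
```

Keeping the counter in a wrapper rather than in the RDD itself is just one possible design; it leaves the existing methods untouched, at the cost that every caller has to share the same wrapper instance.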


 Add RDD cache reference counting
 

 Key: SPARK-1962
 URL: https://issues.apache.org/jira/browse/SPARK-1962
 Project: Spark
  Issue Type: New Feature
Reporter: Taeyun Kim
Priority: Minor




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1962) Add RDD cache reference counting

2014-05-29 Thread Taeyun Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taeyun Kim updated SPARK-1962:
--

Affects Version/s: 1.0.0

 Add RDD cache reference counting
 

 Key: SPARK-1962
 URL: https://issues.apache.org/jira/browse/SPARK-1962
 Project: Spark
  Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: Taeyun Kim
Priority: Minor

 It would be nice if the RDD cache() method incorporated reference counting.
 That is,
 {code}
 void test()
 {
     JavaRDD... rdd = ...;
     rdd.cache();    // to reference count 1. Actual caching happens.
     rdd.cache();    // to reference count 2. No-op as long as the storage level
                     // is the same. Otherwise, an exception is thrown.
     ...
     rdd.uncache();  // to reference count 1. No-op.
     rdd.uncache();  // to reference count 0. Actual unpersist happens.
 }
 {code}
 This can be useful when writing code in a modular way.
 When a function receives an RDD as an argument, it doesn't necessarily know 
 the cache status of that RDD.
 It may still want to cache the RDD, since it will use it multiple times.
 With the current RDD API, however, the function cannot determine whether it 
 should unpersist the RDD afterwards or leave it alone (so that the caller can 
 continue to use it without rebuilding).
 For API compatibility, introducing a new method or adding a parameter may be 
 required.
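
One way the "new method" option might look is sketched below, purely as an assumption for discussion. The names `LevelPersistable`, `CountedHandle`, `cacheCounted()`, `uncacheCounted()`, `ModularCallee`, and `RecordingRdd` are hypothetical, and storage levels are passed as plain strings to keep the sketch free of Spark dependencies. It shows a callee that caches and releases without knowing the caller's cache state, and the exception on a mismatched storage level.

```java
/** Hypothetical stand-in for an RDD whose persist call takes a storage level name. */
interface LevelPersistable {
    void persist(String level);
    void unpersist();
}

/** Hypothetical handle offering the counted behaviour under new method names. */
final class CountedHandle {
    private final LevelPersistable target;
    private String level;      // level chosen by the first holder; null when uncached
    private int refCount = 0;

    CountedHandle(LevelPersistable target) {
        this.target = target;
    }

    synchronized void cacheCounted(String requested) {
        if (refCount > 0 && !level.equals(requested)) {
            throw new IllegalStateException(
                "already cached as " + level + ", requested " + requested);
        }
        if (refCount == 0) {
            target.persist(requested);  // actual caching happens only here
            level = requested;
        }
        refCount++;
    }

    synchronized void uncacheCounted() {
        if (refCount == 0) {
            throw new IllegalStateException("not cached");
        }
        if (--refCount == 0) {
            target.unpersist();         // actual unpersist happens only here
            level = null;
        }
    }

    synchronized int refCount() {
        return refCount;
    }
}

/** A callee that uses the data several times and leaves the caller's cache intact. */
final class ModularCallee {
    static void useTwice(CountedHandle handle) {
        handle.cacheCounted("MEMORY_ONLY");
        try {
            // placeholder for two passes over the data
        } finally {
            handle.uncacheCounted();
        }
    }
}

/** In-memory fake that records how often the underlying operations run. */
final class RecordingRdd implements LevelPersistable {
    int persistCalls = 0;
    int unpersistCalls = 0;

    public void persist(String level) { persistCalls++; }
    public void unpersist() { unpersistCalls++; }
}
```

If the caller already holds the handle when `useTwice` runs, the callee's pair of calls only moves the count from 1 to 2 and back, so the data stays cached for the caller. If the caller holds nothing, the same pair caches and then unpersists.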





[jira] [Updated] (SPARK-1962) Add RDD cache reference counting

2014-05-29 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1962:
---

Component/s: Spark Core

 Add RDD cache reference counting
 

 Key: SPARK-1962
 URL: https://issues.apache.org/jira/browse/SPARK-1962
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Taeyun Kim
Priority: Minor



