I think
https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
might shed some light on the behaviour you’re seeing.
Mark
From: canan chen [mailto:ccn...@gmail.com]
Sent: June-17-15 5:57 AM
To: spark users
Subject: Intermediate stage will be cached automatically?
Here's
Cache is more general. ReduceByKey involves a shuffle step where the data
will be in memory and on disk (for what doesn't fit in memory). The
shuffle files will remain around until the end of the job. The blocks from
memory will be dropped if memory is needed for other things. This is an
Yes, actually on the storage UI there's no data cached. But the behavior
confuses me. If I call the cache method as follows, the behavior is the
same as without calling the cache method. Why is that?
val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
data.cache()
It's not cached per se. For example, you will not see this in the Storage tab in
the UI. However, Spark has read the data and it's in memory right now. So the
next count call should be very fast.
Best
Ayan
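
For anyone trying to reproduce the difference discussed above, here is a minimal sketch, not a definitive test. It assumes a standalone local run; the object name, app name and local[2] master are made up for illustration, and the implicit pair-RDD conversion used by reduceByKey is assumed to be in scope automatically (Spark 1.3 or later). The point it tries to show: cache() only marks the RDD, the first action materializes it, and later jobs can skip the map stage by reusing shuffle files whether or not cache() was called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheVsShuffleReuse {
  def main(args: Array[String]): Unit = {
    // Assumption: app name and local[2] master are illustrative only.
    val conf = new SparkConf().setAppName("cache-vs-shuffle-reuse").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Same pipeline as in the question.
    val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)

    // Before cache(): the RDD has no storage level assigned.
    println(s"before cache(): ${data.getStorageLevel.description}")

    // cache() is lazy: it only marks the RDD for in-memory storage.
    data.cache()
    println(s"after cache():  ${data.getStorageLevel.description}")

    // The first action runs the shuffle and stores the resulting partitions.
    data.count()

    // The lineage string should now report the cached partitions.
    println(data.toDebugString)

    // A second action can be served from the cached blocks; without cache(),
    // it would still skip the map stage by reusing the shuffle files.
    data.count()

    data.unpersist()
    sc.stop()
  }
}
```

Comparing the Storage tab and the stage list in the UI after each count(), with and without the data.cache() line, should make the two mechanisms easier to tell apart.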
On Wed, Jun 17, 2015 at 10:21 PM, Mark Tse <mark@d2l.com> wrote:
I think