RE: Intermedate stage will be cached automatically ?

2015-06-17 Thread Mark Tse
I think https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence might shed some light on the behaviour you’re seeing.

Mark

From: canan chen [mailto:ccn...@gmail.com]
Sent: June-17-15 5:57 AM
To: spark users
Subject: Intermedate stage will be cached automatically ?

Here's

Re: Intermedate stage will be cached automatically ?

2015-06-17 Thread Eugen Cepoi
Cache is more general. reduceByKey involves a shuffle step where the data will be in memory and on disk (for whatever doesn't fit in memory). The shuffle files will remain around until the end of the job. The blocks in memory will be dropped if memory is needed for other things. This is an

Re: Intermedate stage will be cached automatically ?

2015-06-17 Thread canan chen
Yes, actually on the Storage UI there's no data cached. But the behavior confuses me. If I call the cache method as follows, the behavior is the same as without calling the cache method. Why is that?

val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
data.cache()
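
For readers following the thread, here is a minimal sketch of one way to check what cache() actually does to the RDD from the question. It assumes a local master (local[2]) and a hypothetical object name and app name, none of which come from the original messages. It only inspects the storage level and the set of RDDs marked as persistent; it does not measure timing.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical wrapper object; the name is an assumption for illustration.
object CacheVsShuffleSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally with two threads.
    val conf = new SparkConf()
      .setAppName("cache-vs-shuffle-sketch")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Same pipeline as in the question.
      val data = sc.parallelize(1 to 10, 2)
        .map(e => (e % 2, 2))
        .reduceByKey(_ + _, 2)

      // Before cache(): the RDD has no storage level assigned.
      println(s"marked before cache(): ${data.getStorageLevel != StorageLevel.NONE}")

      // cache() only marks the RDD for MEMORY_ONLY storage; it is lazy,
      // so no blocks are stored until an action computes the partitions.
      data.cache()
      println(s"marked after cache(): ${data.getStorageLevel == StorageLevel.MEMORY_ONLY}")

      // The first action computes the RDD and stores its blocks;
      // later actions can read those blocks instead of recomputing.
      data.count()
      data.count()

      // RDDs marked as persistent are tracked by the SparkContext.
      println(s"tracked as persistent: ${sc.getPersistentRDDs.contains(data.id)}")
    } finally {
      sc.stop()
    }
  }
}
```

While the application is running, the Storage tab of the web UI should list the RDD once the first count() has completed. Without the cache() call, a second count() can still reuse the shuffle output of reduceByKey, which is the reuse of shuffle files described elsewhere in this thread.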

Re: Intermedate stage will be cached automatically ?

2015-06-17 Thread ayan guha
It's not cached per se. For example, you will not see this in the Storage tab in the UI. However, Spark has read the data and it's in memory right now. So, the next count call should be very fast.

Best
Ayan

On Wed, Jun 17, 2015 at 10:21 PM, Mark Tse mark@d2l.com wrote: I think