Yes, actually on the Storage UI there is no data cached. But the behavior confuses me: if I call the cache method as follows, the behavior is the same as without calling cache. Why is that?
val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
data.cache()
println(data.count())
println(data.count())

On Wed, Jun 17, 2015 at 8:45 PM, ayan guha <guha.a...@gmail.com> wrote:

> It's not cached per se. For example, you will not see this in the Storage
> tab in the UI. However, Spark has read the data and it's in memory right
> now. So the next count call should be very fast.
>
> Best
> Ayan
>
> On Wed, Jun 17, 2015 at 10:21 PM, Mark Tse <mark....@d2l.com> wrote:
>
>> I think
>> https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
>> might shed some light on the behaviour you're seeing.
>>
>> Mark
>>
>> From: canan chen [mailto:ccn...@gmail.com]
>> Sent: June-17-15 5:57 AM
>> To: spark users
>> Subject: Intermediate stage will be cached automatically?
>>
>> Here's one simple Spark example where I call RDD#count twice. The first
>> time it invokes 2 stages, but the second one only needs 1 stage. It seems
>> the first stage is cached. Is that true? Is there any flag to control
>> whether the intermediate stage is cached?
>>
>> val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
>> println(data.count())
>> println(data.count())
>
> --
> Best Regards,
> Ayan Guha
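One thing worth keeping in mind when reading the code above: cache() is only a lazy marker. It does not store anything by itself; the RDD is materialized (and only then shows up in the Storage tab) the first time an action such as count() runs. The sketch below is not Spark's actual internals, just a hypothetical FakeRdd class I made up to illustrate the "mark now, materialize on first action" semantics:

```scala
object LazyCacheSketch {
  // Hypothetical stand-in for an RDD: the computation runs only when an
  // action forces it, and cache() merely sets a flag. The real storage
  // happens during the first action, mirroring Spark's lazy semantics.
  final class FakeRdd[A](compute: () => Seq[A]) {
    private var cacheEnabled = false
    private var cached: Option[Seq[A]] = None
    var computeCount = 0 // how many times the underlying computation ran

    // Lazy: nothing is stored yet, just like RDD.cache()
    def cache(): this.type = { cacheEnabled = true; this }

    // An "action": computes (or reads the cache) and returns the size
    def count(): Int = {
      val data = cached.getOrElse {
        computeCount += 1
        val d = compute()
        if (cacheEnabled) cached = Some(d) // first action fills the cache
        d
      }
      data.size
    }
  }

  def main(args: Array[String]): Unit = {
    val data = new FakeRdd(() =>
      (1 to 10).map(e => (e % 2, 2)).groupBy(_._1).toSeq
    )
    data.cache()               // only marks; "Storage" would still be empty here
    println(data.count())      // first action computes and caches -> 2
    println(data.count())      // second action hits the cache -> 2
    println(data.computeCount) // the computation ran exactly once -> 1
  }
}
```

So in the original snippet the empty Storage tab before the first count() is expected; the surprising one-stage second count() is a separate effect of Spark reusing shuffle output rather than of cache().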