Cache is more general. reduceByKey involves a shuffle step, during which the data is kept in memory and spilled to disk (for whatever doesn't fit in memory). The shuffle files remain around until the end of the job; the blocks held in memory are dropped if memory is needed for other things. This is an optimisation: other RDDs that depend on the result of this shuffle don't have to recompute the whole lineage chain, they just fetch the shuffle blocks from memory/disk.
Calling cache in this example gives nearly the same result (I guess there are some implementation-specific differences). But if there weren't a shuffle step, then cache would explicitly persist this dataset, though not on disk unless you tell it to.

Eugen

2015-06-17 15:10 GMT+02:00 canan chen <ccn...@gmail.com>:

> Yes, actually on the Storage UI there's no data cached. But the behavior
> confuses me. If I call the cache method as follows, the behavior is the
> same as without calling cache. Why is that?
>
> val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
> data.cache()
> println(data.count())
> println(data.count())
>
> On Wed, Jun 17, 2015 at 8:45 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> It's not cached per se. For example, you will not see it in the Storage tab
>> in the UI. However, Spark has read the data and it's in memory right now, so
>> the next count call should be very fast.
>>
>> Best
>> Ayan
>>
>> On Wed, Jun 17, 2015 at 10:21 PM, Mark Tse <mark....@d2l.com> wrote:
>>
>>> I think
>>> https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
>>> might shed some light on the behaviour you're seeing.
>>>
>>> Mark
>>>
>>> *From:* canan chen [mailto:ccn...@gmail.com]
>>> *Sent:* June-17-15 5:57 AM
>>> *To:* spark users
>>> *Subject:* Intermediate stage will be cached automatically?
>>>
>>> Here's a simple Spark example where I call RDD#count twice. The first
>>> call invokes 2 stages, but the second one needs only 1 stage. It seems
>>> the first stage's output is cached. Is that true? Is there any flag to
>>> control whether the intermediate stage is cached?
>>>
>>> val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
>>> println(data.count())
>>> println(data.count())
>>
>> --
>> Best Regards,
>> Ayan Guha
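To make the behavior above concrete, here is a minimal self-contained sketch (the object name and the `local[2]` master are my own choices for illustration; it assumes spark-core is on the classpath). With or without cache, the second count skips the map stage because the shuffle output written by the first count is reused; an explicit persist is what actually makes the dataset appear in the Storage tab of the UI:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ShuffleReuseDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shuffle-reuse").setMaster("local[2]"))

    // 10 elements, keyed by parity, values summed per key => 2 result rows
    val data = sc.parallelize(1 to 10, 2)
      .map(e => (e % 2, 2))
      .reduceByKey(_ + _, 2)

    println(data.count()) // first action: runs the map stage and the shuffle stage
    println(data.count()) // second action: map stage skipped, shuffle output reused

    // Only an explicit persist shows up in the Storage tab:
    data.persist(StorageLevel.MEMORY_AND_DISK)
    println(data.count()) // this action materializes the persisted blocks

    sc.stop()
  }
}
```

`data.toDebugString` can be printed to inspect the lineage and confirm the `ShuffledRDD` boundary where the stage split occurs.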