Cache is more general. reduceByKey involves a shuffle step, where the data
sits in memory and on disk (for whatever doesn't fit in memory). The
shuffle files stick around until the end of the job; the blocks in memory
are dropped if the memory is needed for something else. This is an
optimisation so that other RDDs depending on the result of this shuffle
don't have to recompute the whole lineage: they just fetch the shuffle
blocks from memory/disk.
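
For instance (a minimal sketch of the behaviour above, assuming a plain
local SparkContext named sc; the stage counts are what I'd expect to see
in the UI, the snippet itself doesn't print them):

    val data = sc.parallelize(1 to 10, 2)
      .map(e => (e % 2, 2))
      .reduceByKey(_ + _, 2)  // introduces a shuffle

    println(data.count())  // job 1: two stages (map + reduce)
    println(data.count())  // job 2: the map stage shows up as "skipped";
                           // the reduce stage fetches the existing shuffle
                           // blocks instead of recomputing the lineage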

Calling cache in this example gives nearly the same result (I guess there
are some implementation-specific differences). But if there weren't a
shuffle step, cache would be the only thing explicitly persisting this
dataset; and note that it doesn't persist to disk unless you tell it to.
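
To spell out that last point (again just a sketch; StorageLevel is the
standard org.apache.spark.storage.StorageLevel):

    import org.apache.spark.storage.StorageLevel

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
    // partitions that don't fit in memory are recomputed when needed,
    // they are not spilled to disk.
    data.cache()

    // Spilling to disk has to be requested explicitly. Spark won't let
    // you change the level once one is set, so this would go on a fresh
    // RDD rather than after the cache() call above:
    // data.persist(StorageLevel.MEMORY_AND_DISK)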

Eugen

2015-06-17 15:10 GMT+02:00 canan chen <ccn...@gmail.com>:

> Yes, actually on the storage ui, there's no data cached. But the behavior
> confuses me. If I call the cache method as follows, the behavior is the
> same as without calling it. Why is that?
>
>
> val data = sc.parallelize(1 to 10, 2).map(e=>(e%2,2)).reduceByKey(_ + _, 2)
> data.cache()
> println(data.count())
> println(data.count())
>
>
>
> On Wed, Jun 17, 2015 at 8:45 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> It's not cached per se. For example, you will not see this in the Storage
>> tab in the UI. However, Spark has read the data and it's in memory right
>> now, so the next count call should be very fast.
>>
>>
>> Best
>> Ayan
>>
>> On Wed, Jun 17, 2015 at 10:21 PM, Mark Tse <mark....@d2l.com> wrote:
>>
>>>  I think
>>> https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
>>> might shed some light on the behaviour you’re seeing.
>>>
>>>
>>>
>>> Mark
>>>
>>>
>>>
>>> From: canan chen [mailto:ccn...@gmail.com]
>>> Sent: June-17-15 5:57 AM
>>> To: spark users
>>> Subject: Intermediate stage will be cached automatically ?
>>>
>>>
>>>
>>> Here's one simple Spark example where I call RDD#count 2 times. The first
>>> time it invokes 2 stages, but the second one only needs 1 stage. It seems
>>> the first stage is cached. Is that true? Is there any flag to control
>>> whether the intermediate stage is cached?
>>>
>>>
>>>     val data = sc.parallelize(1 to 10, 2).map(e=>(e%2,2)).reduceByKey(_ + _, 2)
>>>     println(data.count())
>>>     println(data.count())
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
