1. persist() can be used to save an RDD to memory or disk, depending on
the persistence level you choose.  Setting a persistence level on an RDD
does nothing by itself; when that RDD is evaluated, its data is saved to
memory, disk, or elsewhere so that it can be reused.  The level applies to
that RDD, so subsequent uses of the RDD can read the cached value instead
of recomputing it.

https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence
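
To make the timing concrete, here is a toy model in plain Python. It is
not Spark's API or implementation; the ToyRDD class and its evaluations
counter are made up for illustration. It only mimics the behaviour
described above: persist() marks the dataset, and the data is saved the
first time the dataset is actually evaluated.

```python
class ToyRDD:
    """Toy stand-in for an RDD: lazily computed, optionally cached."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing a list
        self._persist = False     # set by persist(); nothing is computed yet
        self._cached = None       # filled in on first evaluation
        self.evaluations = 0      # how many times compute actually ran

    def map(self, f):
        # A transformation only builds a new lazy dataset on top of this one.
        return ToyRDD(lambda: [f(x) for x in self._evaluate()])

    def persist(self):
        # Only marks the dataset; the data is saved on first evaluation.
        self._persist = True
        return self

    def _evaluate(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        data = self._compute()
        if self._persist:
            self._cached = data
        return data

    def collect(self):
        # An action: forces evaluation.
        return list(self._evaluate())


source = ToyRDD(lambda: [1, 2, 3])
doubled = source.map(lambda x: x * 2).persist()
evals_before_action = doubled.evaluations  # persist() alone computed nothing
first = doubled.collect()                  # computes and saves the result
second = doubled.collect()                 # served from the saved copy
```

In this sketch the second collect() does not recompute anything, which is
the "subsequent uses" case.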

2. The other place disk is most commonly used is shuffles.  If your data
is spread across the cluster and comes straight from a source, you might
not have to hold it all in memory at once.  But a shuffle redistributes
the data across the cluster in a particular way, so you need enough
memory/disk available for that RDD all at once.  For large RDDs, a shuffle
will therefore sometimes need to spill to disk, which can be controlled
with the spark.shuffle.spill setting.
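
As a rough picture of what "spilling" means, here is a toy group-by in
plain Python. It is not how Spark implements its shuffle; the function
name, the max_records_in_memory threshold, and the spill flag are all
hypothetical, with the flag loosely playing the role of
spark.shuffle.spill. When the in-memory buffer fills up, the partial
groups are written to a temporary file and merged back at the end.

```python
import os
import pickle
import tempfile
from collections import defaultdict


def group_by_key(pairs, max_records_in_memory, spill=True):
    """Group (key, value) pairs, spilling partial groups to disk when the
    in-memory buffer reaches max_records_in_memory records."""
    buffer = defaultdict(list)
    buffered = 0
    spill_files = []

    for key, value in pairs:
        buffer[key].append(value)
        buffered += 1
        if spill and buffered >= max_records_in_memory:
            # Write the partial groups to a temporary file and free memory.
            fd, path = tempfile.mkstemp(suffix=".spill")
            with os.fdopen(fd, "wb") as f:
                pickle.dump(dict(buffer), f)
            spill_files.append(path)
            buffer.clear()
            buffered = 0

    # Merge the spilled partial groups with whatever is still in memory.
    merged = defaultdict(list)
    for path in spill_files:
        with open(path, "rb") as f:
            for key, values in pickle.load(f).items():
                merged[key].extend(values)
        os.remove(path)
    for key, values in buffer.items():
        merged[key].extend(values)
    return dict(merged), len(spill_files)


pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
grouped, spills = group_by_key(pairs, max_records_in_memory=2)
grouped_no_spill, no_spills = group_by_key(
    pairs, max_records_in_memory=2, spill=False
)
```

Both calls produce the same groups; only the number of spill files
differs.  The final merge here reads everything back at once for
simplicity, which a real implementation would avoid.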

Does that help clarify?


On Mon, Apr 7, 2014 at 10:20 AM, Surendranauth Hiraman <
suren.hira...@velos.io> wrote:

> It might help if I clarify my questions. :-)
>
> 1. Is persist() applied during the transformation right before the
> persist() call in the graph? Or is it applied after the transform's
> processing is complete? In the case of things like GroupBy, is the Seq
> backed by disk as it is being created? We're trying to get a sense of how
> the processing is handled behind the scenes with respect to disk.
>
> 2. When else is disk used internally?
>
> Any pointers are appreciated.
>
> -Suren
>
>
>
>
> On Mon, Apr 7, 2014 at 8:46 AM, Surendranauth Hiraman <
> suren.hira...@velos.io> wrote:
>
>> Hi,
>>
>> Any thoughts on this? Thanks.
>>
>> -Suren
>>
>>
>>
>> On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman <
>> suren.hira...@velos.io> wrote:
>>
>>> Hi,
>>>
>>> I know if we call persist with the right options, we can have Spark
>>> persist an RDD's data on disk.
>>>
>>> I am wondering what happens in intermediate operations that could
>>> conceivably create large collections/Sequences, like GroupBy and shuffling.
>>>
>>> Basically, one part of the question is when is disk used internally?
>>>
>>> And is calling persist() on the RDD returned by such transformations
>>> what lets it know to use disk in those situations? Trying to understand if
>>> persist() is applied during the transformation or after it.
>>>
>>> Thank you.
>>>
>>>
>>> SUREN HIRAMAN, VP TECHNOLOGY
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR
>>> NEW YORK, NY 10001
>>> O: (917) 525-2466 ext. 105
>>> F: 646.349.4063
>>> E: suren.hiraman@velos.io
>>> W: www.velos.io
>>>
>>>
>>
>>
>
>
