Re: Spark Disk Usage

Surendranauth Hiraman Wed, 09 Apr 2014 06:53:49 -0700

Yes, MEMORY_AND_DISK.

We do a groupByKey and then call persist on the resulting RDD. So I'm
wondering if groupByKey is aware of the subsequent persist setting to use
disk or just creates the Seq[V] in memory and only uses disk after that
data structure is fully realized in memory.
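
A minimal sketch of the pattern in question, for reference. The object name, the local master, and the sample data are hypothetical and only there to make the example self-contained; spark.shuffle.spill is the setting mentioned in the quoted reply below. The sketch shows only that persist() marks the RDD and that nothing is computed or stored until an action runs. It does not by itself settle whether the grouped values are first built fully in memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on Spark 0.9/1.x)
import org.apache.spark.storage.StorageLevel

// Hypothetical driver program; names and data are illustrative only.
object GroupThenPersist {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("group-then-persist")
      // Shuffle spill setting named in this thread (assumed to be honored
      // by the Spark version in use).
      .set("spark.shuffle.spill", "true")
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey introduces a shuffle; persist only records the storage level.
    val grouped = pairs.groupByKey().persist(StorageLevel.MEMORY_AND_DISK)

    // First action: the shuffle runs and the resulting partitions are stored.
    println(grouped.count())
    // Later actions can reuse the persisted partitions instead of recomputing.
    grouped.collect().foreach(println)

    sc.stop()
  }
}
```

Running this under a small executor memory setting and watching the storage tab of the web UI is one way to see which partitions end up in memory and which on disk.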


-Suren



On Wed, Apr 9, 2014 at 9:46 AM, Andrew Ash <[email protected]> wrote:

> Which persistence level are you talking about? MEMORY_AND_DISK ?
>
> Sent from my mobile phone
> On Apr 9, 2014 2:28 PM, "Surendranauth Hiraman" <[email protected]>
> wrote:
>
>> Thanks, Andrew. That helps.
>>
>> For 1, it sounds like the data for the RDD is held in memory and then
>> only written to disk after the entire RDD has been realized in memory. Is
>> that correct?
>>
>> -Suren
>>
>>
>>
>> On Wed, Apr 9, 2014 at 9:25 AM, Andrew Ash <[email protected]> wrote:
>>
>>> For 1, persist can be used to save an RDD to disk using the various
>>> persistence levels.  When a persistence level is set on an RDD, the RDD
>>> is saved to memory/disk/elsewhere the first time it is evaluated, so that
>>> subsequent uses of that RDD can use the cached value.
>>>
>>>
>>> https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence
>>>
>>> 2. The other place disk is most commonly used is shuffles.  If you have
>>> data across the cluster that comes from a source, then you might not have
>>> to hold it all in memory at once.  But if you do a shuffle, which scatters
>>> the data across the cluster in a certain way, then you have to have the
>>> memory/disk available for that RDD all at once.  In that case, shuffles
>>> will sometimes need to spill over to disk for large RDDs, which can be
>>> controlled with the spark.shuffle.spill setting.
>>>
>>> Does that help clarify?
>>>
>>>
>>> On Mon, Apr 7, 2014 at 10:20 AM, Surendranauth Hiraman <
>>> [email protected]> wrote:
>>>
>>>> It might help if I clarify my questions. :-)
>>>>
>>>> 1. Is persist() applied during the transformation right before the
>>>> persist() call in the graph? Or is it applied after the transform's
>>>> processing is complete? In the case of things like GroupBy, is the Seq
>>>> backed by disk as it is being created? We're trying to get a sense of how
>>>> the processing is handled behind the scenes with respect to disk.
>>>>
>>>> 2. When else is disk used internally?
>>>>
>>>> Any pointers are appreciated.
>>>>
>>>> -Suren
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Apr 7, 2014 at 8:46 AM, Surendranauth Hiraman <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Any thoughts on this? Thanks.
>>>>>
>>>>> -Suren
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I know if we call persist with the right options, we can have Spark
>>>>>> persist an RDD's data on disk.
>>>>>>
>>>>>> I am wondering what happens in intermediate operations that could
>>>>>> conceivably create large collections/Sequences, like GroupBy and 
>>>>>> shuffling.
>>>>>>
>>>>>> Basically, one part of the question is when is disk used internally?
>>>>>>
>>>>>> And is calling persist() on the RDD returned by such transformations
>>>>>> what lets it know to use disk in those situations? Trying to understand
>>>>>> if persist() is applied during the transformation or after it.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>>
>>>>>> SUREN HIRAMAN, VP TECHNOLOGY
>>>>>> Velos
>>>>>> Accelerating Machine Learning
>>>>>>
>>>>>> 440 NINTH AVENUE, 11TH FLOOR
>>>>>> NEW YORK, NY 10001
>>>>>> O: (917) 525-2466 ext. 105
>>>>>> F: 646.349.4063
>>>>>> E: suren.hiraman@velos.io
>>>>>> W: www.velos.io
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>

