Re: Spark Disk Usage

Andrew Ash Wed, 09 Apr 2014 06:48:13 -0700

Which persistence level are you talking about? MEMORY_AND_DISK?
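
On the follow-up question quoted below: as the programming guide describes MEMORY_AND_DISK, partitions that don't fit in memory are stored on disk. That suggests the memory-or-disk choice is made partition by partition as each one is computed, not after the whole RDD has been realized in memory. Below is a toy Python model of that reading. It is not Spark code and not Spark's internals: ToyRDD, memory_budget and the record-count accounting are all made up for illustration (real Spark measures block sizes in bytes).

```python
# Toy model only: NOT Spark code. Every name here is hypothetical.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyRDD:
    """Stand-in for an RDD persisted with a MEMORY_AND_DISK-like level."""
    compute_partition: Callable[[int], List[int]]  # builds one partition
    num_partitions: int
    memory_budget: int  # assumed limit: max records the toy "memory" holds
    persisted: bool = False
    memory: Dict[int, List[int]] = field(default_factory=dict)
    disk: Dict[int, List[int]] = field(default_factory=dict)
    computations: int = 0  # how many partitions were actually computed

    def persist(self) -> "ToyRDD":
        # Only sets a flag; nothing is computed or stored at this point.
        self.persisted = True
        return self

    def _partition(self, i: int) -> List[int]:
        # Reuse a stored copy if one exists, wherever it lives.
        if i in self.memory:
            return self.memory[i]
        if i in self.disk:
            return self.disk[i]
        data = self.compute_partition(i)
        self.computations += 1
        if self.persisted:
            # The placement decision is made for this one partition,
            # as soon as it has been computed.
            used = sum(len(p) for p in self.memory.values())
            if used + len(data) <= self.memory_budget:
                self.memory[i] = data
            else:
                self.disk[i] = data
        return data

    def count(self) -> int:
        # An "action": forces every partition to be evaluated.
        return sum(len(self._partition(i)) for i in range(self.num_partitions))


# Four partitions of 10 records each, with room for 25 records in "memory".
rdd = ToyRDD(lambda i: list(range(i * 10, i * 10 + 10)),
             num_partitions=4, memory_budget=25).persist()
rdd.count()  # first action: computes all four partitions and stores them
rdd.count()  # second action: served entirely from the stored copies
```

In this toy run, partitions 0 and 1 land in memory and partitions 2 and 3 land on disk, and the second count() computes nothing new. Whether real Spark behaves the same way in every detail depends on the version and the storage level, hence the question above.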

On Apr 9, 2014 2:28 PM, "Surendranauth Hiraman" <suren.hira...@velos.io>
wrote:


> Thanks, Andrew. That helps.
>
> For 1, it sounds like the data for the RDD is held in memory and then only
> written to disk after the entire RDD has been realized in memory. Is that
> correct?
>
> -Suren
>
>
>
> On Wed, Apr 9, 2014 at 9:25 AM, Andrew Ash <and...@andrewash.com> wrote:
>
>> For 1, persist can be used to save an RDD to disk using the various
>> persistence levels.  When a persistence level is set on an RDD, the RDD is
>> saved to memory/disk/elsewhere the first time it is evaluated, so that
>> subsequent uses of that RDD can use the cached value instead of
>> recomputing it.
>>
>>
>> https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence
>>
>> 2. The other place disk is most commonly used is shuffles.  If you have
>> data across the cluster that comes from a source, then you might not have
>> to hold it all in memory at once.  But if you do a shuffle, which scatters
>> the data across the cluster in a certain way, then you have to have the
>> memory/disk available for that RDD all at once.  In that case, shuffles
>> will sometimes need to spill over to disk for large RDDs, which can be
>> controlled with the spark.shuffle.spill setting.
>>
>> Does that help clarify?
>>
>>
>> On Mon, Apr 7, 2014 at 10:20 AM, Surendranauth Hiraman <
>> suren.hira...@velos.io> wrote:
>>
>>> It might help if I clarify my questions. :-)
>>>
>>> 1. Is persist() applied during the transformation right before the
>>> persist() call in the graph? Or is it applied after the transform's
>>> processing is complete? In the case of things like GroupBy, is the Seq
>>> backed by disk as it is being created? We're trying to get a sense of how
>>> the processing is handled behind the scenes with respect to disk.
>>>
>>> 2. When else is disk used internally?
>>>
>>> Any pointers are appreciated.
>>>
>>> -Suren
>>>
>>>
>>>
>>>
>>> On Mon, Apr 7, 2014 at 8:46 AM, Surendranauth Hiraman <
>>> suren.hira...@velos.io> wrote:
>>>
>>>> Hi,
>>>>
>>>> Any thoughts on this? Thanks.
>>>>
>>>> -Suren
>>>>
>>>>
>>>>
>>>> On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman <
>>>> suren.hira...@velos.io> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I know if we call persist with the right options, we can have Spark
>>>>> persist an RDD's data on disk.
>>>>>
>>>>> I am wondering what happens in intermediate operations that could
>>>>> conceivably create large collections/Sequences, like GroupBy and 
>>>>> shuffling.
>>>>>
>>>>> Basically, one part of the question is when is disk used internally?
>>>>>
>>>>> And is calling persist() on the RDD returned by such transformations
>>>>> what lets it know to use disk in those situations? Trying to understand
>>>>> if persist() is applied during the transformation or after it.
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>> SUREN HIRAMAN, VP TECHNOLOGY
>>>>> Velos
>>>>> Accelerating Machine Learning
>>>>>
>>>>> 440 NINTH AVENUE, 11TH FLOOR
>>>>> NEW YORK, NY 10001
>>>>> O: (917) 525-2466 ext. 105
>>>>> F: 646.349.4063
>>>>> E: suren.hira...@velos.io
>>>>> W: www.velos.io
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
