Re: Understanding and optimizing spark disk usage during a job.

2014-11-29 Thread Vikas Agarwal
I may not be correct (in fact I may be completely wrong), but here is my
guess:

Assuming 8 bytes per double, 4000 vectors of dimension 400 for each of 12k
images would require 153.6 GB (12k * 4000 * 400 * 8 bytes), which may explain
the amount of data written to disk. Without compression, it would use roughly
that much space. You can also cross-check the storage level of your RDDs; the
default is MEMORY_ONLY. If it is also spilling data to disk, that would
further increase the storage needed.
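
For reference, a hedged sketch of that cross-check (features stands in for
the RDD in question; MEMORY_AND_DISK_SER is just one option for shrinking the
footprint, not something from the original job):

import org.apache.spark.storage.StorageLevel

// The storage level is NONE until persist()/cache() is called.
println(features.getStorageLevel)

// Store partitions as serialized blocks and spill to disk when memory runs
// out; setting spark.rdd.compress=true in the SparkConf additionally
// compresses the serialized blocks.
features.persist(StorageLevel.MEMORY_AND_DISK_SER)
features.count() // the first action materializes the RDD at this level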



-- 
Regards,
Vikas Agarwal
91 – 9928301411

InfoObjects, Inc.
Execution Matters
http://www.infoobjects.com
2041 Mission College Boulevard, #280
Santa Clara, CA 95054
+1 (408) 988-2000 Work
+1 (408) 716-2726 Fax


Understanding and optimizing spark disk usage during a job.

2014-11-28 Thread Jaonary Rabarisoa
Dear all,

I have a job that crashes before its end because of no space left on
device, and I noticed that this job generates a lot of temporary data on
my disk.

To be precise, the job is a simple map job that takes a set of images,
extracts local features and saves these local features as a sequence file.
My images are represented as key-value pairs, where the keys are strings
representing the image id (the filename) and the values are the base64
encodings of the images.

To extract the features, I use an external C program that I call with
RDD.pipe. I stream the base64 image to the C program and it sends back the
extracted feature vectors through stdout. Each line represents one feature
vector from the current image. I don't use any serialization library; I
just write the feature vector elements to stdout separated by spaces.
Once back in Spark, I split each line, create a Scala vector from the
values, and save my sequence file.

The overall job looks like the following:

val images: RDD[(String, String)] = ...
val features: RDD[(String, Vector)] = images.pipe(...).map(_.split(" ")...)
features.saveAsSequenceFile(...)
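
Fleshing that out, here is a minimal end-to-end sketch (hypothetical paths
and program name, and it assumes an existing SparkContext sc; it also assumes
the C program echoes the image id back as the first token of every output
line, since pipe() alone does not preserve the key):

import org.apache.spark.rdd.RDD

val images: RDD[(String, String)] =
  sc.sequenceFile[String, String]("hdfs:///images")

val features: RDD[(String, Vector[Double])] = images
  .map { case (id, b64) => s"$id $b64" } // one stdin line per image
  .pipe("./extract_features")            // one stdout line per feature vector
  .map { line =>
    val tokens = line.split(" ")
    (tokens.head, tokens.tail.map(_.toDouble).toVector)
  }

// saveAsSequenceFile needs Writable-convertible key/value types, so the
// vector is re-serialized as text here; saveAsObjectFile would also work.
features
  .mapValues(_.mkString(" "))
  .saveAsSequenceFile("hdfs:///features")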

The problem is that for about 3 GB of image data (about 12000 images) this
job generates more than 180 GB of temporary data. This seems strange, since
for each image I get about 4000 double-precision feature vectors of
dimension 400.

I run the job on my laptop for testing purposes, which is why I can't add
additional disk space. In any case, I need to understand why this simple job
generates so much data and how I can reduce it.


Best,

Jao


Re: Spark Disk Usage

2014-04-09 Thread Surendranauth Hiraman
Andrew,

Thanks a lot for the pointer to the code! This has answered my question.

Looks like it tries to write to memory first and then, if it doesn't fit,
spills to disk. I'll have to dig in more to figure out the details.

-Suren




Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
The groupByKey would be aware of the subsequent persist -- that's part of
the reason why operations are lazy.  As for whether it's materialized in
memory first and then flushed to disk vs. streamed to disk, I'm not sure of
the exact behavior.

What I'd expect to happen would be that the RDD is materialized in memory
up until it fills up the BlockManager.  At that point it starts spilling
blocks out to disk in order to keep from OOM'ing.  I'm not sure if new
blocks go straight to disk or if the BlockManager pages already-existing
blocks out in order to make room for new blocks.

You can always read through the source to figure it out though!

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L588
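
A hedged way to see this from the application side (pairs stands in for any
large pair RDD; none of these names are from the thread):

import org.apache.spark.storage.StorageLevel

// persist() only marks the RDD -- nothing runs until an action fires, which
// is why the groupByKey can see the subsequent persist setting.
val grouped = pairs.groupByKey().persist(StorageLevel.MEMORY_AND_DISK)

grouped.count() // materializes the RDD; blocks that don't fit in the
                // BlockManager's memory are written out to disk

// The web UI's Storage tab (or getStorageLevel) shows how much of the RDD
// ended up in memory vs. on disk.
println(grouped.getStorageLevel)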





Re: Spark Disk Usage

2014-04-09 Thread Surendranauth Hiraman
Yes, MEMORY_AND_DISK.

We do a groupByKey and then call persist on the resulting RDD. So I'm
wondering whether groupByKey is aware of the subsequent persist setting to
use disk, or whether it just creates the Seq[V] in memory and only uses disk
after that data structure is fully realized in memory.

-Suren





Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
Which persistence level are you talking about? MEMORY_AND_DISK?

Sent from my mobile phone
On Apr 9, 2014 2:28 PM, "Surendranauth Hiraman" 
wrote:



Re: Spark Disk Usage

2014-04-09 Thread Surendranauth Hiraman
Thanks, Andrew. That helps.

For 1, it sounds like the data for the RDD is held in memory and then only
written to disk after the entire RDD has been realized in memory. Is that
correct?

-Suren





Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
For 1, persist can be used to save an RDD to disk using the various
persistence levels.  When a persistence level is set on an RDD, when that
RDD is evaluated it's saved to memory/disk/elsewhere so that it can be
re-used.  It's applied to that RDD, so that subsequent uses of the RDD can
use the cached value.

https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence

2. The other place disk is most commonly used is shuffles.  If you have
data across the cluster that comes from a source, then you might not have
to hold it all in memory at once.  But if you do a shuffle, which scatters
the data across the cluster in a certain way, then you have to have the
memory/disk available for that RDD all at once.  In that case, shuffles
will sometimes need to spill over to disk for large RDDs, which can be
controlled with the spark.shuffle.spill setting.
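
As a minimal configuration sketch (hedged; the app name and path are
placeholders, not from the thread) -- spark.shuffle.spill toggles whether
shuffles may spill, and spark.local.dir controls where shuffle and spill
files land:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("disk-usage-demo")
  .set("spark.shuffle.spill", "true")       // let shuffles spill to disk instead of OOM'ing
  .set("spark.local.dir", "/mnt/spark-tmp") // where shuffle and spill files are written
val sc = new SparkContext(conf)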

Does that help clarify?


>


Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
It might help if I clarify my questions. :-)

1. Is persist() applied during the transformation right before the
persist() call in the graph? Or is it applied after the transform's
processing is complete? In the case of things like GroupBy, is the Seq
backed by disk as it is being created? We're trying to get a sense of how
the processing is handled behind the scenes with respect to disk.

2. When else is disk used internally?

Any pointers are appreciated.

-Suren






Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
Hi,

Any thoughts on this? Thanks.

-Suren





Spark Disk Usage

2014-04-03 Thread Surendranauth Hiraman
Hi,

I know if we call persist with the right options, we can have Spark persist
an RDD's data on disk.

I am wondering what happens in intermediate operations that could
conceivably create large collections/Sequences, like GroupBy and shuffling.

Basically, one part of the question is when is disk used internally?

And is calling persist() on the RDD returned by such transformations what
lets it know to use disk in those situations? Trying to understand if
persist() is applied during the transformation or after it.

Thank you.


SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io