Re: Understanding and optimizing Spark disk usage during a job.
I may not be correct (in fact, I may be completely off), but here is my guess: assuming 8 bytes per double, 4000 vectors of dimension 400 for each of 12k images would require 153.6 GB (12k * 4000 * 400 * 8 bytes), which may account for the amount of data written to disk. Without compression, the job would use roughly that much space. You can also cross-check the storage level of your RDDs; the default for persist() is MEMORY_ONLY. If Spark is also spilling data to disk, that would further increase the storage needed.

On Fri, Nov 28, 2014 at 10:43 PM, Jaonary Rabarisoa wrote:
> [quoted text elided; the original message appears below]

--
Regards,
Vikas Agarwal
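For reference, a minimal Scala sketch of that back-of-the-envelope arithmetic and the storage-level cross-check (assumptions: an existing SparkContext named sc; the serialized storage level at the end is just one illustrative way to shrink the footprint, not something suggested in the thread):

import org.apache.spark.storage.StorageLevel

// Back-of-the-envelope: 12k images * 4000 vectors * 400 doubles * 8 bytes each.
val estimatedBytes = 12000L * 4000L * 400L * 8L
println(f"Estimated raw feature data: ${estimatedBytes / 1e9}%.1f GB") // ~153.6 GB

// Cross-check an RDD's storage level. An RDD that was never persisted
// reports StorageLevel.NONE; persist() and cache() default to MEMORY_ONLY.
val rdd = sc.parallelize(1 to 100)
println(rdd.getStorageLevel.description)

// A serialized level trades CPU for a smaller in-memory footprint:
rdd.persist(StorageLevel.MEMORY_ONLY_SER)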
Understanding and optimizing Spark disk usage during a job.
Dear all,

I have a job that crashes before it finishes with "no space left on device", and I noticed that it generates a lot of temporary data on my disk.

To be precise, the job is a simple map job that takes a set of images, extracts local features, and saves these local features as a sequence file. My images are represented as key-value pairs, where the keys are strings representing the id of the image (the filename) and the values are the base64 encoding of the images.

To extract the features, I use an external C program that I call with RDD.pipe. I stream the base64 image to the C program, and it sends back the extracted feature vectors through stdout. Each line represents one feature vector from the current image. I don't use any serialization library; I just write the feature vector elements to stdout, separated by spaces. Once in Spark, I just split the line, create a Scala vector from each value, and save my sequence file.

The overall job looks like the following:

val images: RDD[(String, String)] = ...
val features: RDD[(String, Vector)] = images.pipe(...).map(_.split(" ")...)
features.saveAsSequenceFile(...)

The problem is that for about 3 GB of image data (about 12,000 images), this job generates more than 180 GB of temporary data. That seems strange, since for each image I have about 4000 double feature vectors of dimension 400.

I run the job on my laptop for test purposes, which is why I can't add additional disk space. In any case, I need to understand why this simple job generates so much data, and how I can reduce it.

Best,

Jao
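One hedged aside, not from the thread: a double printed as decimal text typically takes 15-25 bytes rather than 8, so streaming features as space-separated text through the pipe already inflates them a few times over the binary size. Separately, if part of the 180 GB is the final output, sequence-file compression may help; saveAsSequenceFile accepts an optional codec. A minimal sketch, reusing the features RDD from the snippet above (the output path is a placeholder, and this does not shrink intermediate pipe or shuffle data):

import org.apache.hadoop.io.compress.GzipCodec

// Same save as in the job above, but compressed; the path is a placeholder.
features.saveAsSequenceFile("/data/features-compressed", Some(classOf[GzipCodec]))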
Re: Spark Disk Usage
Andrew,

Thanks a lot for the pointer to the code! That answered my question. It looks like Spark tries to write the data to memory first and then, if it doesn't fit, spills it to disk. I'll have to dig in more to figure out the details.

-Suren

On Wed, Apr 9, 2014 at 12:46 PM, Andrew Ash wrote:
> [quoted text elided; Andrew's message appears below]
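For anyone digging into the same details, a minimal sketch of one way to observe the memory-vs-disk split at runtime (assumptions: a live SparkContext named sc, and getRDDStorageInfo, a developer API whose exact shape can vary between Spark versions):

import org.apache.spark.storage.StorageLevel

// Cache something large enough that part of it may not fit in memory.
val data = sc.parallelize(1 to 10000000)
  .map(i => (i, i.toString * 10))
  .persist(StorageLevel.MEMORY_AND_DISK)

data.count() // persist() is lazy; the first action materializes the blocks

// Report how much of each cached RDD landed in memory vs. on disk.
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id}: mem=${info.memSize} B, disk=${info.diskSize} B")
}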
Re: Spark Disk Usage
The groupByKey would be aware of the subsequent persist -- that's part of the reason operations are lazy. As for whether it's materialized in memory first and then flushed to disk versus streamed to disk, I'm not sure of the exact behavior.

What I'd expect to happen is that the RDD is materialized in memory up until it fills the BlockManager. At that point it starts spilling blocks out to disk in order to keep from OOMing. I'm not sure whether new blocks go straight to disk or the BlockManager pages already-existing blocks out to make room for the new ones.

You can always read through the source to figure it out, though!

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L588

On Wed, Apr 9, 2014 at 6:52 AM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
> [quoted text elided; the message appears below]
Re: Spark Disk Usage
Yes, MEMORY_AND_DISK.

We do a groupByKey and then call persist on the resulting RDD. So I'm wondering whether groupByKey is aware of the subsequent persist setting to use disk, or whether it just creates the Seq[V] in memory and only uses disk after that data structure is fully realized in memory.

-Suren

On Wed, Apr 9, 2014 at 9:46 AM, Andrew Ash wrote:
> [quoted text elided; the message appears below]
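For readers following along, a minimal sketch of the pattern under discussion -- groupByKey followed by persist -- showing only the API ordering and laziness, not the internal spill behavior being asked about (sc is an assumed SparkContext):

import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// persist() only marks the RDD; no grouping work happens yet.
val grouped = pairs.groupByKey().persist(StorageLevel.MEMORY_AND_DISK)

grouped.count()                    // the first action computes and caches the groups
grouped.mapValues(_.sum).collect() // re-uses the cached result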
Re: Spark Disk Usage
Which persistence level are you talking about? MEMORY_AND_DISK?

Sent from my mobile phone

On Apr 9, 2014 2:28 PM, "Surendranauth Hiraman" wrote:
> [quoted text elided; the message appears below]
Re: Spark Disk Usage
Thanks, Andrew. That helps.

For 1, it sounds like the data for the RDD is held in memory and then only written to disk after the entire RDD has been realized in memory. Is that correct?

-Suren

On Wed, Apr 9, 2014 at 9:25 AM, Andrew Ash wrote:
> [quoted text elided; the message appears below]
Re: Spark Disk Usage
For 1, persist can be used to save an RDD to disk using the various persistence levels. When a persistence level is set on an RDD, the RDD is saved to memory/disk/elsewhere when it is evaluated, so that it can be re-used. The level applies to that RDD, so subsequent uses of the RDD can use the cached value.

https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence

2. The other place disk is most commonly used is shuffles. If you have data across the cluster that comes from a source, you might not have to hold it all in memory at once. But if you do a shuffle, which scatters the data across the cluster in a certain way, you have to have the memory/disk available for that RDD all at once. In that case, shuffles will sometimes need to spill to disk for large RDDs; this can be controlled with the spark.shuffle.spill setting.

Does that help clarify?

On Mon, Apr 7, 2014 at 10:20 AM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
> [quoted text elided; the message appears below]
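A sketch of the knobs mentioned above, as they existed in Spark of this era (hedged: spark.shuffle.spill defaulted to true and was removed in later releases, and the local directory below is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("disk-usage-demo")
  .setMaster("local[*]")
  .set("spark.shuffle.spill", "true")       // allow shuffles to spill to disk instead of OOMing
  .set("spark.local.dir", "/mnt/spark-tmp") // where shuffle and spill files are written (placeholder)

val sc = new SparkContext(conf)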
Re: Spark Disk Usage
It might help if I clarify my questions. :-)

1. Is persist() applied during the transformation right before the persist() call in the graph? Or is it applied after the transform's processing is complete? In the case of things like GroupBy, is the Seq backed by disk as it is being created? We're trying to get a sense of how the processing is handled behind the scenes with respect to disk.

2. When else is disk used internally?

Any pointers are appreciated.

-Suren

On Mon, Apr 7, 2014 at 8:46 AM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
> [quoted text elided; the message appears below]
Re: Spark Disk Usage
Hi,

Any thoughts on this? Thanks.

-Suren

On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
> [quoted text elided; the original message appears below]
Spark Disk Usage
Hi,

I know that if we call persist with the right options, we can have Spark persist an RDD's data on disk.

I am wondering what happens in intermediate operations that could conceivably create large collections/sequences, like GroupBy and shuffling.

Basically, one part of the question is: when is disk used internally?

And is calling persist() on the RDD returned by such transformations what lets Spark know to use disk in those situations? I'm trying to understand whether persist() is applied during the transformation or after it.

Thank you.

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io