Which persistence level are you talking about? MEMORY_AND_DISK?

Sent from my mobile phone

On Apr 9, 2014 2:28 PM, "Surendranauth Hiraman" <suren.hira...@velos.io> wrote:
> Thanks, Andrew. That helps.
>
> For 1, it sounds like the data for the RDD is held in memory and then only
> written to disk after the entire RDD has been realized in memory. Is that
> correct?
>
> -Suren
>
> On Wed, Apr 9, 2014 at 9:25 AM, Andrew Ash <and...@andrewash.com> wrote:
>
>> For 1, persist can be used to save an RDD to disk using the various
>> persistence levels. When a persistence level is set on an RDD, the RDD is
>> saved to memory/disk/elsewhere when it is evaluated, so that subsequent
>> uses of the RDD can use the cached value.
>>
>> https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence
>>
>> 2. The other place disk is most commonly used is shuffles. If you have
>> data across the cluster that comes from a source, then you might not have
>> to hold it all in memory at once. But if you do a shuffle, which scatters
>> the data across the cluster in a certain way, then you have to have the
>> memory/disk available for that RDD all at once. In that case, shuffles
>> will sometimes need to spill over to disk for large RDDs, which can be
>> controlled with the spark.shuffle.spill setting.
>>
>> Does that help clarify?
>>
>> On Mon, Apr 7, 2014 at 10:20 AM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
>>
>>> It might help if I clarify my questions. :-)
>>>
>>> 1. Is persist() applied during the transformation right before the
>>> persist() call in the graph? Or is it applied after the transform's
>>> processing is complete? In the case of things like GroupBy, is the Seq
>>> backed by disk as it is being created? We're trying to get a sense of how
>>> the processing is handled behind the scenes with respect to disk.
>>>
>>> 2. When else is disk used internally?
>>>
>>> Any pointers are appreciated.
>>>
>>> -Suren
>>>
>>> On Mon, Apr 7, 2014 at 8:46 AM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
>>>
>>>> Hi,
>>>>
>>>> Any thoughts on this? Thanks.
>>>>
>>>> -Suren
>>>>
>>>> On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I know if we call persist with the right options, we can have Spark
>>>>> persist an RDD's data on disk.
>>>>>
>>>>> I am wondering what happens in intermediate operations that could
>>>>> conceivably create large collections/Sequences, like GroupBy and
>>>>> shuffling.
>>>>>
>>>>> Basically, one part of the question is when is disk used internally?
>>>>>
>>>>> And is calling persist() on the RDD returned by such transformations
>>>>> what lets it know to use disk in those situations? Trying to understand
>>>>> if persist() is applied during the transformation or after it.
>>>>>
>>>>> Thank you.
>>>>>
>>>>> SUREN HIRAMAN, VP TECHNOLOGY
>>>>> Velos
>>>>> Accelerating Machine Learning
>>>>>
>>>>> 440 NINTH AVENUE, 11TH FLOOR
>>>>> NEW YORK, NY 10001
>>>>> O: (917) 525-2466 ext. 105
>>>>> F: 646.349.4063
>>>>> E: suren.hiraman@v <suren.hira...@sociocast.com>elos.io
>>>>> W: www.velos.io
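On the question of when persist() takes effect, a toy model may help. The sketch below is not Spark code: `ToyRDD`, `partition`, and `computed` are hypothetical names made up for illustration. It only models two behaviours described in the Spark programming guide, as far as I understand them: calling persist() just records the request, and data is cached one partition at a time when an action first computes it. Storage levels, disk, and shuffles are deliberately left out.

```python
from typing import Callable, Dict, List


class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions are computed lazily."""

    def __init__(self, num_partitions: int,
                 compute_partition: Callable[[int], List[int]]):
        self.num_partitions = num_partitions
        self._compute_partition = compute_partition
        self._persist_requested = False
        self._cache: Dict[int, List[int]] = {}  # partition index -> data
        self.computed: List[int] = []  # log of partition indexes computed

    def persist(self) -> "ToyRDD":
        # Only records the request; nothing is computed or stored here.
        self._persist_requested = True
        return self

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation returns a new lazy ToyRDD; no work happens yet.
        parent = self
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in parent.partition(i)])

    def partition(self, i: int) -> List[int]:
        # Serve from the cache if this partition was stored earlier.
        if i in self._cache:
            return self._cache[i]
        self.computed.append(i)
        data = self._compute_partition(i)
        if self._persist_requested:
            # Cached per partition, as each partition is produced.
            self._cache[i] = data
        return data

    def collect(self) -> List[int]:
        # The "action": this is what forces the partitions to be computed.
        return [x for i in range(self.num_partitions)
                for x in self.partition(i)]


source = [[1, 2], [3, 4]]
base = ToyRDD(2, lambda i: source[i])
doubled = base.map(lambda x: x * 2).persist()

computed_before_action = list(doubled.computed)  # still empty at this point
first = doubled.collect()   # computes and caches both partitions
second = doubled.collect()  # served from the cache, nothing recomputed

print(computed_before_action, first, second, doubled.computed, base.computed)
```

In this model the second collect() does not touch `base` again, which is the re-use Andrew describes. Whether a real MEMORY_AND_DISK partition goes to memory first or straight to disk is not something the toy can answer.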