Yes, MEMORY_AND_DISK. We do a groupByKey and then call persist on the resulting RDD. So I'm wondering whether groupByKey is aware of the subsequent persist setting and uses disk while it builds the Seq[V], or whether it builds the Seq[V] entirely in memory and only uses disk once that data structure is fully realized.
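For concreteness, here is a minimal sketch of the pattern in question. The input data, the object name, the app name, and the "local" master are all made up for illustration and are not from our actual job. It only shows the order of the calls; it does not settle whether the grouped values are built fully in memory before disk is used.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // implicit conversions that add groupByKey to pair RDDs
import org.apache.spark.storage.StorageLevel

object GroupThenPersist {
  def main(args: Array[String]): Unit = {
    // "local" and the app name are placeholders for illustration.
    val sc = new SparkContext("local", "group-then-persist")

    // Hypothetical input: a tiny pair RDD standing in for the real data.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // groupByKey shuffles the data and builds one collection of values per key.
    val grouped = pairs.groupByKey()

    // persist only marks the RDD with a storage level; nothing is computed yet.
    grouped.persist(StorageLevel.MEMORY_AND_DISK)

    // The first action triggers the shuffle, the grouping, and the caching.
    println(grouped.count())

    sc.stop()
  }
}
```

Because persist is lazy, the storage level is already set on the grouped RDD before the first action runs, which is why the question of what groupByKey does internally at that point matters.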
-Suren

On Wed, Apr 9, 2014 at 9:46 AM, Andrew Ash <[email protected]> wrote:

> Which persistence level are you talking about? MEMORY_AND_DISK?
>
> On Apr 9, 2014 2:28 PM, "Surendranauth Hiraman" <[email protected]> wrote:
>
>> Thanks, Andrew. That helps.
>>
>> For 1, it sounds like the data for the RDD is held in memory and then
>> only written to disk after the entire RDD has been realized in memory.
>> Is that correct?
>>
>> -Suren
>>
>> On Wed, Apr 9, 2014 at 9:25 AM, Andrew Ash <[email protected]> wrote:
>>
>>> For 1, persist can be used to save an RDD to disk using the various
>>> persistence levels. When a persistence level is set on an RDD, that
>>> RDD is saved to memory/disk/elsewhere when it is evaluated so that it
>>> can be re-used. It's applied to that RDD, so that subsequent uses of
>>> the RDD can use the cached value.
>>>
>>> https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence
>>>
>>> 2. The other place disk is most commonly used is shuffles. If you have
>>> data across the cluster that comes from a source, then you might not
>>> have to hold it all in memory at once. But if you do a shuffle, which
>>> scatters the data across the cluster in a certain way, then you have
>>> to have the memory/disk available for that RDD all at once. In that
>>> case, shuffles will sometimes need to spill over to disk for large
>>> RDDs, which can be controlled with the spark.shuffle.spill setting.
>>>
>>> Does that help clarify?
>>>
>>> On Mon, Apr 7, 2014 at 10:20 AM, Surendranauth Hiraman <[email protected]> wrote:
>>>
>>>> It might help if I clarify my questions. :-)
>>>>
>>>> 1. Is persist() applied during the transformation right before the
>>>> persist() call in the graph? Or is it applied after the transform's
>>>> processing is complete? In the case of things like GroupBy, is the
>>>> Seq backed by disk as it is being created? We're trying to get a
>>>> sense of how the processing is handled behind the scenes with
>>>> respect to disk.
>>>>
>>>> 2. When else is disk used internally?
>>>>
>>>> Any pointers are appreciated.
>>>>
>>>> -Suren
>>>>
>>>> On Mon, Apr 7, 2014 at 8:46 AM, Surendranauth Hiraman <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Any thoughts on this? Thanks.
>>>>>
>>>>> -Suren
>>>>>
>>>>> On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I know if we call persist with the right options, we can have Spark
>>>>>> persist an RDD's data on disk.
>>>>>>
>>>>>> I am wondering what happens in intermediate operations that could
>>>>>> conceivably create large collections/Sequences, like GroupBy and
>>>>>> shuffling.
>>>>>>
>>>>>> Basically, one part of the question is when is disk used internally?
>>>>>>
>>>>>> And is calling persist() on the RDD returned by such transformations
>>>>>> what lets it know to use disk in those situations? Trying to
>>>>>> understand if persist() is applied during the transformation or
>>>>>> after it.
>>>>>>
>>>>>> Thank you.

--
SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@v <[email protected]>elos.io
W: www.velos.io
