Hi Patrick,

For the spilling-within-one-key work you mention might land in Spark 1.2, is that being tracked in https://issues.apache.org/jira/browse/SPARK-1823, or is there another ticket I should be following?
Thanks!
Andrew

On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> Hi Jens,
>
> Within a partition things will spill - so the current documentation is
> correct. This spilling can only occur *across keys* at the moment; spilling
> cannot occur within a single key at present.
>
> This is discussed in the video here:
> https://www.youtube.com/watch?v=dmL0N3qfSc8&index=3&list=PL-x35fyliRwj7qNxXLgMRJaOk7o9inHBZ
>
> Spilling within one key for groupBys is likely to end up in the next
> release of Spark, Spark 1.2. In most cases where users hit this, they are
> actually just trying to do aggregations, which would be more efficiently
> implemented without the groupBy operator.
>
> If the goal is literally just to write out to disk all the values
> associated with each group, and the values associated with a single group
> are larger than can fit in memory, this cannot be accomplished right now
> with the groupBy operator.
>
> The best way to work around this depends a bit on what you are trying to
> do with the data downstream. Typical approaches involve sub-dividing any
> very large groups, for instance, appending a hashed value in a small range
> (1-10) to large keys. Then your downstream code has to deal with
> aggregating partial values for each group. If your goal is just to lay each
> group out sequentially on disk in one big file, you can call `sortByKey`
> with a hashed suffix as well. The sort functions are externalized in Spark
> 1.1 (which is in pre-release).
>
> - Patrick
>
>
> On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti <sp...@jkg.dk> wrote:
>
>> Patrick Wendell wrote
>> > In the latest version of Spark we've added documentation to make this
>> > distinction more clear to users:
>> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390
>>
>> That is a very good addition to the documentation. Nice and clear about
>> the "dangers" of groupBy.
>>
>>
>> Patrick Wendell wrote
>> > Currently groupBy requires that all of the values for one key can fit
>> > in memory.
>>
>> Is that really true? Will partitions not spill to disk, hence the
>> recommendation in the documentation to increase the parallelism of
>> groupBy et al.?
>>
>> A better question might be: how exactly does partitioning affect groupBy
>> with regard to memory consumption? What will *have* to fit in memory, and
>> what may be spilled to disk if memory runs out?
>>
>> And if it really is true that Spark requires all of a group's values to
>> fit in memory, how do I do an "on-disk" grouping of results, similar to
>> what I'd do in a Hadoop job by using a mapper emitting (groupId, value)
>> key-value pairs and having an identity reducer write the results to disk?
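
PS: for anyone else following along, here is a minimal sketch of the workarounds Patrick describes, against the Spark 1.1 Scala API. The input path, tab-separated format, salt range (0-9), and sum aggregation are illustrative assumptions on my part, not anything from this thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair RDD functions in Spark 1.x

    object GroupByWorkarounds {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("groupby-workarounds"))

        // Hypothetical input: tab-separated (key, numeric value) lines.
        val pairs = sc.textFile("hdfs:///input").map { line =>
          val Array(k, v) = line.split("\t", 2)
          (k, v.toLong)
        }

        // 1) If the real goal is an aggregation, skip groupBy entirely:
        //    reduceByKey combines values map-side, so no key's full value
        //    list ever has to sit in memory at once.
        val sums = pairs.reduceByKey(_ + _)

        // 2) Salting: append a hashed suffix in a small range (0-9) so one
        //    huge group becomes ten smaller ones.
        val salted = pairs.map { case (k, v) =>
          val salt = (v.hashCode % 10 + 10) % 10  // keep the suffix non-negative
          ((k, salt), v)
        }

        //    Downstream code then merges the partial per-salt results.
        val merged = salted
          .reduceByKey(_ + _)                 // partial aggregate per (key, salt)
          .map { case ((k, _), v) => (k, v) } // strip the salt
          .reduceByKey(_ + _)                 // final aggregate per key

        // 3) To instead lay each group out sequentially on disk, sort on the
        //    salted key: Spark 1.1's externalized sort spills as needed, and
        //    ordering by (key, salt) keeps each key's sub-groups adjacent.
        salted.sortByKey().saveAsTextFile("hdfs:///output")

        sc.stop()
      }
    }

The trade-off in (2) is one extra shuffle for the final merge, but with only ~10 salts per hot key that is usually cheap compared to a single task running out of memory on one giant group.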