Hi Jens,

Within a partition things will spill, so the current documentation is correct. However, this spilling can only occur *across* keys at the moment; spilling cannot occur *within* a single key at present.
This is discussed in the video here:
https://www.youtube.com/watch?v=dmL0N3qfSc8&index=3&list=PL-x35fyliRwj7qNxXLgMRJaOk7o9inHBZ

Spilling within one key for groupBy is likely to end up in the next release of Spark, Spark 1.2.

In most cases we see when users hit this, they are actually just trying to do aggregations, which would be more efficiently implemented without the groupBy operator. If the goal is literally to write out to disk all the values associated with each group, and the values associated with a single group are larger than what fits in memory, this cannot be accomplished right now with the groupBy operator.

The best way to work around this depends a bit on what you are trying to do with the data downstream. Typical approaches involve sub-dividing any very large groups, for instance by appending a hashed value in a small range (1-10) to large keys. Your downstream code then has to deal with aggregating partial values for each group. If your goal is just to lay each group out sequentially on disk in one big file, you can call `sortByKey` with a hashed suffix as well. The sort functions are externalized in Spark 1.1 (which is in pre-release).

- Patrick

On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti <sp...@jkg.dk> wrote:
> Patrick Wendell wrote
> > In the latest version of Spark we've added documentation to make this
> > distinction more clear to users:
> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390
>
> That is a very good addition to the documentation. Nice and clear about the
> "dangers" of groupBy.
>
> Patrick Wendell wrote
> > Currently groupBy requires that all
> > of the values for one key can fit in memory.
>
> Is that really true? Will partitions not spill to disk, hence the
> recommendation in the documentation to up the parallelism of groupBy et al?
>
> A better question might be: how exactly does partitioning affect groupBy
> with regards to memory consumption.
> What will **have** to fit in memory, and what may be spilled to disk, if
> running out of memory?
>
> And if it really is true that Spark requires all of a group's values to fit
> in memory, how do I do an "on-disk" grouping of results, similar to what I'd
> do in a Hadoop job by using a mapper emitting (groupId, value) key-value
> pairs, and having an identity reducer writing results to disk?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11487.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
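The salting workaround Patrick describes can be sketched in plain Scala (no Spark needed, so the mechanics are easy to see). Here a running index stands in for the hashed suffix, and all names (`data`, `salted`, `partial`, `merged`) are illustrative:

```scala
// Hypothetical data: three values under the "large" key "a", one under "b".
val data = Seq(("a", 1), ("a", 2), ("b", 3), ("a", 4))

// Sub-divide each group by appending a salt in a small range (0-9) to the key.
// On an RDD you would typically derive the salt from a hash or Random.nextInt(10).
val salted = data.zipWithIndex.map { case ((k, v), i) => ((k, i % 10), v) }

// Aggregate partial values per salted key (reduceByKey on an RDD).
val partial = salted
  .groupBy(_._1)
  .map { case (saltedKey, vs) => (saltedKey, vs.map(_._2).sum) }

// Strip the salt and merge the partial aggregates per original key.
val merged = partial.toSeq
  .map { case ((k, _), v) => (k, v) }
  .groupBy(_._1)
  .map { case (k, vs) => (k, vs.map(_._2).sum) }
// merged == Map("a" -> 7, "b" -> 3)
```

On a real RDD the same shape applies: `map` to the salted key, `reduceByKey` for the partial aggregates, then `map` plus a second `reduceByKey` to combine them. Each salted group is a fraction of the original group's size, which is exactly what the sub-dividing buys you.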