Hi Jens,

Within a partition things will spill - so the current documentation is
correct. This spilling can only occur *across keys* at the moment;
spilling cannot currently happen within a single key.
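
To make that concrete, here is a minimal sketch (the `pairs` RDD and its
types are just placeholders) of what groupBy hands back: the entire
Iterable of values for one key is materialized in memory, while the
shuffle may spill between different keys within a partition.

  import org.apache.spark.rdd.RDD

  // Hypothetical keyed RDD, e.g. built from sc.textFile(...).map(...)
  val pairs: RDD[(String, Int)] = ???

  // The Iterable[Int] for each key is held fully in memory when it is
  // consumed; spilling during the shuffle happens across keys, not
  // inside one key's values.
  val grouped: RDD[(String, Iterable[Int])] = pairs.groupByKey()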

This is discussed in the video here:
https://www.youtube.com/watch?v=dmL0N3qfSc8&index=3&list=PL-x35fyliRwj7qNxXLgMRJaOk7o9inHBZ

Spilling within one key for groupBy is likely to end up in the next
release of Spark, Spark 1.2. In most cases we see, when users hit this
they are actually just trying to do aggregations, which would be
implemented more efficiently without the groupBy operator.
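
For example, a per-key sum is better expressed with reduceByKey, which
combines values map-side and never needs all of a key's values in memory
at once. A minimal sketch, reusing the hypothetical `pairs` RDD from
above:

  // Instead of materializing every value for a key...
  val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)

  // ...combine values incrementally, map-side and reduce-side.
  val sumsViaReduce = pairs.reduceByKey(_ + _)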

If the goal is literally just to write out to disk all the values
associated with each group, and the values associated with a single
group are larger than will fit in memory, this cannot be accomplished
right now with the groupBy operator.

The best way to work around this depends a bit on what you are trying to
do with the data downstream. Typical approaches involve sub-dividing any
very large groups, for instance by appending a hashed value in a small
range (1-10) to large keys. Then your downstream code has to deal with
aggregating partial values for each group. If your goal is just to lay
each group out sequentially on disk in one big file, you can call
`sortByKey` with a hashed suffix as well. The sort functions are
externalized in Spark 1.1 (which is in pre-release).
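
Here is a rough sketch of that workaround (all names and the output path
are made up for illustration; again assuming the `pairs` RDD from above):

  // Append a small hashed suffix (0-9) so one huge group is split into
  // up to 10 smaller sub-groups that each fit in memory.
  val salted = pairs.map { case (k, v) =>
    ((k, (v.hashCode & Int.MaxValue) % 10), v)
  }

  // Option A: aggregate partial values per sub-group, then merge the
  // partials for each original key downstream.
  val partial = salted.reduceByKey(_ + _)
  val merged  = partial.map { case ((k, _), v) => (k, v) }.reduceByKey(_ + _)

  // Option B: if the goal is just to lay each group out contiguously on
  // disk, sort by the salted key (external sort in Spark 1.1) and write
  // the records out.
  salted.sortByKey()
    .map { case ((k, _), v) => k + "\t" + v }
    .saveAsTextFile("hdfs:///tmp/grouped-output")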

- Patrick


On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti <sp...@jkg.dk> wrote:

> Patrick Wendell wrote
> > In the latest version of Spark we've added documentation to make this
> > distinction more clear to users:
> >
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390
>
> That is a very good addition to the documentation. Nice and clear about the
> "dangers" of groupBy.
>
>
> Patrick Wendell wrote
> > Currently groupBy requires that all
> > of the values for one key can fit in memory.
>
> Is that really true? Will partitions not spill to disk, hence the
> recommendation in the documentation to up the parallelism of groupBy et al?
>
> A better question might be: How exactly does partitioning affect groupBy
> with regard to memory consumption? What will **have** to fit in memory,
> and what may be spilled to disk if running out of memory?
>
> And if it really is true that Spark requires all of a group's values to
> fit in memory, how do I do an "on-disk" grouping of results, similar to
> what I'd do in a Hadoop job by using a mapper emitting (groupId, value)
> key-value pairs and having an identity reducer write results to disk?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11487.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
