Hi Patrick,

For the spilling-within-one-key work you mention might land in Spark 1.2, is
that being tracked in https://issues.apache.org/jira/browse/SPARK-1823, or
is there another ticket I should be following?

Thanks!
Andrew


On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> Hi Jens,
>
> Within a partition things will spill - so the current documentation is
> correct. At the moment this spilling can only occur *across keys*; it
> cannot yet occur within a single key.
>
> This is discussed in the video here:
>
> https://www.youtube.com/watch?v=dmL0N3qfSc8&index=3&list=PL-x35fyliRwj7qNxXLgMRJaOk7o9inHBZ
>
> Spilling within one key for groupBy is likely to land in the next
> release of Spark, Spark 1.2. In most cases where we see users hit this, they
> are really just trying to do aggregations, which would be more efficiently
> implemented without the groupBy operator.
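>
> For example (just a sketch of the idea, with illustrative names), a per-key
> sum written with reduceByKey never needs to hold an entire group in memory:
>
>   // assumes an existing RDD[(String, Int)] called counts
>   val sums = counts.reduceByKey(_ + _)
>
>   // the groupBy version materializes every value for a key before summing
>   val sumsViaGroup = counts.groupByKey().mapValues(_.sum)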
>
> If the goal is literally to just write out to disk all the values
> associated with each group, and the values associated with a single group
> are larger than fit in memory, this cannot be accomplished right now with
> the groupBy operator.
>
> The best way to work around this depends a bit on what you are trying to
> do with the data downstream. Typical approaches involve sub-dividing any
> very large groups, for instance by appending a hashed value in a small range
> (1-10) to large keys. Your downstream code then has to deal with
> aggregating partial values for each group. If your goal is just to lay each
> group out sequentially on disk in one big file, you can call `sortByKey`
> with a hashed suffix as well. The sort functions are externalized in Spark
> 1.1 (which is in pre-release).
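>
> Roughly, the salting idea looks like this (a sketch only; the names and the
> suffix range of 10 are illustrative, not prescribed):
>
>   // assumes an existing RDD[(String, String)] called pairs
>   val salted = pairs.map { case (k, v) =>
>     ((k, (v.hashCode & Int.MaxValue) % 10), v)   // non-negative suffix in 0-9
>   }
>   val grouped = salted.groupByKey()   // each sub-group is a fraction of the original
>
>   // or, to lay the groups out contiguously on disk, sort by the salted key
>   val sorted = salted.sortByKey()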
>
> - Patrick
>
>
> On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti <sp...@jkg.dk> wrote:
>
>> Patrick Wendell wrote
>> > In the latest version of Spark we've added documentation to make this
>> > distinction more clear to users:
>> >
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390
>>
>> That is a very good addition to the documentation. Nice and clear about
>> the
>> "dangers" of groupBy.
>>
>>
>> Patrick Wendell wrote
>> > Currently groupBy requires that all
>> > of the values for one key can fit in memory.
>>
>> Is that really true? Will partitions not spill to disk, hence the
>> recommendation in the documentation to up the parallelism of groupBy et
>> al?
>>
>> A better question might be: how exactly does partitioning affect groupBy
>> with regard to memory consumption? What will **have** to fit in memory, and
>> what may be spilled to disk if memory runs out?
>>
>> And if it really is true that Spark requires all of a group's values to fit
>> in memory, how do I do an "on-disk" grouping of results, similar to what I'd
>> do in a Hadoop job by using a mapper emitting (groupId, value) key-value
>> pairs and having an identity reducer write the results to disk?
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11487.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
