Hey Andrew,

We might create a new JIRA for it, but it doesn't exist yet. We'll create
JIRAs for the major 1.2 issues at the beginning of September.
- Patrick

On Mon, Aug 25, 2014 at 8:53 AM, Andrew Ash <and...@andrewash.com> wrote:
> Hi Patrick,
>
> For the spilling-within-one-key work you mention might land in Spark 1.2,
> is that being tracked in https://issues.apache.org/jira/browse/SPARK-1823
> or is there another ticket I should be following?
>
> Thanks!
> Andrew
>
>
> On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>
>> Hi Jens,
>>
>> Within a partition things will spill - so the current documentation is
>> correct. This spilling can only occur *across keys* at the moment.
>> Spilling cannot occur within a key at present.
>>
>> This is discussed in the video here:
>>
>> https://www.youtube.com/watch?v=dmL0N3qfSc8&index=3&list=PL-x35fyliRwj7qNxXLgMRJaOk7o9inHBZ
>>
>> Spilling within one key for GroupBys is likely to end up in the next
>> release of Spark, Spark 1.2. In most cases we see when users hit this,
>> they are actually trying to just do aggregations, which would be more
>> efficiently implemented without the groupBy operator.
>>
>> If the goal is literally to just write out to disk all the values
>> associated with each group, and the values associated with a single
>> group are larger than can fit in memory, this cannot be accomplished
>> right now with the groupBy operator.
>>
>> The best way to work around this depends a bit on what you are trying
>> to do with the data downstream. Typical approaches involve sub-dividing
>> any very large groups, for instance, appending a hashed value in a small
>> range (1-10) to large keys. Then your downstream code has to deal with
>> aggregating partial values for each group. If your goal is just to lay
>> each group out sequentially on disk in one big file, you can call
>> `sortByKey` with a hashed suffix as well. The sort functions are
>> externalized in Spark 1.1 (which is in pre-release).
>>
>> - Patrick
>>
>>
>> On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti <sp...@jkg.dk> wrote:
>>
>>> Patrick Wendell wrote
>>> > In the latest version of Spark we've added documentation to make this
>>> > distinction more clear to users:
>>> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390
>>>
>>> That is a very good addition to the documentation. Nice and clear about
>>> the "dangers" of groupBy.
>>>
>>>
>>> Patrick Wendell wrote
>>> > Currently groupBy requires that all
>>> > of the values for one key can fit in memory.
>>>
>>> Is that really true? Will partitions not spill to disk, hence the
>>> recommendation in the documentation to up the parallelism of groupBy
>>> et al?
>>>
>>> A better question might be: how exactly does partitioning affect groupBy
>>> with regard to memory consumption? What will *have* to fit in memory,
>>> and what may be spilled to disk if running out of memory?
>>>
>>> And if it really is true that Spark requires all of a group's values to
>>> fit in memory, how do I do an "on-disk" grouping of results, similar to
>>> what I'd do in a Hadoop job by using a mapper emitting (groupId, value)
>>> key-value pairs, and having an identity reducer writing results to disk?
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11487.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
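
For reference, a minimal sketch of the sub-dividing ("salting") workaround
Patrick describes above. This is not from the thread: the input RDD `pairs`,
the bucket count `numBuckets`, and the file paths are all made up for
illustration, and it assumes the Spark 1.x RDD API.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair/ordered RDD implicits in Spark 1.x

    val sc = new SparkContext(new SparkConf().setAppName("salting-sketch"))

    // Hypothetical input: tab-separated (key, value) lines.
    val pairs = sc.textFile("hdfs:///tmp/input").map { line =>
      val Array(k, v) = line.split("\t", 2)
      (k, v)
    }

    // Append a hashed suffix in a small range to each key, so one very large
    // group becomes numBuckets smaller ones.
    val numBuckets = 10
    val salted = pairs.map { case (k, v) =>
      val bucket = (v.hashCode & Int.MaxValue) % numBuckets
      ((k, bucket), v)
    }

    // Option 1: group the sub-divided keys; downstream code must then merge
    // the partial groups for each original key.
    val partialGroups = salted.groupByKey()

    // Option 2: lay each group out sequentially on disk by sorting on the
    // composite key, as Patrick suggests.
    salted.sortByKey().saveAsTextFile("hdfs:///tmp/grouped-output")

    // If the real goal is an aggregate (e.g. a count per key), skip groupBy
    // entirely and combine map-side instead:
    val counts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)

Whether Option 1 or 2 fits depends on what the downstream job needs; the
reduceByKey variant is usually the cheapest when a per-key aggregate is all
that is wanted.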