Hey Andrew,

We might create a new JIRA for it, but it doesn't exist yet. We'll create
JIRAs for the major 1.2 issues at the beginning of September.
- Patrick

On Mon, Aug 25, 2014 at 8:53 AM, Andrew Ash <and...@andrewash.com> wrote:
> Hi Patrick,
>
> For the spilling-within-one-key work you mention might land in Spark 1.2,
> is that being tracked in https://issues.apache.org/jira/browse/SPARK-1823
> or is there another ticket I should be following?
>
> Thanks!
> Andrew
>
>
> On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>
>> Hi Jens,
>>
>> Within a partition things will spill - so the current documentation is
>> correct. This spilling can only occur *across keys* at the moment.
>> Spilling cannot occur within a key at present.
>>
>> This is discussed in the video here:
>>
>> https://www.youtube.com/watch?v=dmL0N3qfSc8&index=3&list=PL-x35fyliRwj7qNxXLgMRJaOk7o9inHBZ
>>
>> Spilling within one key for GroupBys is likely to end up in the next
>> release of Spark, Spark 1.2. In most cases we see when users hit this,
>> they are actually trying to just do aggregations, which would be more
>> efficiently implemented without the groupBy operator.
>>
>> If the goal is literally to just write out to disk all the values
>> associated with each group, and the values associated with a single
>> group are larger than can fit in memory, this cannot be accomplished
>> right now with the groupBy operator.
>>
>> The best way to work around this depends a bit on what you are trying
>> to do with the data downstream. Typical approaches involve sub-dividing
>> any very large groups, for instance, appending a hashed value in a small
>> range (1-10) to large keys. Then your downstream code has to deal with
>> aggregating partial values for each group. If your goal is just to lay
>> each group out sequentially on disk in one big file, you can call
>> `sortByKey` with a hashed suffix as well. The sort functions are
>> externalized in Spark 1.1 (which is in pre-release).
>>
>> - Patrick
>>
>>
>> On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti <sp...@jkg.dk> wrote:
>>
>>> Patrick Wendell wrote
>>> > In the latest version of Spark we've added documentation to make this
>>> > distinction more clear to users:
>>> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390
>>>
>>> That is a very good addition to the documentation. Nice and clear about
>>> the "dangers" of groupBy.
>>>
>>>
>>> Patrick Wendell wrote
>>> > Currently groupBy requires that all
>>> > of the values for one key can fit in memory.
>>>
>>> Is that really true? Will partitions not spill to disk, hence the
>>> recommendation in the documentation to up the parallelism of groupBy
>>> et al?
>>>
>>> A better question might be: how exactly does partitioning affect groupBy
>>> with regard to memory consumption? What will *have* to fit in memory,
>>> and what may be spilled to disk if running out of memory?
>>>
>>> And if it really is true that Spark requires all of a group's values to
>>> fit in memory, how do I do an "on-disk" grouping of results, similar to
>>> what I'd do in a Hadoop job by using a mapper emitting (groupId, value)
>>> key-value pairs, and having an identity reducer writing results to disk?
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11487.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
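
For reference, a minimal sketch of the sub-dividing ("salting") workaround
Patrick describes above. This is not from the thread: the input RDD `pairs`,
the bucket count `numBuckets`, and the file paths are all made up for
illustration, and it assumes the Spark 1.x RDD API.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair/ordered RDD implicits in Spark 1.x

    val sc = new SparkContext(new SparkConf().setAppName("salting-sketch"))

    // Hypothetical input: tab-separated (key, value) lines.
    val pairs = sc.textFile("hdfs:///tmp/input").map { line =>
      val Array(k, v) = line.split("\t", 2)
      (k, v)
    }

    // Append a hashed suffix in a small range to each key, so one very large
    // group becomes numBuckets smaller ones.
    val numBuckets = 10
    val salted = pairs.map { case (k, v) =>
      val bucket = (v.hashCode & Int.MaxValue) % numBuckets
      ((k, bucket), v)
    }

    // Option 1: group the sub-divided keys; downstream code must then merge
    // the partial groups for each original key.
    val partialGroups = salted.groupByKey()

    // Option 2: lay each group out sequentially on disk by sorting on the
    // composite key, as Patrick suggests.
    salted.sortByKey().saveAsTextFile("hdfs:///tmp/grouped-output")

    // If the real goal is an aggregate (e.g. a count per key), skip groupBy
    // entirely and combine map-side instead:
    val counts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)

Whether Option 1 or 2 fits depends on what the downstream job needs; the
reduceByKey variant is usually the cheapest when a per-key aggregate is all
that is wanted.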