Patrick Wendell wrote
> In the latest version of Spark we've added documentation to make this
> distinction more clear to users:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390

That is a very good addition to the documentation. Nice and clear about the
"dangers" of groupBy.


Patrick Wendell wrote
> Currently groupBy requires that all
> of the values for one key can fit in memory. 

Is that really true? Won't partitions spill to disk, which is presumably why the
documentation recommends upping the parallelism of groupBy et al.?

A better question might be: how exactly does partitioning affect groupBy
with regard to memory consumption? What will **have** to fit in memory, and
what may be spilled to disk if memory runs out?
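For concreteness, this is the kind of call I mean by upping the parallelism.
The data and the partition count below are made up, just to illustrate the API:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf

    val sc = new SparkContext(new SparkConf().setAppName("groupBy-parallelism"))

    // Toy keyed data: 100 keys, one million values in total.
    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))

    // Default parallelism: each task builds the full value collection for
    // every key that hashes to its partition.
    val grouped = pairs.groupByKey()

    // Explicit, higher parallelism: keys are spread over more tasks, so each
    // task holds fewer groups at once. All values of any single key still
    // end up in the same task, though.
    val groupedWide = pairs.groupByKey(200)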

And if it really is true that Spark requires all of a group's values to fit in
memory, how do I do an "on-disk" grouping of results, similar to what I'd do
in a Hadoop job by using a mapper that emits (groupId, value) key-value pairs
and an identity reducer that writes the results to disk?
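To make that concrete, here is roughly the Hadoop-style pattern I have in mind,
sketched with the RDD API. The input path, key extraction and output path are
invented for the example, and I'm not claiming this is how Spark behaves
internally:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf

    val sc = new SparkContext(new SparkConf().setAppName("on-disk-grouping"))

    // "Mapper": emit (groupId, value) pairs.
    val records = sc.textFile("hdfs:///input/records")
    val keyed = records.map(line => (line.split("\t")(0), line))

    // "Identity reducer": bring each group's values together via the shuffle
    // and write them straight to disk, rather than materializing one
    // in-memory collection per key the way groupByKey does.
    keyed.sortByKey(ascending = true, numPartitions = 200)
         .map { case (k, v) => k + "\t" + v }
         .saveAsTextFile("hdfs:///output/grouped")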




