Patrick Wendell wrote:
> In the latest version of Spark we've added documentation to make this
> distinction more clear to users:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390
That is a very good addition to the documentation: nice and clear about the "dangers" of groupBy.

Patrick Wendell wrote:
> Currently groupBy requires that all
> of the values for one key can fit in memory.

Is that really true? Will partitions not spill to disk, hence the recommendation in the documentation to increase the parallelism of groupBy et al.?

A better question might be: how exactly does partitioning affect groupBy with regard to memory consumption? What will **have** to fit in memory, and what may be spilled to disk if memory runs out?

And if it really is true that Spark requires all of a group's values to fit in memory, how do I do an "on-disk" grouping of results, similar to what I'd do in a Hadoop job by using a mapper emitting (groupId, value) key-value pairs and an identity reducer writing results to disk?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11487.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
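To make the Hadoop-style pattern concrete, here is a plain-Scala sketch of the idea (not actual Spark or Hadoop code, and using in-memory collections only for illustration): a "mapper" emits (groupId, value) pairs, and sorting by key makes each group contiguous, so an identity "reducer" could stream one record at a time to disk without ever holding a whole group in memory. The `groupId` function and the tab-separated record format are assumptions made up for the example; in Spark the analogous steps would be something like `sortByKey()` followed by `saveAsTextFile()`.

```scala
object SortGroupingSketch {
  // Hypothetical record format: tab-separated "groupId<TAB>payload" lines.
  def groupId(line: String): String = line.takeWhile(_ != '\t')

  // "Mapper" step: emit (groupId, value) key-value pairs.
  def mapPhase(lines: Seq[String]): Seq[(String, String)] =
    lines.map(l => (groupId(l), l))

  // Sorting by key clusters each group's records contiguously. A downstream
  // identity "reducer" could then write records out one at a time, switching
  // output whenever the key changes, so no single group must fit in memory.
  // (In a real shuffle the sort itself may spill to disk.)
  def sortPhase(pairs: Seq[(String, String)]): Seq[(String, String)] =
    pairs.sortBy(_._1) // sortBy is stable, so records within a group keep their order

  def main(args: Array[String]): Unit = {
    val lines = Seq("b\tx", "a\ty", "b\tz")
    sortPhase(mapPhase(lines)).foreach(println)
  }
}
```

This is only a sketch of the sort-based grouping trick; whether Spark's groupBy actually behaves this way is exactly the question above.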