Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-05 Thread Jens Kristian Geyti
Patrick Wendell wrote:
> In the latest version of Spark we've added documentation to make this
> distinction more clear to users:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390

That is a very good addition to the documentation. Nic
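The distinction that documentation draws is that a groupBy-style operation must hold every value for a key in memory at once, while a reduce-style operation folds values as they arrive and keeps only one running result per key. A minimal sketch of that difference, in plain Python rather than Spark (the data and names here are illustrative, not from the thread):

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]

# groupByKey analogue: materializes every value for each key,
# so memory use grows with the number of values per key.
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

# reduceByKey analogue: folds incrementally; only one running
# value is kept per key, regardless of how many values arrive.
summed = {}
for k, v in pairs:
    summed[k] = summed.get(k, 0) + v

print(dict(grouped))  # {'a': [1, 3, 4], 'b': [2, 5]}
print(summed)         # {'a': 8, 'b': 7}
```

The same asymmetry is why a hot key can blow up a Spark groupBy even when the aggregate you ultimately want would fit comfortably in memory.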

Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-05 Thread Jens Kristian Geyti
I'm doing a simple groupBy on a fairly small dataset (80 files in HDFS, a few gigs in total, line-based, 500-2000 chars per line). I'm running Spark on 8 low-memory machines in a YARN cluster, i.e. something along the lines of:

spark-submit ... --master yarn-client --num-executors 8 --executor-me
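For context, a full invocation along those lines might look like the sketch below. The executor count matches the post; the memory value, class name, and jar are hypothetical placeholders, not recovered from the truncated command:

```shell
# Hypothetical low-memory yarn-client submit; every value other than
# --num-executors 8 is an assumption, not from the original post.
spark-submit \
  --master yarn-client \
  --num-executors 8 \
  --executor-memory 2g \
  --class com.example.MyJob \
  my-job.jar
```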