Patrick Wendell wrote:
> In the latest version of Spark we've added documentation to make this
> distinction more clear to users:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390
That is a very good addition to the documentation.
Nic
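For anyone following along, the distinction that doc note draws can be sketched roughly like this (a minimal word-count example; the input path and field split are illustrative, not from this thread):

```scala
// groupByKey vs reduceByKey: groupByKey ships and buffers every value for a
// key across the shuffle before you aggregate, while reduceByKey combines
// partial results map-side first, shuffling far less data.
val pairs = sc.textFile("hdfs:///data/input")        // hypothetical path
  .map(line => (line.split('\t')(0), 1))             // hypothetical key field

// Can blow up on skewed keys: all values for one key sit in memory together.
val countsViaGroup = pairs.groupByKey().mapValues(_.sum)

// Usually preferred: partial sums are computed before the shuffle.
val countsViaReduce = pairs.reduceByKey(_ + _)
```

On low-memory executors like the ones described below, that buffering difference is often exactly what separates a job that finishes from one that spills or OOMs.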
I'm doing a simple groupBy on a fairly small dataset (80 files in HDFS, a few
GB in total, line-based, 500-2000 characters per line). I'm running Spark on 8
low-memory machines in a YARN cluster, i.e. something along the lines of:
spark-submit ... --master yarn-client --num-executors 8
--executor-me
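(The quoted command is cut off above; for context, a typical low-memory yarn-client submission might look like the following. The class name, jar, path, and memory values here are illustrative guesses, not the poster's actual settings.)

```shell
spark-submit \
  --class com.example.GroupByJob \
  --master yarn-client \
  --num-executors 8 \
  --executor-memory 1g \
  --executor-cores 1 \
  groupby-job.jar hdfs:///data/input
```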