Spark spark.shuffle.memoryFraction has no effect

2015-07-21 Thread wdbaruni
Hi, I am testing Spark on Amazon EMR using Python and the basic wordcount example shipped with Spark. After running the application, I realized that in Stage 0, reduceByKey(add), around 2.5 GB of shuffle data is spilled to memory and 4 GB is spilled to disk. Since in the wordcount example I am not
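A minimal sketch (assuming Spark 1.x on EMR, where the legacy spark.shuffle.memoryFraction setting still applies) of the kind of wordcount job being described, with the fraction raised from its 0.2 default; the S3 paths and the 0.4 value are illustrative assumptions, not from the original thread. Note that this fraction only sizes the in-memory aggregation buffers: once they fill, data still spills to disk, so some spill can remain even after raising it.

from operator import add
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("wordcount-shuffle-test")
        # legacy Spark 1.x setting; default is 0.2 of the executor heap
        .set("spark.shuffle.memoryFraction", "0.4")
        # spilling stays enabled so the job does not fail when buffers fill
        .set("spark.shuffle.spill", "true"))
sc = SparkContext(conf=conf)

counts = (sc.textFile("s3://my-bucket/input.txt")      # hypothetical S3 path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add))                          # the Stage 0 aggregation that spills
counts.saveAsTextFile("s3://my-bucket/wordcount-output")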

Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-21 Thread wdbaruni
I am new to Spark and I understand that Spark divides the executor memory into the following fractions: *RDD Storage:* which Spark uses to store persisted RDDs via .persist() or .cache(), and which can be set with spark.storage.memoryFraction (default 0.6). *Shuffle and aggregation buffers:*
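A minimal sketch (assuming the legacy Spark 1.x memory model described above) of how the two fractions relate to persisted RDDs on one hand and shuffle-heavy operators on the other; the fraction values and the S3 path below are illustrative assumptions.

from pyspark import SparkConf, SparkContext, StorageLevel

conf = (SparkConf()
        .setAppName("memory-fraction-demo")
        # fraction of executor heap reserved for persisted/cached RDDs (default 0.6)
        .set("spark.storage.memoryFraction", "0.5")
        # fraction reserved for shuffle and aggregation buffers (default 0.2)
        .set("spark.shuffle.memoryFraction", "0.3"))
sc = SparkContext(conf=conf)

lines = sc.textFile("s3://my-bucket/input.txt")    # hypothetical path
cached = lines.persist(StorageLevel.MEMORY_ONLY)   # counts against spark.storage.memoryFraction
# the shuffle in reduceByKey draws on spark.shuffle.memoryFraction
pairs = cached.map(lambda l: (len(l), 1)).reduceByKey(lambda a, b: a + b)
print(pairs.count())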