Hi
I am testing Spark on Amazon EMR using Python and the basic wordcount
example shipped with Spark.
After running the application, I noticed that in Stage 0, reduceByKey(add),
around 2.5 GB of shuffle data is spilled to memory and 4 GB is spilled to
disk. Since in the wordcount example I am not persisting or caching any
RDDs, I am trying to understand where this spill comes from.
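For reference, this is roughly what I am running (a sketch based on the
wordcount example that ships with Spark; the S3 input path is just a
placeholder):

from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")

counts = (sc.textFile("s3://my-bucket/input.txt")   # placeholder input path
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(add))                       # the stage where the spill shows up

for (word, count) in counts.collect():
    print("%s: %i" % (word, count))

sc.stop()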
I am new to Spark and I understand that Spark divides the executor memory
into the following fractions:
*RDD Storage:* used to store RDDs persisted with .persist() or .cache();
its share of executor memory is set with spark.storage.memoryFraction
(default 0.6)
*Shuffle and aggregation buffers:* used to store intermediate shuffle
output and aggregation results; its share is set with
spark.shuffle.memoryFraction (default 0.2)
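In case it matters, this is how I understand these fractions can be
adjusted (just a sketch; the values below are illustrative, not what I
actually set):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("PythonWordCount")
        .set("spark.storage.memoryFraction", "0.4")    # share for cached/persisted RDDs
        .set("spark.shuffle.memoryFraction", "0.4"))   # share for shuffle/aggregation buffers

sc = SparkContext(conf=conf)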