Hi Andrew,

I tried many different combinations, but the UI still shows no change in the amount of shuffle bytes spilled to disk. I made sure the configurations have been applied by checking Spark UI/Environment. I only see a change in shuffle bytes spilled if I disable spark.shuffle.spill.
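For what it's worth, here is roughly how I double-checked the effective settings from inside the job itself, in addition to the UI (a minimal sketch; the app name is just a placeholder, and I am assuming sc.getConf() behaves the same way on 1.3.1):

```python
from pyspark import SparkContext

sc = SparkContext(appName="conf-check")  # placeholder app name

# Print the shuffle/storage settings the running context actually sees,
# to confirm spark-defaults.conf was picked up, independently of the UI.
for key, value in sorted(sc.getConf().getAll()):
    if key.startswith("spark.shuffle") or key.startswith("spark.storage"):
        print("%s = %s" % (key, value))

sc.stop()
```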
> On Jul 22, 2015, at 3:15 AM, Andrew Or <and...@databricks.com> wrote:
>
> Hi,
>
> The setting of 0.2 / 0.6 looks reasonable to me. Since you are not using
> caching at all, have you tried something more extreme, like 0.1 / 0.9?
> Since disabling spark.shuffle.spill didn't cause an OOM, this setting should
> be fine. Also, one thing you could do is verify the shuffle bytes spilled
> in the UI before and after the change.
>
> Let me know if that helped.
> -Andrew
>
> 2015-07-21 13:50 GMT-07:00 wdbaruni <wdbar...@gmail.com>:
> Hi
> I am testing Spark on Amazon EMR using Python and the basic wordcount
> example shipped with Spark.
>
> After running the application, I noticed that in Stage 0, reduceByKey(add),
> around 2.5GB of shuffle data is spilled to memory and 4GB to disk. Since the
> wordcount example does not cache or persist any data, I thought I could
> improve the performance of this application by giving the shuffle a larger
> memoryFraction. So, in spark-defaults.conf, I added the following:
>
> spark.storage.memoryFraction 0.2
> spark.shuffle.memoryFraction 0.6
>
> However, I am still getting the same performance, and the same amount of
> shuffle data is being spilled to disk and memory. I validated that Spark is
> reading these configurations using Spark UI/Environment, and I can see my
> changes. Moreover, I tried setting spark.shuffle.spill to false, and I got
> the performance I am looking for, with all shuffle data spilled to memory
> only.
>
> So, what am I getting wrong here, and why is the extra shuffle memory
> fraction not being utilized?
>
> *My environment:*
> Amazon EMR with Spark 1.3.1, launched with the -x argument
> 1 Master node: m3.xlarge
> 3 Core nodes: m3.xlarge
> Application: wordcount.py
> Input: 10 .gz files, 90MB each (~350MB unarchived), stored in S3
>
> *Submit command:*
> /home/hadoop/spark/bin/spark-submit --deploy-mode client /mnt/wordcount.py
> s3n://<input location>
>
> *spark-defaults.conf:*
> spark.eventLog.enabled false
> spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
> spark.driver.extraJavaOptions -Dspark.driver.log.level=INFO
> spark.master yarn
> spark.executor.instances 3
> spark.executor.cores 4
> spark.executor.memory 9404M
> spark.default.parallelism 12
> spark.eventLog.enabled true
> spark.eventLog.dir hdfs:///spark-logs/
> spark.storage.memoryFraction 0.2
> spark.shuffle.memoryFraction 0.6
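For reference, the job is essentially the standard wordcount pattern (a minimal sketch; the actual examples/src/main/python/wordcount.py shipped with Spark differs in detail):

```python
import sys
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")

lines = sc.textFile(sys.argv[1])  # e.g. the s3n:// input location
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(add))  # the Stage 0 where the shuffle spill appears

for word, count in counts.collect():
    print("%s: %i" % (word, count))

sc.stop()
```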