Hi Andrew

I tried many different combinations, but I still see no change in the amount
of shuffle bytes spilled to disk in the UI. I made sure the configurations
have been applied by checking Spark UI/Environment. I only see a change in
shuffle bytes spilled if I disable spark.shuffle.spill.
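
For reference, the effective values can also be checked from inside the
driver; a minimal sketch, assuming an existing SparkContext sc (the fallback
values below are the Spark 1.3 defaults):

# print the settings the running job actually sees
conf = sc.getConf()
print(conf.get("spark.storage.memoryFraction", "0.6"))
print(conf.get("spark.shuffle.memoryFraction", "0.2"))
print(conf.get("spark.shuffle.spill", "true"))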


> On Jul 22, 2015, at 3:15 AM, Andrew Or <and...@databricks.com> wrote:
> 
> Hi,
> 
> The setting of 0.2 / 0.6 looks reasonable to me. Since you are not using
> caching at all, have you tried something more extreme, like 0.1 / 0.9?
> Since disabling spark.shuffle.spill didn't cause an OOM, this setting should
> be fine. Also, one thing you could do is verify the shuffle bytes spilled
> on the UI before and after the change.
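> 
> For example, in spark-defaults.conf:
> 
> spark.storage.memoryFraction    0.1
> spark.shuffle.memoryFraction    0.9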
> 
> Let me know if that helped.
> -Andrew
> 
> 2015-07-21 13:50 GMT-07:00 wdbaruni <wdbar...@gmail.com 
> <mailto:wdbar...@gmail.com>>:
> Hi
> I am testing Spark on Amazon EMR using Python and the basic wordcount
> example shipped with Spark.
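> 
> The core of that example (roughly as shipped in
> examples/src/main/python/wordcount.py; a sketch, not the exact file) is:
> 
> import sys
> from operator import add
> from pyspark import SparkContext
> 
> sc = SparkContext(appName="PythonWordCount")
> lines = sc.textFile(sys.argv[1], 1)
> # split into words, pair each with 1, and sum the counts per word
> counts = lines.flatMap(lambda x: x.split(' ')) \
>               .map(lambda x: (x, 1)) \
>               .reduceByKey(add)
> for (word, count) in counts.collect():
>     print("%s: %i" % (word, count))
> sc.stop()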
> 
> After running the application, I noticed that in Stage 0, reduceByKey(add),
> around 2.5GB of shuffle data is spilled to memory and 4GB is spilled to
> disk. Since the wordcount example does not cache or persist any data, I
> thought I could improve the performance of this application by giving the
> shuffle a larger memoryFraction. So, in spark-defaults.conf, I added the
> following:
> 
> spark.storage.memoryFraction    0.2
> spark.shuffle.memoryFraction    0.6
> 
> However, I am still getting the same performance, and the same amount of
> shuffle data is being spilled to memory and disk. I validated that Spark is
> reading these configurations using Spark UI/Environment, and I can see my
> changes. Moreover, when I tried setting spark.shuffle.spill to false, I got
> the performance I am looking for, and all shuffle data was spilled to
> memory only.
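> 
> (i.e., spark.shuffle.spill    false in spark-defaults.conf)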
> 
> So, what am I getting wrong here, and why is the extra shuffle memory
> fraction not being utilized?
> 
> *My environment:*
> Amazon EMR with Spark 1.3.1, installed using the -x argument
> 1 Master node: m3.xlarge
> 3 Core nodes: m3.xlarge
> Application: wordcount.py
> Input: 10 .gz files 90MB each (~350MB unarchived) stored in S3
> 
> *Submit command:*
> /home/hadoop/spark/bin/spark-submit --deploy-mode client /mnt/wordcount.py
> s3n://<input location>
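> 
> Note: the same fractions can also be passed at submit time, and --conf
> takes precedence over spark-defaults.conf, e.g.:
> 
> /home/hadoop/spark/bin/spark-submit --deploy-mode client \
>   --conf spark.storage.memoryFraction=0.2 \
>   --conf spark.shuffle.memoryFraction=0.6 \
>   /mnt/wordcount.py s3n://<input location>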
> 
> *spark-defaults.conf:*
> spark.eventLog.enabled          false
> spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
> spark.driver.extraJavaOptions   -Dspark.driver.log.level=INFO
> spark.master                    yarn
> spark.executor.instances        3
> spark.executor.cores            4
> spark.executor.memory           9404M
> spark.default.parallelism       12
> spark.eventLog.enabled          true
> spark.eventLog.dir              hdfs:///spark-logs/
> spark.storage.memoryFraction    0.2
> spark.shuffle.memoryFraction    0.6
> 