I am looking for help tuning a Spark configuration for queries against a cluster of six machines. Each machine runs Spark 1.5.1; slaves are started on all six, and one machine additionally acts as the master/Thrift server. From Beeline I query two tables of 300M and 31M rows respectively. The same queries return up to 500M rows when run against Oracle, but Spark errors out on anything over 5.5M rows.

I believe there is an optimal memory configuration that must be set for each of the workers in our cluster, but I have not been able to determine it. Is there something better than trial and error? Are there settings to avoid, such as making sure not to set spark.driver.maxResultSize larger than spark.driver.memory? Is there a formula or set of guidelines for calculating the correct Spark configuration values from a machine's available cores and memory? (The sketch below shows the kind of rule I am imagining.)
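For concreteness, here is the kind of back-of-the-envelope calculation I have in mind. The helper and all of its constants (one core and 1 GB reserved for the OS, roughly five cores per executor, about 7% of memory set aside for off-heap overhead) are my own assumptions borrowed from general tuning write-ups, not anything I have validated on this cluster:

    # Hypothetical rule of thumb, not a validated formula; the reserved
    # resources, cores-per-executor, and overhead fraction are all assumptions.
    def suggest_executor_settings(node_cores, node_mem_gb,
                                  cores_per_executor=5,
                                  reserved_cores=1, reserved_mem_gb=1,
                                  overhead_fraction=0.07):
        """Return (executors per node, cores per executor, heap GB per executor)."""
        usable_cores = node_cores - reserved_cores   # leave a core for the OS
        executors_per_node = usable_cores // cores_per_executor
        mem_per_executor = (node_mem_gb - reserved_mem_gb) / executors_per_node
        # shave off some heap to leave room for off-heap/JVM overhead
        heap_gb = int(mem_per_executor * (1 - overhead_fraction))
        return executors_per_node, cores_per_executor, heap_gb

    # For our SUN SERVER X4-2L nodes (32 cores, 63 GB):
    print(suggest_executor_settings(32, 63))   # -> (6, 5, 9)

If a rule like this roughly holds, it would also suggest that my current spark.executor.memory of 40g on a 63 GB node is too aggressive, but I would like to hear how others actually derive these numbers.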
This is my current configuration:

BDA v3 server: SUN SERVER X4-2L
CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
CPU cores: 32
GB of memory: 63
Number of disks: 12

spark-defaults.conf:

spark.driver.memory                     20g
spark.executor.memory                   40g
spark.executor.extraJavaOptions         -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.rpc.askTimeout                    6000s
spark.rpc.lookupTimeout                 3000s
spark.driver.maxResultSize              20g
spark.rdd.compress                      true
spark.storage.memoryFraction            1
spark.core.connection.ack.wait.timeout  600
spark.akka.frameSize                    500
spark.shuffle.compress                  true
spark.shuffle.file.buffer               128k
spark.shuffle.memoryFraction            0
spark.shuffle.spill.compress            true
spark.shuffle.spill                     true

Thank you,
Chris