Hello,

Right now I'm using DataFrames to perform df1.groupBy(key).count() on one DataFrame and then join the result with another DataFrame, df2.
The first, df1, is very large (many gigabytes) compared to df2 (250 MB). I'm running this on a cluster of 5 nodes, 16 cores each, with 90 GB RAM each. The groupBy, count, and join take about 1 hour and 40 minutes, which seems very slow to me.

Currently I have set the following in my *spark-defaults.conf*:

    spark.executor.instances 24
    spark.executor.memory 10g
    spark.executor.cores 3
    spark.driver.memory 5g
    spark.sql.autoBroadcastJoinThreshold 200Mb

I have a couple of questions regarding tuning for performance as a beginner:

1. Right now I'm running Spark 1.6.0. Would moving to the Spark 2.0 Dataset (or even DataFrame) API be better?
2. What if I used RDDs instead? I know that reduceByKey performs better than groupByKey, and DataFrames don't have that method.
3. I think I can do a broadcast join and have set a threshold. Do I need to set it above the size of my second DataFrame? Do I need to explicitly call broadcast(df2)?
4. What is driver memory actually used for?
5. Can anyone point out something wrong with my tuning numbers, or any additional parameters worth checking out?

Thank you a lot!

Sincerely,
Jestin