Hello,
Right now I'm using DataFrames to perform df1.groupBy(key).count() on one
DataFrame and then join the result with another DataFrame, df2.

The first, df1, is very large (many gigabytes) compared to df2 (250 MB).
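
For concreteness, the job is roughly the following (just a sketch; the
paths, the "key" column name, and the output location are placeholders,
not my real schema):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object GroupCountJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("group-count-join"))
    val sqlContext = new SQLContext(sc)

    // df1 is many GB, df2 is ~250 MB (placeholder paths).
    val df1 = sqlContext.read.parquet("/data/df1")
    val df2 = sqlContext.read.parquet("/data/df2")

    // Count rows per key in the large DataFrame...
    val counts = df1.groupBy("key").count()

    // ...then join the per-key counts with the smaller DataFrame.
    val joined = counts.join(df2, Seq("key"))

    joined.write.parquet("/data/joined_counts")

    sc.stop()
  }
}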

I'm running this on a cluster of 5 nodes with 16 cores and 90 GB of RAM
each.
The groupBy, count, and join take about 1 hour and 40 minutes, which seems
very slow to me.

Currently I have set the following in my *spark-defaults.conf*:

spark.executor.instances                24
spark.executor.memory                   10g
spark.executor.cores                    3
spark.driver.memory                     5g
spark.sql.autoBroadcastJoinThreshold    200Mb


I have a few questions about performance tuning, as a beginner.

   1. Right now I'm running Spark 1.6.0. Would moving to the Spark 2.0
   Dataset API (or even its DataFrames) be better?
   2. What if I used RDDs instead? I know that reduceByKey is better than
   groupByKey, and DataFrames don't have that method.
   3. I think I can do a broadcast join and have set a threshold. Do I need
   to set it above the size of my second DataFrame? Do I need to explicitly
   call broadcast(df2)? (See the sketch after this list for what I mean.)
   4. What is driver memory actually used for in a job like this?
   5. Can anyone point out anything wrong with my tuning numbers, or any
   additional parameters worth checking out?
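
For question 3, this is what I mean by calling broadcast explicitly
(continuing the sketch above; I'm also not sure whether on 1.6 the
threshold has to be given in bytes, e.g. 209715200 rather than 200Mb):

import org.apache.spark.sql.functions.broadcast

// Hint that df2 is small enough to ship to every executor, instead of
// relying only on spark.sql.autoBroadcastJoinThreshold kicking in.
val joinedWithHint = counts.join(broadcast(df2), Seq("key"))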


Thanks a lot!
Sincerely,
Jestin
