Hi all,

The number of partitions greatly affects the speed and efficiency of computation, in my case with DataFrames/SparkSQL on Spark 1.4.0.
Too few partitions on large data cause OOM exceptions; too many partitions on small data add scheduling overhead and slow things down. How do you programmatically determine the optimal number of partitions and cores in Spark, as a function of:

1. available memory per core
2. number of records in the input data
3. average/maximum record size
4. cache configuration
5. shuffle configuration
6. serialization
7. etc.?

Any general best practices? (A rough sketch of the kind of heuristic I have in mind is in the P.S. below.)

Thanks!
Romi K.
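P.S. To make "programmatically" concrete, here is the naive kind of heuristic I have in mind. It is only a sketch under my own assumptions: the memory fraction, the sizes, and the helper names are placeholders I made up, not anything Spark provides.

// Rough partition-count heuristic (not a Spark API): size each partition so it
// fits comfortably in the memory available to one core, but never use fewer
// partitions than the cluster's default parallelism.
object PartitionHeuristic {

  /**
   * @param totalInputBytes     estimated (uncompressed) size of the input data
   * @param memoryPerCoreBytes  executor memory available to each task/core
   * @param memoryFraction      fraction of per-core memory one partition should
   *                            occupy (placeholder value, e.g. 0.25)
   * @param defaultParallelism  spark.default.parallelism (e.g. 2-3x total cores)
   */
  def suggestNumPartitions(totalInputBytes: Long,
                           memoryPerCoreBytes: Long,
                           memoryFraction: Double = 0.25,
                           defaultParallelism: Int = 8): Int = {
    val targetPartitionBytes = (memoryPerCoreBytes * memoryFraction).toLong
    // Enough partitions that each one stays under the target size...
    val byMemory = math.ceil(totalInputBytes.toDouble / targetPartitionBytes).toInt
    // ...but never fewer than the parallelism the cluster can actually use.
    math.max(byMemory, defaultParallelism)
  }

  def main(args: Array[String]): Unit = {
    // Example: 100 GB of input, 4 GB of memory per core, 48 cores (parallelism 96).
    val n = suggestNumPartitions(
      totalInputBytes = 100L * 1024 * 1024 * 1024,
      memoryPerCoreBytes = 4L * 1024 * 1024 * 1024,
      defaultParallelism = 96)
    println(s"Suggested number of partitions: $n")
  }
}

Obviously this ignores record counts, record size, caching, shuffle behaviour, and serialization overhead, which is exactly where I'd appreciate guidance.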