Hi all,

The number of partitions greatly affects the speed and efficiency of computation, in my case with DataFrames/SparkSQL on Spark 1.4.0.
Too few partitions on large data cause OOM exceptions; too many partitions on small data add scheduling overhead and slow things down. How do you programmatically determine the optimal number of partitions and cores in Spark, as a function of:

1. available memory per core
2. number of records in the input data
3. average/maximum record size
4. cache configuration
5. shuffle configuration
6. serialization
7. etc.?

Any general best practices? (A rough sketch of the kind of heuristic I have in mind is in the P.S. below.)

Thanks!
Romi K.
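P.S. To make "programmatically" concrete, here is the naive kind of heuristic I have in mind. It is only a sketch under my own assumptions: the memory fraction, the sizes, and the helper names are placeholders I made up, not anything Spark provides.

// Rough partition-count heuristic (not a Spark API): size each partition so it
// fits comfortably in the memory available to one core, but never use fewer
// partitions than the cluster's default parallelism.
object PartitionHeuristic {

  /**
   * @param totalInputBytes     estimated (uncompressed) size of the input data
   * @param memoryPerCoreBytes  executor memory available to each task/core
   * @param memoryFraction      fraction of per-core memory one partition should
   *                            occupy (placeholder value, e.g. 0.25)
   * @param defaultParallelism  spark.default.parallelism (e.g. 2-3x total cores)
   */
  def suggestNumPartitions(totalInputBytes: Long,
                           memoryPerCoreBytes: Long,
                           memoryFraction: Double = 0.25,
                           defaultParallelism: Int = 8): Int = {
    val targetPartitionBytes = (memoryPerCoreBytes * memoryFraction).toLong
    // Enough partitions that each one stays under the target size...
    val byMemory = math.ceil(totalInputBytes.toDouble / targetPartitionBytes).toInt
    // ...but never fewer than the parallelism the cluster can actually use.
    math.max(byMemory, defaultParallelism)
  }

  def main(args: Array[String]): Unit = {
    // Example: 100 GB of input, 4 GB of memory per core, 48 cores (parallelism 96).
    val n = suggestNumPartitions(
      totalInputBytes = 100L * 1024 * 1024 * 1024,
      memoryPerCoreBytes = 4L * 1024 * 1024 * 1024,
      defaultParallelism = 96)
    println(s"Suggested number of partitions: $n")
  }
}

Obviously this ignores record counts, record size, caching, shuffle behaviour, and serialization overhead, which is exactly where I'd appreciate guidance.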