+1
I had the exact same question, as I am working on my first Spark application. Hope someone can share some best practices.
Thanks!
Isabelle

On Tue, Sep 1, 2015 at 2:17 AM, Romi Kuntsman <r...@totango.com> wrote:
> Hi all,
>
> The number of partitions greatly affects the speed and efficiency of
> calculation, in my case in DataFrames/SparkSQL on Spark 1.4.0.
>
> Too few partitions with large data cause OOM exceptions.
> Too many partitions on small data cause delays due to overhead.
>
> How do you programmatically determine the optimal number of partitions
> and cores in Spark, as a function of:
>
> 1. available memory per core
> 2. number of records in input data
> 3. average/maximum record size
> 4. cache configuration
> 5. shuffle configuration
> 6. serialization
> 7. etc.?
>
> Any general best practices?
>
> Thanks!
>
> Romi K.
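
One rough sketch of how such a heuristic is often written (this is not from the thread itself): size each partition around a fixed byte target and never drop below a small multiple of the available cores. The ~128 MB target, the 2x-cores floor, the helper name suggestedPartitions, the record counts, and the input path below are all illustrative assumptions to be tuned against your own memory-per-core, cache, and shuffle settings.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object PartitionSizing {
      // Suggest a partition count: estimated input size divided by a target
      // partition size, clamped to at least 2x the available cores.
      def suggestedPartitions(estimatedInputBytes: Long,
                              totalCores: Int,
                              targetPartitionBytes: Long = 128L * 1024 * 1024): Int = {
        val bySize  = math.ceil(estimatedInputBytes.toDouble / targetPartitionBytes).toInt
        val byCores = totalCores * 2
        math.max(math.max(bySize, byCores), 1)
      }

      def main(args: Array[String]): Unit = {
        val sc         = new SparkContext(new SparkConf().setAppName("partition-sizing"))
        val sqlContext = new SQLContext(sc)

        // Hypothetical figures: plug in your own record count and average record size.
        val numRecords    = 50000000L
        val avgRecordSize = 200L // bytes
        val n = suggestedPartitions(numRecords * avgRecordSize, sc.defaultParallelism)

        // "/path/to/input" is a placeholder; repartition before the expensive work.
        val df = sqlContext.read.parquet("/path/to/input").repartition(n)
        println(s"Repartitioned to $n partitions")

        sc.stop()
      }
    }

A static estimate like this ignores cache, shuffle, and serialization behavior, so it is only a starting point; checking actual task and partition sizes in the Spark UI and adjusting the target is still needed.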