+1
I had the exact same question, as I am working on my first Spark application. Hope someone can share some best practices.
Thanks!
Isabelle

On Tue, Sep 1, 2015 at 2:17 AM, Romi Kuntsman <r...@totango.com> wrote:
> Hi all,
>
> The number of partitions greatly affects the speed and efficiency of
> calculation, in my case in DataFrames/SparkSQL on Spark 1.4.0.
>
> Too few partitions with large data cause OOM exceptions.
> Too many partitions on small data cause delays due to overhead.
>
> How do you programmatically determine the optimal number of partitions
> and cores in Spark, as a function of:
>
> 1. available memory per core
> 2. number of records in input data
> 3. average/maximum record size
> 4. cache configuration
> 5. shuffle configuration
> 6. serialization
> 7. etc.?
>
> Any general best practices?
>
> Thanks!
>
> Romi K.
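
One rough sketch of how such a heuristic is often written (this is not from the thread itself): size each partition around a fixed byte target and never drop below a small multiple of the available cores. The ~128 MB target, the 2x-cores floor, the helper name suggestedPartitions, the record counts, and the input path below are all illustrative assumptions to be tuned against your own memory-per-core, cache, and shuffle settings.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object PartitionSizing {
      // Suggest a partition count: estimated input size divided by a target
      // partition size, clamped to at least 2x the available cores.
      def suggestedPartitions(estimatedInputBytes: Long,
                              totalCores: Int,
                              targetPartitionBytes: Long = 128L * 1024 * 1024): Int = {
        val bySize  = math.ceil(estimatedInputBytes.toDouble / targetPartitionBytes).toInt
        val byCores = totalCores * 2
        math.max(math.max(bySize, byCores), 1)
      }

      def main(args: Array[String]): Unit = {
        val sc         = new SparkContext(new SparkConf().setAppName("partition-sizing"))
        val sqlContext = new SQLContext(sc)

        // Hypothetical figures: plug in your own record count and average record size.
        val numRecords    = 50000000L
        val avgRecordSize = 200L // bytes
        val n = suggestedPartitions(numRecords * avgRecordSize, sc.defaultParallelism)

        // "/path/to/input" is a placeholder; repartition before the expensive work.
        val df = sqlContext.read.parquet("/path/to/input").repartition(n)
        println(s"Repartitioned to $n partitions")

        sc.stop()
      }
    }

A static estimate like this ignores cache, shuffle, and serialization behavior, so it is only a starting point; checking actual task and partition sizes in the Spark UI and adjusting the target is still needed.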