Hi,

Currently the number of shuffle partitions is a config-driven parameter (SHUFFLE_PARTITIONS, exposed as spark.sql.shuffle.partitions). This means that anyone running a Spark SQL query first has to analyze which value of SHUFFLE_PARTITIONS would give the best performance for that query.
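For context, this is roughly what a user has to do by hand today. A minimal sketch, assuming a recent SparkSession API and a hypothetical registered table "events" (the app name and table are assumptions, not from any real workload):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shuffle-partitions-tuning")   // hypothetical app name
      .master("local[*]")
      .getOrCreate()

    // SHUFFLE_PARTITIONS corresponds to spark.sql.shuffle.partitions; the default is 200.
    // Today each user has to pick this number manually, per query or workload.
    spark.conf.set("spark.sql.shuffle.partitions", "5")

    // Any query that shuffles (e.g. this GROUP BY) now uses 5 reduce partitions
    // instead of the default 200.
    spark.sql("SELECT key, count(*) FROM events GROUP BY key").show()

    spark.stop()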
Shouldn't there be logic in Spark SQL that can figure out a good value automatically, while still giving preference to a user-specified value? I believe this could be worked out from the number of partitions in the original data. I ran some queries with the default shuffle-partition value (200), and when I changed the value to 5, the time taken by the query dropped by nearly 35%.

Thanks,
Chirag