Hi,

Currently, the number of shuffle partitions is a config-driven parameter
(SHUFFLE_PARTITIONS). This means that anyone running a Spark SQL query first
has to analyze which value of SHUFFLE_PARTITIONS would give the best
performance for that query.
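
For context, here is a minimal sketch of how that value has to be picked up
front today (assuming SHUFFLE_PARTITIONS corresponds to the
spark.sql.shuffle.partitions key and a SQLContext-based setup; the "employees"
table is just a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("shuffle-partitions-demo"))
    val sqlContext = new SQLContext(sc)

    // The user has to guess a good value before running the query.
    sqlContext.setConf("spark.sql.shuffle.partitions", "5")
    sqlContext.sql("SELECT dept, count(*) FROM employees GROUP BY dept").collect()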

Shouldn't there be logic in Spark SQL that can figure out the best value
automatically, while still giving preference to a user-specified value when
one is provided?
I believe this could be worked out on the basis of the number of partitions in
the original data; a rough sketch of what I mean follows below.
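
To make the idea concrete, here is the kind of heuristic I have in mind. This
is not existing Spark behavior; the function name, the cap, and the override
mechanism are purely illustrative:

    // Hypothetical helper: derive the shuffle partition count from the
    // number of partitions in the input data, unless the user has
    // explicitly set a value, in which case that value wins.
    def chooseShufflePartitions(
        inputPartitions: Int,
        userSpecified: Option[Int],
        maxPartitions: Int = 200): Int = {
      userSpecified.getOrElse(math.min(math.max(1, inputPartitions), maxPartitions))
    }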

I ran some queries with the default value (200) for shuffle partitions, and
when I changed this value to 5, the time taken by the query dropped by nearly
35%.

Thanks,
Chirag
