Hi,

Currently the number of shuffle partitions is a config-driven parameter (SHUFFLE_PARTITIONS, exposed as spark.sql.shuffle.partitions). This means that anyone running a Spark SQL query first has to analyze which value of SHUFFLE_PARTITIONS would give the best performance for that query.
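For context, this is roughly what a user has to do by hand today. A minimal sketch, assuming a recent SparkSession API and a hypothetical registered table "events" (the app name and table are assumptions, not from any real workload):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shuffle-partitions-tuning")   // hypothetical app name
      .master("local[*]")
      .getOrCreate()

    // SHUFFLE_PARTITIONS corresponds to spark.sql.shuffle.partitions; the default is 200.
    // Today each user has to pick this number manually, per query or workload.
    spark.conf.set("spark.sql.shuffle.partitions", "5")

    // Any query that shuffles (e.g. this GROUP BY) now uses 5 reduce partitions
    // instead of the default 200.
    spark.sql("SELECT key, count(*) FROM events GROUP BY key").show()

    spark.stop()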
Shouldn't there be logic in Spark SQL that can figure out a good value automatically, while still giving preference to a user-specified value? I believe this could be worked out from the number of partitions in the original data. I ran some queries with the default shuffle-partition value (200), and when I changed the value to 5, the time taken by the query dropped by nearly 35%.

Thanks,
Chirag