Hi Manoj,

You can set the number of partitions you want your SQL query to use. By
default it is 200, which is why you see that number. You can change it via
the spark.sql.shuffle.partitions property:

spark.sql.shuffle.partitions (default: 200) -- Configures the number of
partitions to use when shuffling data for joins or aggregations.
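For example (a quick sketch, assuming a SQLContext named sqlContext as in
Spark 1.2, and that you want the shuffle to match your 2 input partitions):

  // set the property programmatically before running the query
  sqlContext.setConf("spark.sql.shuffle.partitions", "2")

  // or equivalently via SQL
  sqlContext.sql("SET spark.sql.shuffle.partitions=2")

With that set, the group by should launch only as many shuffle (reduce-side)
tasks as you configured instead of the default 200.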

Thanks
Ankur

On Wed, Feb 4, 2015 at 3:41 PM, Manoj Samel <manojsamelt...@gmail.com>
wrote:

> Hi,
>
> Any thoughts on this?
>
> Most of the 200 tasks in the collect phase take less than 20 ms, and a lot
> of time is spent on scheduling them. I suspect the overall time will drop
> if the # of tasks is reduced to a much smaller number (close to the # of
> partitions?)
>
> Any help is appreciated
>
> Thanks
>
> On Wed, Feb 4, 2015 at 12:38 PM, Manoj Samel <manojsamelt...@gmail.com>
> wrote:
>
>> Spark 1.2
>> Data is read from parquet with 2 partitions and is cached as a table with
>> 2 partitions. Verified in the UI that it shows an RDD with 2 partitions &
>> that it is fully cached in memory.
>>
>> The cached data contains columns a, b, and c. Column a has ~150 distinct values.
>>
>> Next, run SQL on this table: "select a, sum(b), sum(c) from x group by a"
>>
>> The query creates 200 tasks. Further, the advanced metric "scheduler
>> delay" is a significant % of the time for most of these tasks. This seems
>> like very high overhead for a query on an RDD with 2 partitions.
>>
>> It seems that if this were run with a smaller number of tasks, the query
>> would run faster?
>>
>> Any thoughts on how to control the # of partitions for the group by (or
>> other SQL queries)?
>>
>> Thanks,
>>
>
>
