Hi,

Any thoughts on this?

Most of the 200 tasks in the collect phase take less than 20 ms each, and a
lot of the time is spent scheduling them. I suspect the overall time would
drop if the number of tasks were reduced to a much smaller number (close to
the number of partitions?).
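
In case it helps frame the question, here is a rough sketch of what I have
in mind. I'm assuming the 200 tasks come from spark.sql.shuffle.partitions
(which I believe defaults to 200) and that the cached table is registered
as "x", so please correct me if either assumption is off:

  // Sketch only: Spark 1.2, Scala shell, existing sqlContext.
  // Lower the number of post-shuffle partitions used by aggregations
  // from the default of 200 to something close to the input partition count.
  sqlContext.setConf("spark.sql.shuffle.partitions", "2")

  // Equivalently, via SQL:
  sqlContext.sql("SET spark.sql.shuffle.partitions=2")

  // Re-run the aggregation; the group-by stage should now create 2 tasks.
  val result = sqlContext.sql("select a, sum(b), sum(c) from x group by a")
  result.collect()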

Any help is appreciated

Thanks

On Wed, Feb 4, 2015 at 12:38 PM, Manoj Samel <manojsamelt...@gmail.com>
wrote:

> Spark 1.2
> Data is read from Parquet with 2 partitions and is cached as a table with 2
> partitions. Verified in the UI that it shows an RDD with 2 partitions and
> that it is fully cached in memory.
>
> Cached data contains columns a, b, and c. Column a has ~150 distinct values.
>
> Next, SQL is run on this table: "select a, sum(b), sum(c) from x group by a"
>
> The query creates 200 tasks. Further, the advanced metric "scheduler delay"
> is a significant percentage of the time for most of these tasks. This seems
> like very high overhead for a query on an RDD with 2 partitions.
>
> It seems that if this were run with a smaller number of tasks, the query
> would run faster?
>
> Any thoughts on how to control the number of partitions for the group by
> (or other SQL queries)?
>
> Thanks,
>
