Spark 1.2
Data is read from Parquet with 2 partitions and is cached as a table with 2
partitions. I verified in the UI that it shows an RDD with 2 partitions and
that it is fully cached in memory.
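
For context, this is roughly how the data is loaded and cached (a minimal
sketch from the Spark 1.2 shell; the path and the table name "x" are
placeholders):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // read the Parquet data (2 partitions on disk)
    val data = sqlContext.parquetFile("/path/to/data.parquet")
    // register it as table "x" and cache it in memory
    data.registerTempTable("x")
    sqlContext.cacheTable("x")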

The cached data contains columns a, b, and c. Column a has ~150 distinct values.

Next, I run SQL on this table: "select a, sum(b), sum(c) from x group by a"
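
In the shell that is roughly (using the table registered above):

    val result = sqlContext.sql("select a, sum(b), sum(c) from x group by a")
    // ~150 rows expected, one per distinct value of a
    result.collect().foreach(println)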

The query creates 200 tasks. Further, the advanced metric "scheduler delay"
is a significant percentage of the time for most of these tasks. This seems
like very high overhead for a query on an RDD with 2 partitions.

It seems that if this were run with a smaller number of tasks, the query
would run faster?

Any thoughts on how to control the number of partitions for the group by
(or other SQL queries)?
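
The only knob I have found so far is spark.sql.shuffle.partitions, which
appears to default to 200 (matching the task count above). Something like
this does change the number of tasks, though I am not sure it is the
intended approach:

    // lower the number of post-shuffle partitions used by aggregations
    sqlContext.setConf("spark.sql.shuffle.partitions", "2")
    // or equivalently from SQL:
    sqlContext.sql("SET spark.sql.shuffle.partitions=2")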

Thanks,
