Spark 1.2: data is read from Parquet with 2 partitions and cached as a table, also with 2 partitions. I verified in the UI that it shows an RDD with 2 partitions and that it is fully cached in memory.

The cached data contains columns a, b, c. Column a has ~150 distinct values. I then run SQL against this table (call it x): "select a, sum(b), sum(c) from x group by a". The query creates 200 tasks, and the advanced metric "scheduler delay" is a significant percentage of the time for most of these tasks. That seems like very high overhead for a query over an RDD with only 2 partitions. If this ran with fewer tasks, shouldn't the query be faster?

Any thoughts on how to control the number of partitions for the group by (or other SQL)?

Thanks,
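For what it's worth, I suspect the 200 tasks come from spark.sql.shuffle.partitions, which defaults to 200 and controls the number of post-shuffle partitions used by SQL aggregations and joins. Assuming that is the right knob here, a sketch of lowering it before running the query:

```sql
-- Assumption: the 200 tasks are the default shuffle partition count,
-- not something derived from the input RDD's 2 partitions.
SET spark.sql.shuffle.partitions = 2;

SELECT a, SUM(b), SUM(c) FROM x GROUP BY a;
```

The same setting can be applied programmatically via sqlContext.setConf("spark.sql.shuffle.partitions", "2"). Is that the intended way to tune this, or is there a better approach?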