Hi, any thoughts on this?
Most of the 200 tasks in the collect phase take less than 20 ms, and a lot
of the time is spent scheduling them. I suspect the overall time would drop
if the number of tasks were reduced to something much smaller (close to the
number of partitions?). A sketch of what I have in mind is below the quoted
message.

Any help is appreciated.

Thanks

On Wed, Feb 4, 2015 at 12:38 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:

> Spark 1.2
>
> Data is read from Parquet with 2 partitions and is cached as a table with
> 2 partitions. Verified in the UI that the RDD shows 2 partitions and that
> it is fully cached in memory.
>
> The cached data contains columns a, b, c. Column a has ~150 distinct
> values.
>
> Next, run SQL on this table as "select a, sum(b), sum(c) from x group by a".
>
> The query creates 200 tasks. Further, the advanced metric "scheduler
> delay" is a significant % for most of these tasks. This seems like very
> high overhead for a query on an RDD with 2 partitions.
>
> It seems that if this were run with fewer tasks, the query would run
> faster?
>
> Any thoughts on how to control the # of partitions for the group by (or
> other SQLs)?
>
> Thanks,
>
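
For what it's worth, here is a minimal sketch of the change I have in mind.
It assumes the 200 tasks come from Spark SQL's shuffle partition setting
(spark.sql.shuffle.partitions, which defaults to 200 and controls the number
of post-shuffle aggregation tasks); the path and app name below are
placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("GroupByExample"))
val sqlContext = new SQLContext(sc)

// Read the 2-partition Parquet data and cache it as table x, as in the
// original post. The path is a placeholder.
val data = sqlContext.parquetFile("/path/to/data.parquet")
data.registerTempTable("x")
sqlContext.cacheTable("x")

// Lower the number of shuffle tasks from the default 200 to something
// closer to the number of input partitions / distinct keys.
sqlContext.setConf("spark.sql.shuffle.partitions", "10")

// The group by should now schedule 10 tasks instead of 200.
val result = sqlContext.sql("select a, sum(b), sum(c) from x group by a")
result.collect().foreach(println)

The same setting can also be changed per session with
sqlContext.sql("SET spark.sql.shuffle.partitions=10").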