Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
Spark 1.2. Data is read from Parquet with 2 partitions and is cached as a table with 2 partitions; the UI confirms the RDD has 2 partitions and is fully cached in memory. The cached data contains columns a, b, c. Column a has ~150 distinct values. Next, run SQL on this table: select a, sum(b), ...
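
A minimal sketch of the setup described above, using the Spark 1.2 SQL API; the file path, table name "t", and columns a, b, c are assumptions taken from the message, not the poster's actual code:

    // Minimal sketch of the reported setup (Spark 1.2 API).
    // The path, table name "t", and columns a, b, c are assumptions.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("groupby-tasks"))
    val sqlContext = new SQLContext(sc)

    // Read a Parquet file (2 partitions in the original report) and cache it as a table.
    val data = sqlContext.parquetFile("/path/to/data.parquet")
    data.registerTempTable("t")
    sqlContext.cacheTable("t")

    // The shuffle behind this GROUP BY runs with spark.sql.shuffle.partitions
    // tasks (200 by default), regardless of the 2 input partitions -- which is
    // why the UI shows ~200 tasks.
    val result = sqlContext.sql("SELECT a, SUM(b) FROM t GROUP BY a")
    result.collect()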

Re: Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
Follow-up for closure on this thread: 1. spark.sql.shuffle.partitions is not on the configuration page but is mentioned in http://spark.apache.org/docs/1.2.0/sql-programming-guide.html. It would be better to have it on the configuration page as well, for the sake of completeness. Should I file a doc bug? 2. Regarding my #2 ...

Re: Large # of tasks in groupby on single table

2015-02-04 Thread Ankur Srivastava
Hi Manoj, you can set the number of partitions you want your SQL query to use. By default it is 200, which is why you see that number. You can change it via the spark.sql.shuffle.partitions property. From the docs: spark.sql.shuffle.partitions (default: 200) configures the number of partitions to use when shuffling data for ...
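
A hedged sketch of the suggestion above; the value 10 is an arbitrary example, and sqlContext is assumed to be an existing SQLContext:

    // Lower the shuffle partition count before running the query.
    sqlContext.setConf("spark.sql.shuffle.partitions", "10")

    // Equivalent via a SQL SET statement:
    sqlContext.sql("SET spark.sql.shuffle.partitions=10")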

Re: Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
Hi, any thoughts on this? Most of the 200 tasks in the collect phase take less than 20 ms, and a lot of the time is spent scheduling them. I suspect the overall time would drop if the number of tasks were reduced to a much smaller number (close to the number of partitions?). Any help is appreciated. Thanks. On Wed, Feb 4, ...
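
One way to test that hypothesis is to time the same query at the default 200 shuffle partitions and at a count close to the table's 2 input partitions. This is an illustrative sketch, not code from the thread; it reuses the table "t" from the sketch under the first message:

    // Compare wall-clock time at 200 vs. 2 shuffle partitions (values illustrative).
    for (n <- Seq("200", "2")) {
      sqlContext.setConf("spark.sql.shuffle.partitions", n)
      val start = System.nanoTime()
      sqlContext.sql("SELECT a, SUM(b) FROM t GROUP BY a").collect()
      println(s"partitions=$n took ${(System.nanoTime() - start) / 1e6} ms")
    }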