Following up to close out this thread:

1. spark.sql.shuffle.partitions is not on the configuration page, but it is mentioned on http://spark.apache.org/docs/1.2.0/sql-programming-guide.html. It would be better to have it on the configuration page as well, for completeness. Should I file a doc bug?

2. Regarding my point #2 above (Spark should auto-determine the number of tasks), the SQL programming guide already lists this under Hive optimizations not yet in Spark: "Automatically determine the number of reducers for joins and groupbys: Currently in Spark SQL, you need to control the degree of parallelism post-shuffle using "SET spark.sql.shuffle.partitions=[num_tasks];"". Any idea if and when this is scheduled? Even a rudimentary implementation (e.g. based on the number of partitions of the underlying RDD, which is available now) would be an improvement over the current fixed 200, and would be a critical feature for Spark SQL feasibility.
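To make point 2 concrete, here is a minimal sketch of the kind of caller-side heuristic I mean, not a proposed Spark implementation. It assumes sqlContext is an existing SQLContext and cachedTable is the SchemaRDD backing the cached table; those names and the query are placeholders:

    // Derive the post-shuffle parallelism from the partition count of the
    // underlying RDD instead of the fixed default of 200.
    val numParts = cachedTable.partitions.length
    sqlContext.setConf("spark.sql.shuffle.partitions", numParts.toString)

    // The group-by shuffle now runs with numParts tasks instead of 200.
    val result = sqlContext.sql("select a, sum(b), sum(c) from x group by a")
    result.collect()

Doing this by hand before every query is exactly the step I would hope Spark SQL could eventually do automatically.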
Thanks

On Wed, Feb 4, 2015 at 4:09 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:

> Awesome! By setting this, I could minimize the collect overhead, e.g. by
> setting it to the number of partitions of the RDD.
>
> Two questions:
>
> 1. I had looked for such an option in
> http://spark.apache.org/docs/latest/configuration.html but it is not
> documented there. Is this a doc bug?
> 2. Ideally the shuffle partitions should be derived from the underlying
> table(s), and an optimal number should be set for each query. Having one
> number across all queries is not ideal, nor does the consumer want to set
> it to a different number before each query. Any thoughts?
>
> Thanks!
>
> On Wed, Feb 4, 2015 at 3:50 PM, Ankur Srivastava <
> ankur.srivast...@gmail.com> wrote:
>
>> Hi Manoj,
>>
>> You can set the number of partitions you want your SQL query to use. By
>> default it is 200, which is why you see that number. You can change it
>> via the spark.sql.shuffle.partitions property:
>>
>> spark.sql.shuffle.partitions (default: 200): Configures the number of
>> partitions to use when shuffling data for joins or aggregations.
>>
>> Thanks
>> Ankur
>>
>> On Wed, Feb 4, 2015 at 3:41 PM, Manoj Samel <manojsamelt...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Any thoughts on this?
>>>
>>> Most of the 200 tasks in the collect phase take less than 20 ms, and a
>>> lot of the time is spent scheduling them. I suspect the overall time
>>> would drop if the number of tasks were reduced to a much smaller number
>>> (close to the number of partitions?).
>>>
>>> Any help is appreciated.
>>>
>>> Thanks
>>>
>>> On Wed, Feb 4, 2015 at 12:38 PM, Manoj Samel <manojsamelt...@gmail.com>
>>> wrote:
>>>
>>>> Spark 1.2
>>>> Data is read from parquet with 2 partitions and is cached as a table
>>>> with 2 partitions. Verified in the UI that it shows an RDD with 2
>>>> partitions and that it is fully cached in memory.
>>>>
>>>> The cached data contains columns a, b, c. Column a has ~150 distinct
>>>> values.
>>>>
>>>> Next, run SQL on this table as "select a, sum(b), sum(c) from table x"
>>>>
>>>> The query creates 200 tasks. Further, the advanced metric "scheduler
>>>> delay" is a significant percentage for most of these tasks. This seems
>>>> like very high overhead for a query on an RDD with 2 partitions.
>>>>
>>>> It seems that if this were run with fewer tasks, the query should run
>>>> faster?
>>>>
>>>> Any thoughts on how to control the number of partitions for the group
>>>> by (or other SQL)?
>>>>
>>>> Thanks,
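For anyone finding this thread later, the scenario above and the workaround discussed in it look roughly like this (a sketch against the Spark 1.2 API; the parquet path and the table name "x" are placeholders, and I have written the query with the group by spelled out):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc is an existing SparkContext

    // Read parquet (2 partitions in my case) and cache it as a table.
    val data = sqlContext.parquetFile("/path/to/input.parquet")
    data.registerTempTable("x")
    sqlContext.cacheTable("x")

    // Without this, the aggregation shuffle runs with the default 200 tasks.
    sqlContext.sql("SET spark.sql.shuffle.partitions=2")

    val agg = sqlContext.sql("select a, sum(b), sum(c) from x group by a")
    agg.collect()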