Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
Spark 1.2
Data is read from parquet with 2 partitions and is cached as a table with 2
partitions. Verified in the UI that it shows an RDD with 2 partitions and
that it is fully cached in memory.

The cached data contains columns a, b, and c. Column a has ~150 distinct values.

Next, run SQL on this table: select a, sum(b), sum(c) from x group by a

The query creates 200 tasks. Further, the advanced metric scheduler delay
is a significant % of the time for most of these tasks. This seems like
very high overhead for a query on an RDD with 2 partitions.

It seems that if this were run with a smaller number of tasks, the query
would run faster?

Any thoughts on how to control the # of partitions for the group by (or
other SQLs)?
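
For reference, a minimal sketch in Scala of the setup described above (Spark
1.2 API; the parquet path is a placeholder, and the table name x and columns
a, b, c are the ones from this mail):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("groupby-tasks"))
val sqlContext = new SQLContext(sc)

// Read the 2-partition parquet data, then register and cache it as table "x".
val data = sqlContext.parquetFile("/path/to/parquet")
data.registerTempTable("x")
sqlContext.cacheTable("x")

// The aggregation shuffles, so it runs with spark.sql.shuffle.partitions
// (default 200) post-shuffle tasks, regardless of the 2 input partitions.
val result = sqlContext.sql("select a, sum(b), sum(c) from x group by a")
result.collect().foreach(println)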

Thanks,


Re: Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
Following up for closure on this thread ...

1. spark.sql.shuffle.partitions is not on the configuration page but is
mentioned at http://spark.apache.org/docs/1.2.0/sql-programming-guide.html.
It would be better to have it on the configuration page as well, for the
sake of completeness. Should I file a doc bug?
2. Regarding my #2 above (Spark should auto-determine the # of tasks), there
is already a write-up on the SQL Programming page, under the Hive
optimizations not yet included in Spark: "Automatically determine the number
of reducers for joins and groupbys: Currently in Spark SQL, you need to
control the degree of parallelism post-shuffle using “SET
spark.sql.shuffle.partitions=[num_tasks];”." Any idea if and when this is
scheduled? Even a rudimentary implementation (e.g. based on the # of
partitions of the underlying RDD, which is available now; see the sketch
below) would be an improvement over the current fixed 200 and would be a
critical feature for Spark SQL feasibility.
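
(For what it's worth, a rudimentary version of that idea can already be
approximated by hand. A sketch in Scala, assuming the Spark 1.2 sqlContext
and the cached table x from the setup earlier in this thread:

// Hand-rolled stand-in for "auto-determine the number of reducers":
// derive the post-shuffle parallelism from the cached table's own
// partition count instead of the fixed default of 200.
val numPartitions = sqlContext.table("x").partitions.length // SchemaRDD is an RDD
sqlContext.setConf("spark.sql.shuffle.partitions", numPartitions.toString)

// Subsequent group bys and joins now shuffle into that many tasks.
val result = sqlContext.sql("select a, sum(b), sum(c) from x group by a")

This is per-SQLContext and still one number for all queries, so it is only a
workaround, not the per-query optimization the docs describe.)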

Thanks

On Wed, Feb 4, 2015 at 4:09 PM, Manoj Samel manojsamelt...@gmail.com
wrote:

 Awesome! By setting this, I could minimize the collect overhead, e.g. by
 setting it to the # of partitions of the RDD.

 Two questions

 1. I had looked for such an option at
 http://spark.apache.org/docs/latest/configuration.html but it is not
 documented. Seems this is a doc bug?
 2. Ideally the shuffle partitions should be derived from the underlying
 table(s), and an optimal number should be set for each query. Having one
 number across all queries is not ideal, nor does the consumer want to set
 it to a different # before each query. Any thoughts?


 Thanks !

 On Wed, Feb 4, 2015 at 3:50 PM, Ankur Srivastava 
 ankur.srivast...@gmail.com wrote:

 Hi Manoj,

 You can set the number of partitions you want your SQL query to use. By
 default it is 200, and thus you see that number. You can update it via the
 spark.sql.shuffle.partitions property:

 spark.sql.shuffle.partitions (default: 200): Configures the number of
 partitions to use when shuffling data for joins or aggregations.

 Thanks
 Ankur

 On Wed, Feb 4, 2015 at 3:41 PM, Manoj Samel manojsamelt...@gmail.com
 wrote:

 Hi,

 Any thoughts on this?

 Most of the 200 tasks in the collect phase take less than 20 ms, and a lot
 of time is spent on scheduling them. I suspect the overall time will drop
 if the # of tasks is reduced to a much smaller # (close to the # of
 partitions?)

 Any help is appreciated

 Thanks

 On Wed, Feb 4, 2015 at 12:38 PM, Manoj Samel manojsamelt...@gmail.com
 wrote:

 Spark 1.2
 Data is read from parquet with 2 partitions and is cached as a table with
 2 partitions. Verified in the UI that it shows an RDD with 2 partitions
 and that it is fully cached in memory.

 The cached data contains columns a, b, and c. Column a has ~150 distinct values.

 Next, run SQL on this table: select a, sum(b), sum(c) from x group by a

 The query creates 200 tasks. Further, the advanced metric scheduler delay
 is a significant % of the time for most of these tasks. This seems like
 very high overhead for a query on an RDD with 2 partitions.

 It seems that if this were run with a smaller number of tasks, the query
 would run faster?

 Any thoughts on how to control the # of partitions for the group by (or
 other SQLs)?

 Thanks,







Re: Large # of tasks in groupby on single table

2015-02-04 Thread Ankur Srivastava
Hi Manoj,

You can set the number of partitions you want your SQL query to use. By
default it is 200, and thus you see that number. You can update it via the
spark.sql.shuffle.partitions property:

spark.sql.shuffle.partitions (default: 200): Configures the number of
partitions to use when shuffling data for joins or aggregations.
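
For example (a sketch; assumes a Spark 1.2 SQLContext named sqlContext and
the table x from earlier in this thread):

// Two equivalent ways to change the post-shuffle parallelism
// before running a query.
sqlContext.setConf("spark.sql.shuffle.partitions", "2") // programmatic
sqlContext.sql("SET spark.sql.shuffle.partitions=2")    // as a SQL statement

// Queries issued afterwards shuffle into 2 tasks instead of the default 200.
val result = sqlContext.sql("select a, sum(b), sum(c) from x group by a")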

Thanks
Ankur

On Wed, Feb 4, 2015 at 3:41 PM, Manoj Samel manojsamelt...@gmail.com
wrote:

 Hi,

 Any thoughts on this?

 Most of the 200 tasks in the collect phase take less than 20 ms, and a lot
 of time is spent on scheduling them. I suspect the overall time will drop
 if the # of tasks is reduced to a much smaller # (close to the # of
 partitions?)

 Any help is appreciated

 Thanks

 On Wed, Feb 4, 2015 at 12:38 PM, Manoj Samel manojsamelt...@gmail.com
 wrote:

 Spark 1.2
 Data is read from parquet with 2 partitions and is cached as a table with
 2 partitions. Verified in the UI that it shows an RDD with 2 partitions
 and that it is fully cached in memory.

 The cached data contains columns a, b, and c. Column a has ~150 distinct values.

 Next, run SQL on this table: select a, sum(b), sum(c) from x group by a

 The query creates 200 tasks. Further, the advanced metric scheduler delay
 is a significant % of the time for most of these tasks. This seems like
 very high overhead for a query on an RDD with 2 partitions.

 It seems that if this were run with a smaller number of tasks, the query
 would run faster?

 Any thoughts on how to control the # of partitions for the group by (or
 other SQLs)?

 Thanks,





Re: Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
Hi,

Any thoughts on this?

Most of the 200 tasks in the collect phase take less than 20 ms, and a lot
of time is spent on scheduling them. I suspect the overall time will drop
if the # of tasks is reduced to a much smaller # (close to the # of
partitions?)

Any help is appreciated

Thanks

On Wed, Feb 4, 2015 at 12:38 PM, Manoj Samel manojsamelt...@gmail.com
wrote:

 Spark 1.2
 Data is read from parquet with 2 partitions and is cached as a table with
 2 partitions. Verified in the UI that it shows an RDD with 2 partitions
 and that it is fully cached in memory.

 The cached data contains columns a, b, and c. Column a has ~150 distinct values.

 Next, run SQL on this table: select a, sum(b), sum(c) from x group by a

 The query creates 200 tasks. Further, the advanced metric scheduler delay
 is a significant % of the time for most of these tasks. This seems like
 very high overhead for a query on an RDD with 2 partitions.

 It seems that if this were run with a smaller number of tasks, the query
 would run faster?

 Any thoughts on how to control the # of partitions for the group by (or
 other SQLs)?

 Thanks,