Following up for closure on this thread ...

1. spark.sql.shuffle.partitions is not on the configuration page but is
mentioned on http://spark.apache.org/docs/1.2.0/sql-programming-guide.html.
It would be better to have it on the configuration page as well for the
sake of completeness. Should I file a doc bug?
2. Regarding my #2 above (Spark should auto-determine the # of tasks),
there is already a note on the SQL Programming Guide page, under the Hive
optimizations that are not yet included in Spark: "Automatically determine
the number of reducers for joins and groupbys: Currently in Spark SQL, you
need to control the degree of parallelism post-shuffle using “SET
spark.sql.shuffle.partitions=[num_tasks];”." Any idea if and when this is
scheduled? Even a rudimentary implementation (e.g. based on the # of
partitions of the underlying RDD, which is available now) would be an
improvement over the current fixed 200 and would be a critical feature for
Spark SQL feasibility (a rough sketch of an interim workaround follows
below).
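
As an interim workaround, something like the following might work (a rough,
untested sketch against Spark 1.2; the Parquet path and table name are
placeholders for illustration only):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

  // Placeholder data source and table name
  val data = sqlContext.parquetFile("/path/to/data.parquet")
  data.registerTempTable("x")
  sqlContext.cacheTable("x")

  // Derive post-shuffle parallelism from the source RDD's partition count
  // instead of relying on the fixed default of 200
  sqlContext.setConf("spark.sql.shuffle.partitions",
    data.partitions.size.toString)

  val result = sqlContext.sql("select a, sum(b), sum(c) from x group by a")

This still has to be done per query/table, which is why an automatic
mechanism inside Spark SQL itself would be much better.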

Thanks

On Wed, Feb 4, 2015 at 4:09 PM, Manoj Samel <manojsamelt...@gmail.com>
wrote:

> Awesome! By setting this, I could minimize the collect overhead, e.g. by
> setting it to the # of partitions of the RDD.
>
> Two questions
>
> 1. I had looked for such an option on
> http://spark.apache.org/docs/latest/configuration.html but it is not
> documented there. Seems like a doc bug?
> 2. Ideally the shuffle partitions should be derived from the underlying
> table(s) and an optimal number should be set for each query. Having one
> number across all queries is not ideal, nor does the consumer want to set
> it to a different # before each query. Any thoughts?
>
>
> Thanks !
>
> On Wed, Feb 4, 2015 at 3:50 PM, Ankur Srivastava <
> ankur.srivast...@gmail.com> wrote:
>
>> Hi Manoj,
>>
>> You can set the number of partitions you want your SQL query to use. By
>> default it is 200, which is why you see that number. You can change it
>> via the spark.sql.shuffle.partitions property:
>>
>> spark.sql.shuffle.partitions (default: 200) - Configures the number of
>> partitions to use when shuffling data for joins or aggregations.
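>>
>> For example (a minimal, untested sketch; assumes an existing SQLContext
>> named sqlContext, and 10 is just an illustrative value):
>>
>> // Either through the configuration API ...
>> sqlContext.setConf("spark.sql.shuffle.partitions", "10")
>> // ... or as a SQL statement before running the query
>> sqlContext.sql("SET spark.sql.shuffle.partitions=10")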
>>
>> Thanks
>> Ankur
>>
>> On Wed, Feb 4, 2015 at 3:41 PM, Manoj Samel <manojsamelt...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Any thoughts on this?
>>>
>>> Most of the 200 tasks in the collect phase take less than 20 ms, and a
>>> lot of time is spent scheduling them. I suspect the overall time would
>>> drop if the # of tasks were reduced to a much smaller # (close to the #
>>> of partitions?).
>>>
>>> Any help is appreciated
>>>
>>> Thanks
>>>
>>> On Wed, Feb 4, 2015 at 12:38 PM, Manoj Samel <manojsamelt...@gmail.com>
>>> wrote:
>>>
>>>> Spark 1.2
>>>> Data is read from Parquet with 2 partitions and is cached as a table
>>>> with 2 partitions. Verified in the UI that it shows an RDD with 2
>>>> partitions and that it is fully cached in memory.
>>>>
>>>> The cached data contains columns a, b, and c. Column a has ~150
>>>> distinct values.
>>>>
>>>> Next, run SQL on this table: "select a, sum(b), sum(c) from x group by a"
>>>>
>>>> The query creates 200 tasks. Further, the advanced metric "scheduler
>>>> delay" is a significant % for most of these tasks. This seems like very
>>>> high overhead for a query on an RDD with 2 partitions.
>>>>
>>>> It seems that if this were run with a smaller number of tasks, the
>>>> query would run faster?
>>>>
>>>> Any thoughts on how to control the # of partitions for the group by
>>>> (or other SQL queries)?
>>>>
>>>> Thanks,
>>>>
>>>
>>>
>>
>
