On 6 January 2017 at 03:48, Andrew Gierth <and...@tao11.riddles.org.uk> wrote:
> Herewith a patch for doing grouping sets via hashing or mixed hashing
> and sorting.
>
> The principal objective is to pick whatever combination of grouping sets
> has an estimated size that fits in work_mem, and minimizes the number of
> sorting passes we need to do over the data, and hash those.  (Yes, this
> is a knapsack problem.)
>
> This is achieved by adding another strategy to the Agg node; AGG_MIXED
> means that both hashing and (sorted or plain) grouping happens.  In
> addition, support for multiple hash tables in AGG_HASHED mode is added.
>
> Sample plans look like this:
>
> explain select two,ten from onek group by cube(two,ten);
>                           QUERY PLAN
> --------------------------------------------------------------
>  MixedAggregate  (cost=0.00..89.33 rows=33 width=8)
>    Hash Key: two, ten
>    Hash Key: two
>    Hash Key: ten
>    Group Key: ()
>    ->  Seq Scan on onek  (cost=0.00..79.00 rows=1000 width=8)
>
> explain select two,ten from onek group by two, cube(ten,twenty);
>                           QUERY PLAN
> ---------------------------------------------------------------
>  HashAggregate  (cost=86.50..100.62 rows=162 width=12)
>    Hash Key: two, ten, twenty
>    Hash Key: two, ten
>    Hash Key: two
>    Hash Key: two, twenty
>    ->  Seq Scan on onek  (cost=0.00..79.00 rows=1000 width=12)
>
> set work_mem='64kB';
> explain select count(*) from tenk1
>   group by grouping sets (unique1,thousand,hundred,ten,two);
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  MixedAggregate  (cost=1535.39..3023.89 rows=11112 width=28)
>    Hash Key: two
>    Hash Key: ten
>    Hash Key: hundred
>    Group Key: unique1
>    Sort Key: thousand
>      Group Key: thousand
>    ->  Sort  (cost=1535.39..1560.39 rows=10000 width=20)
>          Sort Key: unique1
>          ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=20)
> (10 rows)
>
> Columns with unhashable or unsortable data types are handled
> appropriately, though obviously every individual grouping set must end
> up containing compatible column types.
>
> There is one current weakness which I don't see a good solution for: the
> planner code still has to pick a single value for group_pathkeys before
> planning the input path.  This means that we sometimes can't choose a
> minimal set of sorts, because we may have chosen a sort order for a
> grouping set that should have been hashed, for example:
>
> explain select count(*) from tenk1
>   group by grouping sets (two,unique1,thousand,hundred,ten);
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  MixedAggregate  (cost=1535.39..4126.28 rows=11112 width=28)
>    Hash Key: ten
>    Hash Key: hundred
>    Group Key: two
>    Sort Key: unique1
>      Group Key: unique1
>    Sort Key: thousand
>      Group Key: thousand
>    ->  Sort  (cost=1535.39..1560.39 rows=10000 width=20)
>          Sort Key: two
>          ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=20)
>
> (3 sorts, vs. 2 in the previous example that listed the same grouping
> sets in different order)
>
> Current status of the work: probably still needs cleanup, maybe some
> better regression tests, and Andres has expressed some concern over the
> performance impact of additional complexity in advance_aggregates; it
> would be useful if this could be tested.  But it should be reviewable
> as-is.

This doesn't apply cleanly to latest master.  Could you please post a
rebased patch?

Thanks

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to