On 6 January 2017 at 03:48, Andrew Gierth <and...@tao11.riddles.org.uk> wrote: > Herewith a patch for doing grouping sets via hashing or mixed hashing > and sorting. > > The principal objective is to pick whatever combination of grouping sets > has an estimated size that fits in work_mem, and minimizes the number of > sorting passes we need to do over the data, and hash those. (Yes, this > is a knapsack problem.) > > This is achieved by adding another strategy to the Agg node; AGG_MIXED > means that both hashing and (sorted or plain) grouping happens. In > addition, support for multiple hash tables in AGG_HASHED mode is added. > > Sample plans look like this: > > explain select two,ten from onek group by cube(two,ten); > QUERY PLAN > -------------------------------------------------------------- > MixedAggregate (cost=0.00..89.33 rows=33 width=8) > Hash Key: two, ten > Hash Key: two > Hash Key: ten > Group Key: () > -> Seq Scan on onek (cost=0.00..79.00 rows=1000 width=8) > > explain select two,ten from onek group by two, cube(ten,twenty); > QUERY PLAN > --------------------------------------------------------------- > HashAggregate (cost=86.50..100.62 rows=162 width=12) > Hash Key: two, ten, twenty > Hash Key: two, ten > Hash Key: two > Hash Key: two, twenty > -> Seq Scan on onek (cost=0.00..79.00 rows=1000 width=12) > > set work_mem='64kB'; > explain select count(*) from tenk1 > group by grouping sets (unique1,thousand,hundred,ten,two); > QUERY PLAN > ------------------------------------------------------------------------ > MixedAggregate (cost=1535.39..3023.89 rows=11112 width=28) > Hash Key: two > Hash Key: ten > Hash Key: hundred > Group Key: unique1 > Sort Key: thousand > Group Key: thousand > -> Sort (cost=1535.39..1560.39 rows=10000 width=20) > Sort Key: unique1 > -> Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=20) > (10 rows) > > Columns with unhashable or unsortable data types are handled > appropriately, though obviously every individual grouping set must end > up containing compatible column types. > > There is one current weakness which I don't see a good solution for: the > planner code still has to pick a single value for group_pathkeys before > planning the input path. This means that we sometimes can't choose a > minimal set of sorts, because we may have chosen a sort order for a > grouping set that should have been hashed, for example: > > explain select count(*) from tenk1 > group by grouping sets (two,unique1,thousand,hundred,ten); > QUERY PLAN > ------------------------------------------------------------------------ > MixedAggregate (cost=1535.39..4126.28 rows=11112 width=28) > Hash Key: ten > Hash Key: hundred > Group Key: two > Sort Key: unique1 > Group Key: unique1 > Sort Key: thousand > Group Key: thousand > -> Sort (cost=1535.39..1560.39 rows=10000 width=20) > Sort Key: two > -> Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=20) > > (3 sorts, vs. 2 in the previous example that listed the same grouping > sets in different order) > > Current status of the work: probably still needs cleanup, maybe some > better regression tests, and Andres has expressed some concern over the > performance impact of additional complexity in advance_aggregates; it > would be useful if this could be tested. But it should be reviewable > as-is.
This doesn't apply cleanly to latest master. Could you please post a rebased patch? Thanks Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers