On Mon, Jan 9, 2017 at 11:26 PM, Antonin Houska <a...@cybertec.at> wrote:
> Attached is a draft patch that lets partial aggregation happen at base
> relation level. If the relations contain relatively small number of groups,
> the number of input rows of the aggregation at the query level can be reduced
> this way.  Also, if append relation and postgres_fdw planning is enhanced
> accordingly, patch like this can let us aggregate individual tables on remote
> servers (e.g. shard nodes) and thus reduce the amount of rows subject to the
> final aggregation.

For an appendrel probably you need an ability to switch group->append
into append->group. For postgres_fdw, we already support aggregate
pushdown. But we don't support fetching partial aggregates from
foreign server. What other enhancements do you need?

>
> For example, consider query
>
> SELECT b.j, sum(a.x) FROM a, b WHERE a.i = b.j GROUP BY b.j;
>
> and tables "a"
>
>  i | x
> -------
>  1 | 3
>  1 | 4
>
> and "b"
>
>  j
> ---
>  1
>  1
>
> The base relations grouped look like
>
>  i | sum(a.x)| count(*)
> -----------------------
>  1 |       7 |       2
>
> and
>
>  j | count(*)
> -------------
>  1 |       2
>

Looks like an interesting technique.

>
> A few "footnotes":
>
> * Equivalence class {a.i, b.j} tells that "a" can be grouped by "i", besides
> the grouping of "b" which is explicitly required by GROUP BY b.j clause.)
>
> * To transfer the aggregate results to upper nodes, I introduced a concept of
> "grouped variable". Base relation has special target which the planner uses to
> generate "grouped paths". The grouped target contains one grouped variable per
> aggregate that the relation computes. During final processing of the plan
> (setrefs.c), the corresponding (partial) aggregate is restored in the query
> target if needed - typically this happens to ensure that the final aggregate
> references the output of the partial aggregate.
>
> * So far the grouped variable is only used for aggregates, but it might be
> useful for grouping expressions in the future as well. Currently the base
> relation can only be grouped by a plain Var, but it might be worth grouping it
> by generic grouping expressions of the GROUP BY clause, and using the grouped
> var mechanism to propagate the expression value to the query target.
>
>
> As for the example, the processing continues by joining the partially grouped
> sets:
>
>  i | sum(x)| count(i.*) | j | count(j.*)
> ----------------------------------------
>  1 |     7 |       2    | 1 |       3
>
>
> Before performing the final aggregation, we need to multiply sum(a.x) by
> count(j.*) because w/o the aggregation at base relation level the input
> of the query-level aggregation would look like
>
>  a.i | a.x | b.j
> ----------------
>  1   |   3 |  1
>  1   |   4 |  1
>  1   |   3 |  1
>  1   |   4 |  1
>
> In other words, grouping of the base relation "b" below the join prevents the
> join from bringing per-group input set to the aggregate input multiple
> times. To compensate for this effect, I've added a new field "aggtransmultifn"
> to pg_aggregate catalog. It "multiplies" the aggregate state w/o repeated
> processing of the same input set many times. sum() is an example of an
> aggregate that needs such processing, avg() is one that does not.

For something like product aggregation, where the product (or higher
order operations) across rows is accumulated instead of sum, mere
multiplication wouldn't help. We will need some higher order operation
to "extrapolate"
the result based on count(j.*). In fact, the multiplication factor
will depend upon the number of relations being joined E.g.
select b.j, sum(a.x) where a, b, c where a.i = b.j and a.i = c.k group by b.j

>
> The example query can eventually produce plans like
>
>                       QUERY PLAN
> ------------------------------------------------------
>  Finalize HashAggregate
>    Group Key: b.j
>    ->  Gather
>          Workers Planned: 2
>          ->  Hash Join
>                Hash Cond: (a.i = b.j)
>                ->  Partial HashAggregate
>                      Group Key: a.i
>                      ->  Parallel Seq Scan on a
>                ->  Hash
>                      ->  Partial HashAggregate
>                            Group Key: b.j
>                            ->  Parallel Seq Scan on b
>
> or
>
>                       QUERY PLAN
> ------------------------------------------------------
>  Finalize HashAggregate
>    Group Key: b.j
>    ->  Hash Join
>          Hash Cond: (a.i = b.j)
>          ->  Gather
>                Workers Planned: 2
>                ->  Partial HashAggregate
>                      Group Key: a.i
>                      ->  Parallel Seq Scan on a
>          ->  Hash
>                ->  Gather
>                      Workers Planned: 1
>                      ->  Partial HashAggregate
>                            Group Key: b.j
>                            ->  Parallel Seq Scan on b
>
>
> An obvious limitation is that neither grouping expression nor aggregate
> argument can be below the nullable side of outer join. In such a case the
> aggregate at the base relation level wouldn't receive the NULL values that it
> does receive at the query level. Also, no aggregate can reference multiple
> tables.
>
> Does this concept seem worth to continue coding?
>

May be we want to implement this technique without partial aggregation
first i.e. push down aggregation and grouping down the join tree and
then add partial aggregation steps. That might make it easy to review.
Looking at the changes in create_plain_partial_paths(), it looks like
we use this technique only in case of parallel query. I think the
technique is useful otherwise as well.

Also, if we could generalize this technique to push
aggregation/grouping upto any relation (not just base but join as
well) where it can be calculated that may be better. Trying that might
lead us to a better design; which right now is focused only on base
relations.

>
> BTW, if anyone wants to play with the current version:
>
> 1. Don't forget to initialize a new cluster (initdb) first. I decided not to
> bump CATALOG_VERSION_NO so far because it'd probably make the patch
> incompatible with master branch quite soon.
>
> 2. Only hash aggregation is implemented so far at the base relation level.
>
> 3. As for sum() aggregate, only sum(float4) is supposed to work correctly so
> far - this is related to the pg_aggregate changes mentioned above. avg()
> should work in general, and I didn't care about the other ones so far.
>
> 4. As for joins, only hash join is able to process the grouped relations. I
> didn't want to do too much coding until there's a consensus on the design.
>

Probably it's too early to review code, but ...
+     /*
+      * If no join is expected, aggregation at base relation level makes no
+      * sense. XXX Is there simpler way to find out? (We're not interested in
+      * RELOPT_OTHER_MEMBER_REL, so simple_rel_array_size does not help.)
+      */
+     for (i = 1; i < root->simple_rel_array_size; i++)
+     {
+         RelOptInfo *rel;
+
+         rel = find_base_rel(root, i);
+         if (rel->reloptkind == RELOPT_BASEREL)
+         {
+             nbaserels++;
+             /*
+              * We only want to know whether the number of relations is greater
+              * than one.
+              */
+             if (nbaserels > 1)
+                 break;
+         }
+     }

You might want to check bms_membership(root->all_baserels), instead of
this loop.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to