GitHub user jianqiao opened a pull request:
https://github.com/apache/incubator-quickstep/pull/237
QUICKSTEP-89 Add support for common subexpression.
This PR adds support for common subexpression elimination in Quickstep. In
summary we apply two kinds of optimizations:
**(1)** Query rewriting for _aggregate expressions_, e.g.
```
SELECT SUM(x), SUM(x)*2, AVG(x), COUNT(x) FROM s;
```
will be rewritten as
```
SELECT sum_x, sum_x * 2, sum_x / count_x, count_x
FROM (
SELECT SUM(x) AS sum_x, COUNT(x) AS count_x
) t;
```
**(2)** Common subexpression labeling with **per-storage-block memoization
table** to cache result column vectors to avoid redundant computation. For
example, the evaluation of expressions in
```
SELECT (x + 1) * (y + 2) * 3, (x + 1) * (y + 2), x + 1
FROM s;
```
will essentially look like:
```
t1 = x + 1
t2 = t1 * (y + 2)
Out1 = t2 * 3
Out2 = Reference(t2)
Out3 = Reference(t3)
```
Here `t1`, `t2`, `Out1`, `Out2`, `Out3` are per-storage-block
`ColumnVector`'s.
This PR together with a (later) new aggregation hash table implementation
will improve the performance of TPC-H Q01 from ~11.5s to ~5.3s, with scale
factor 100 on a cloud lab machine.
___
**Note**: For optimization (1), note that it is not free-of-cost to
"re-project" the columns, so it may not worth doing the transformation in
situations where a group-by aggregation has large number of groups. E.g., it
may actually slow down the query to rewrite
```
SELECT SUM(a), SUM(b), SUM(c), SUM(d), SUM(a)
FROM s
GROUP BY x;
```
as
```
SELECT sum_a, sum_b, sum_c, sum_d, sum_a
FROM (
SELECT SUM(a) AS sum_a, SUM(b) AS sum_b, SUM(c) AS sum_c, SUM(d) AS sum_d
FROM s
GROUP BY x;
) t;
```
if the number of distinct values of attribute x is large -- in that case,
we avoid one duplicate computation of `SUM(a)`, but introduce 5 extra expensive
column copying due to the intermediate relation `t`.
Thus currently we employ a simple heuristic that if either
_(1) The estimated number of groups for the Aggregate node is smaller than
the specified `FLAGS_reuse_aggregate_group_size_threshold`._
or
_(2) The ratio of (# of eliminable columns) to (# of original columns) is
greater than the specified `FLAGS_reuse_aggregate_ratio_threshold`._
then we perform the transformation.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/incubator-quickstep
common-subexpression
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-quickstep/pull/237.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #237
----
commit b31f0360a93727ed1250e8e3cf28c242c3d727a1
Author: Jianqiao Zhu <[email protected]>
Date: 2017-04-20T20:28:12Z
Add common-subexpression support.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---