Hello Michael Ho, Tim Armstrong, Impala Public Jenkins, Vuk Ercegovac,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/10771

to look at the new patch set (#8).

Change subject: IMPALA-110: Support for multiple DISTINCT
......................................................................

IMPALA-110: Support for multiple DISTINCT

This patch adds support for having multiple aggregate functions in a
single SELECT block that use DISTINCT over different sets of columns.

Planner design:
- The existing tree-based plan shape with a two-phased
  aggregation is maintained.
- Existing plans are not changed.
- Aggregates are grouped into 'aggregation classes' based on their
  expressions in the distinct portion which may be empty for
  non-distinct aggregates.
- The aggregation framework is generalized to simultaneously process
  multiple aggregation classes within the tree-based plan. This
  process splits the results of different aggregation classes into
  separate rows, so a final aggregation is needed to transpose the
  results into the desired form.
- Main challenge: Each aggregation class consumes and produces
  different tuples, so conceptually a union-type of tuples flows
  through the runtime. The tuple union is represented by a TupleRow
  with one tuple per aggregation class. Only one tuple in such a
  TupleRow is non-NULL.
- Backend exec nodes in the aggregation plan will be aware of this
  tuple-union either explicitly in their implementation or by relying
  on expressions that distinguish the aggregation classes.
- To distinguish the aggregation classes, e.g. in hash exchanges,
  CASE expressions are crafted to hash/group on the appropriate slots.

Deferred FE work:
- Beautify/condense the long CASE exprs
- Push applicable conjuncts into individual aggregators before
  the transposition step
- Added a few testing TODOs to reduce the size of this patch
- Decide whether we want to change existing plans to the new model

Execution design:
- Previous patches separated out aggregation logic from the exec
  node into Aggregators. This is extended to support multiple
  Aggregators per node, with different grouping and aggregating
  functions.
- There is a fast path for aggregations with only one aggregator,
  which leaves the execution essentially unchanged from before.
- When there are multiple aggregators, the first aggregation node in
  the plan replicates its input to each aggregator. The output of this
  step is rows where only a single tuple is non-null, corresponding to
  the aggregator that produced the row.
- A new expr is introduced, ValidTupleId, which takes one of these
  rows and returns which tuple is non-null.
- For additional aggregation nodes, the input is split apart into
  'mini-batches' according to which aggregator the row corresponds to.

Testing:
- Added analyzer and planner tests
- Added end-to-end queries tests
- Ran hdfs/core tests

Change-Id: I055402eaef6d81e5f70e850d9f8a621e766830a4
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/CMakeLists.txt
A be/src/exec/aggregation-node-base.cc
A be/src/exec/aggregation-node-base.h
M be/src/exec/aggregation-node.cc
M be/src/exec/aggregation-node.h
M be/src/exec/aggregator.cc
M be/src/exec/aggregator.h
M be/src/exec/exec-node.cc
M be/src/exec/grouping-aggregator-ir.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/grouping-aggregator.h
M be/src/exec/non-grouping-aggregator.cc
M be/src/exec/non-grouping-aggregator.h
M be/src/exec/streaming-aggregation-node.cc
M be/src/exec/streaming-aggregation-node.h
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
M be/src/exprs/scalar-expr.cc
A be/src/exprs/valid-tuple-id.cc
A be/src/exprs/valid-tuple-id.h
M be/src/runtime/row-batch.h
M common/thrift/Exprs.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AggregateInfo.java
M fe/src/main/java/org/apache/impala/analysis/AggregateInfoBase.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
A fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java
M fe/src/main/java/org/apache/impala/analysis/NumericLiteral.java
M fe/src/main/java/org/apache/impala/analysis/QueryStmt.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/StmtRewriter.java
M fe/src/main/java/org/apache/impala/analysis/UnionStmt.java
A fe/src/main/java/org/apache/impala/analysis/ValidTupleIdExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/planner/AggregationNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/distinct.test
A testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-limit.test
A testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-materialization.test
A testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct-predicates.test
A testdata/workloads/functional-planner/queries/PlannerTest/multiple-distinct.test
A testdata/workloads/functional-query/queries/QueryTest/multiple-distinct-aggs.test
M testdata/workloads/functional-query/queries/QueryTest/spilling-aggs.test
M tests/query_test/test_aggregation.py
52 files changed, 5,862 insertions(+), 805 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/10771/8
--
To view, visit http://gerrit.cloudera.org:8080/10771
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I055402eaef6d81e5f70e850d9f8a621e766830a4
Gerrit-Change-Number: 10771
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Marshall <thomasmarsh...@cmu.edu>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <thomasmarsh...@cmu.edu>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com>
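
Illustrative note on the execution design described in the commit message: the tuple-union rows, the ValidTupleId lookup, and the split into 'mini-batches' can be sketched roughly as below. This is a toy Python model, not the patch's C++ code. The names `make_union_row`, `valid_tuple_id`, and `split_into_mini_batches` are hypothetical, and the sketch assumes the lookup returns the position of the non-null tuple within the row, which may differ from what the real expr returns.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical stand-in for a TupleRow: one slot per aggregator,
# exactly one of which is non-None.
Row = Tuple[Optional[tuple], ...]


def make_union_row(agg_idx: int, num_aggs: int, tup: tuple) -> Row:
    """Build a row whose only non-None tuple sits at position agg_idx."""
    return tuple(tup if i == agg_idx else None for i in range(num_aggs))


def valid_tuple_id(row: Row) -> int:
    """Return the position of the single non-None tuple in the row.

    Assumption: a position is returned here; the real ValidTupleId expr
    may return a different kind of identifier.
    """
    valid = [i for i, t in enumerate(row) if t is not None]
    if len(valid) != 1:
        raise ValueError("expected exactly one non-None tuple, got %d" % len(valid))
    return valid[0]


def split_into_mini_batches(batch: Sequence[Row]) -> Dict[int, List[Row]]:
    """Group rows by the aggregator they belong to, preserving input order."""
    mini_batches = defaultdict(list)
    for row in batch:
        mini_batches[valid_tuple_id(row)].append(row)
    return dict(mini_batches)


if __name__ == "__main__":
    # Rows as they might leave the first aggregation node with two aggregators.
    batch = [
        make_union_row(0, 2, ("a", 1)),
        make_union_row(1, 2, ("b", 2)),
        make_union_row(0, 2, ("c", 3)),
    ]
    for agg_idx, rows in sorted(split_into_mini_batches(batch).items()):
        print(agg_idx, rows)
```

In this model each later aggregation node would hand the mini-batch for position i to its i-th aggregator. The single-aggregator fast path mentioned in the commit message would skip the split entirely.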