----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/18210/#review34693 -----------------------------------------------------------
The current approach has shown poor performance. You can see the current approach in the description of this issue. This patch improves the performance of distinct aggregation. Unlike the current approach, in the this patch, GlobalPlanner builds three phase plan using two hash shuffles. Then, GlobalPlanner adds an enforcer of sort aggregation to the final execution block. As a result, it can reduce significantly intermediate data volume according to the cardinality of grouping columns. This patch also allows Tajo to support multiple distinct functions. For example, the following query works well. select l_orderkey, count(distinct l_partkey), sum(distinct l_partkey) from lineitem group by l_orderkey; But, the current patch still has some limitations. The above query includes there are two count distinct functions: count(distinct), sum(distinct). They use the same distinct column 'l_partkey', so it works well. In contrast, the following case where there are two or more distinct columns is not supported yet. select l_orderkey, count(distinct l_partkey), sum(distinct l_linenumber) from lineitem group by l_orderkey; If you submit such a query, you will see the following messages: "different DISTINCT columns are not supported yet: l_partkey, l_linenumber". In order to support this kind of queries, we need additional physical executors. I'll add this feature later in another Jira issue. - Hyunsik Choi On Feb. 18, 2014, 9:03 p.m., Hyunsik Choi wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/18210/ > ----------------------------------------------------------- > > (Updated Feb. 18, 2014, 9:03 p.m.) > > > Review request for Tajo. > > > Bugs: TAJO-601 > https://issues.apache.org/jira/browse/TAJO-601 > > > Repository: tajo > > > Description > ------- > > Currently, distinct aggregation queries are executed as follows: > * the first stage: it just shuffles tuples by hashing grouping keys. > * the second stage: it sorts them and executes sort aggregation. > > This way executes queries including distinct aggregation functions with only > two stages. But, it leads to large intermediate data during shuffle phase. > > This kind of query can be rewritten as two queries: > > [Original query] > ---------- > SELECT grp1, grp2, count(*) as total, count(distinct grp3) as distinct_col > from rel1 group by grp1, grp2; > ---------- > > [Rewritten query] > ---------- > SELECT grp1, grp2, sum(cnt) as total, count(grp3) as distinct_col from ( > SELECT grp1, grp2, grp3, count(*) as cnt from rel1 group by grp1, grp2, > grp3) tmp1 group by grp1, grp2 > ) table1; > ---------- > > I'm expecting that this rewrite will significantly reduce the intermediate > data volume and query response time in most cases. > > > Diffs > ----- > > tajo-common/src/main/java/org/apache/tajo/util/TUtil.java cc694d4 > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/eval/EvalTreeUtil.java > da05739 > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/function/builtin/SumDoubleDistinct.java > PRE-CREATION > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/function/builtin/SumFloat.java > 10fd720 > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/function/builtin/SumFloatDistinct.java > PRE-CREATION > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/function/builtin/SumIntDistinct.java > PRE-CREATION > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/function/builtin/SumLongDistinct.java > PRE-CREATION > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/planner/ExprsVerifier.java > b14c448 > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/planner/LogicalPlanner.java > f7c0bfa > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/planner/PlannerUtil.java > 624518b > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/planner/PreLogicalPlanVerifier.java > 6dac031 > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/planner/global/DataChannel.java > efa1e05 > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/planner/global/GlobalPlanner.java > f390b52 > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/planner/global/MasterPlan.java > 91f658d > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/planner/physical/SeqScanExec.java > a0c0eeb > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/planner/rewrite/FilterPushDownRule.java > 399903c > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/planner/rewrite/PartitionedTableRewriter.java > e5f7fb4 > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/engine/planner/rewrite/ProjectionPushDownRule.java > 633d0c1 > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/querymaster/QueryMaster.java > ae6d5eb > > tajo-core/tajo-core-backend/src/main/java/org/apache/tajo/master/querymaster/QueryMasterManagerService.java > 3c30e38 > > tajo-core/tajo-core-backend/src/test/java/org/apache/tajo/engine/eval/TestEvalTreeUtil.java > d756242 > > tajo-core/tajo-core-backend/src/test/java/org/apache/tajo/engine/query/TestGroupByQuery.java > 1f80bce > > tajo-core/tajo-core-backend/src/test/java/org/apache/tajo/master/TestExecutionBlockCursor.java > 053c028 > > tajo-core/tajo-core-backend/src/test/java/org/apache/tajo/master/TestGlobalPlanner.java > 2d3124d > > tajo-core/tajo-core-backend/src/test/resources/queries/TestGroupByQuery/testCountDistinct.sql > 6fe604e > > tajo-core/tajo-core-backend/src/test/resources/queries/TestGroupByQuery/testCountDistinct2.sql > 6bf8a8a > > tajo-core/tajo-core-backend/src/test/resources/queries/TestGroupByQuery/testDistinctAggregation1.sql > PRE-CREATION > > tajo-core/tajo-core-backend/src/test/resources/queries/TestGroupByQuery/testDistinctAggregation2.sql > PRE-CREATION > > tajo-core/tajo-core-backend/src/test/resources/queries/TestGroupByQuery/testDistinctAggregation3.sql > PRE-CREATION > > tajo-core/tajo-core-backend/src/test/resources/queries/TestGroupByQuery/testDistinctAggregation4.sql > PRE-CREATION > > tajo-core/tajo-core-backend/src/test/resources/queries/TestGroupByQuery/testDistinctAggregation5.sql > PRE-CREATION > > tajo-core/tajo-core-backend/src/test/resources/queries/TestGroupByQuery/testDistinctAggregationWithHaving1.sql > PRE-CREATION > > tajo-core/tajo-core-backend/src/test/resources/results/TestGroupByQuery/testCountDistinct.result > f2ad32a > > tajo-core/tajo-core-backend/src/test/resources/results/TestGroupByQuery/testCountDistinct2.result > 9164120 > > tajo-core/tajo-core-backend/src/test/resources/results/TestGroupByQuery/testDistinctAggregation1.result > PRE-CREATION > > tajo-core/tajo-core-backend/src/test/resources/results/TestGroupByQuery/testDistinctAggregation2.result > PRE-CREATION > > tajo-core/tajo-core-backend/src/test/resources/results/TestGroupByQuery/testDistinctAggregation3.result > PRE-CREATION > > tajo-core/tajo-core-backend/src/test/resources/results/TestGroupByQuery/testDistinctAggregation4.result > PRE-CREATION > > tajo-core/tajo-core-backend/src/test/resources/results/TestGroupByQuery/testDistinctAggregation5.result > PRE-CREATION > > tajo-core/tajo-core-backend/src/test/resources/results/TestGroupByQuery/testDistinctAggregationWithHaving1.result > PRE-CREATION > > Diff: https://reviews.apache.org/r/18210/diff/ > > > Testing > ------- > > mvn clean install > > > Thanks, > > Hyunsik Choi > >
