[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it is always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The elapsed time for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Reviewed-on: http://gerrit.cloudera.org:8080/16942 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,275 insertions(+), 355 deletions(-) Approvals: Impala Public Jenkins: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 16 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 15: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 15 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 22 Jan 2021 05:31:35 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 15: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 15 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 21 Jan 2021 23:55:18 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 15: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6858/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 15 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 21 Jan 2021 23:55:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 14: Code-Review+2 BE changes look good to me -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 14 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 21 Jan 2021 23:50:03 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 14: Code-Review+1 Planner changes lgtm. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 14 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 20 Jan 2021 18:13:54 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 14: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8021/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 14 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 19 Jan 2021 20:01:39 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 11: (11 comments) http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@9 PS11, Line 9: when > nit: remove Done http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@22 PS11, Line 22: limit > did you mean partition limit >= order by limit ? Done http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@32 PS11, Line 32: was > nit: 'is' Done http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java: http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@507 PS11, Line 507: upper top-n. > Since we do a distributed top-n, there are 3 top-n's in the plan and the 'u Good point - used final/analytic instead of upper/lower. http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@511 PS11, Line 511: doesn't matter > nit: worth clarifying that it doesn't matter for the purpose of the pushdow Done http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@530 PS11, Line 530: include all of the rows in the final > This does not literally mean all rows in the final partition right ? Should Yup that's true. http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@531 PS11, Line 531: was > nit: 'we' Done http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@538 PS11, Line 538:* TODO: this should also return the amount that needs to be added to the limit I already did this, removed TODO http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@603 PS11, Line 603: if (analyticLimit < limit) return falseStatus; > One special case where this could work is if each partition had a maximum o I think the partitioned top-n node would let us correctly apply the limit at runtime, if we wanted to further optimize this (i.e. I think at any point during the partitioned top-n, you could discard any in-memory partitions that didn't include the first N rows) http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java@83 PS11, Line 83: row.s > nit: 'rows' Done http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test File testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test: http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test@1139 PS11, Line 1139: # rank() predicate is not pushed down because TOPN_BYTES_LIMIT prevents conversion > The plan shows the lower top-n which indicates the rank was pushed down. Pe Thanks for catching this. The problem was that the table only had 8 rows, so the estimate for the top n was 18B * 8 = 144B, under the threshold. I switched to a different table with more rows and it now shows the expected behavior -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 11 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 19 Jan 2021 19:40:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#14). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it is always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The elapsed time for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,275 insertions(+), 355 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/14 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 14 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 11: (12 comments) Sending comments based on 1st pass of the planner changes. http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@9 PS11, Line 9: when nit: remove http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@22 PS11, Line 22: limit did you mean partition limit >= order by limit ? http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@32 PS11, Line 32: was nit: 'is' http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@40 PS11, Line 40: This patch implements tie handling in the backend (I took most I had previously wondered about how the planner and backend work for this functionality would be combined but it now starts falling into place with the handling of the duplicates :-) http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@68 PS11, Line 68: The for nit: 'The elapsed time for..' ? http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java: http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@507 PS11, Line 507: upper top-n. Since we do a distributed top-n, there are 3 top-n's in the plan and the 'upper top-n' may be confusing. Here it refers to the outermost top-n or final top-n or something similar. http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@511 PS11, Line 511: doesn't matter nit: worth clarifying that it doesn't matter for the purpose of the pushdown decision. http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@530 PS11, Line 530: include all of the rows in the final This does not literally mean all rows in the final partition right ? Should it be all eligible rows or all relevant rows ? (based on the P+N value) http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@531 PS11, Line 531: was nit: 'we' http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@603 PS11, Line 603: if (analyticLimit < limit) return falseStatus; One special case where this could work is if each partition had a maximum of analyticLimit rows. Then we know we are not excluding rows as we iterate through the partitions until we reach the final limit. Of course, this knowledge is not readily available or may not be relied upon due to ndv estimates but in practice I suspect this may not be uncommon. For this patch, it makes sense to be conservative in applying the optimization. http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java@83 PS11, Line 83: row.s nit: 'rows' http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test File testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test: http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test@1139 PS11, Line 1139: # rank() predicate is not pushed down because TOPN_BYTES_LIMIT prevents conversion The plan shows the lower top-n which indicates the rank was pushed down. Perhaps the top_bytes_limit should be even smaller if this is supposed to be a negative test. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 11 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sun, 17 Jan 2021 00:36:51 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 13: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8011/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 13 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 16 Jan 2021 02:09:23 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 12: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8010/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 12 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 16 Jan 2021 02:09:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 12: (1 comment) http://gerrit.cloudera.org:8080/#/c/16942/12/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: http://gerrit.cloudera.org:8080/#/c/16942/12/fe/src/main/java/org/apache/impala/planner/SortNode.java@283 PS12, Line 283: (includeTies_ && limitWithTies_ >= 0), "Top-N must have limit", debugString()); line too long (91 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 12 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 16 Jan 2021 01:48:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 13: (10 comments) http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc File be/src/exec/topn-node-ir.cc: http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@44 PS11, Line 44: priority_queue_.Push(insert_tuple); > Might be useful to explicitly mention here that we don't have to worry abou Done http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@91 PS11, Line 91: else { : // 'materialized_tuple' needs to be > I might move this above the DCHECK, to make it clearer its referring to the Done http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@93 PS11, Line 93: from the heap. : DCHECK_LT(cmp_result > Not sure what this is supposed to mean, since my understanding would be tha I'm not sure exactly what I was getting at, I think I was just restating the invariant that everything in 'overflowed_ties_' was equal with the head of the queue, so it didn't add much. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@97 PS11, Line 97: > Might be worth explicitly saying that 'top_tuple' is what we just popped of Done http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@100 PS11, Line 100: node->tuple_row_less_than_->Compare(top_tuple, priority_queue_.Top()) == 0) { > I find this comment a little confusing, since we've already done the one Po Thanks, that's clearer. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h File be/src/exec/topn-node.h: http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@86 PS11, Line 86: /// TODO: currently we only support a single top-n heap per operator. IMPALA-9979 will > nit: its a little confusing to call out 'unpartitioned mode' specifically w Yeah I had to pull this comment out of the partitioned mode patch and didn't want to create too many merge conflicts by editing comments. I added a TODO as suggested. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@180 PS11, Line 180: RuntimeProfile::Counter* tuple_pool_reclaim_counter_= nullptr; > The counters here and below aren't used in this patch Done http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@200 PS11, Line 200: n data structure use > nit: not in this patch Done http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@256 PS11, Line 256: const int64_t capacity_; > If I understand correctly, this function only gets called once the heap has Done. Updated comment and added DCHECK. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.cc File be/src/exec/topn-node.cc: http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.cc@306 PS11, Line 306: heap is at > I think this is a typo? i.e. these two words should be removed Done -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 13 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 16 Jan 2021 01:48:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#13). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,276 insertions(+), 354 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/13 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 13 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#12). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,275 insertions(+), 354 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/12 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 12 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 11: (10 comments) I've only looked at the BE part and tests so far, but I can try to wrap my head around the FE changes too if its helpful. I think the BE logic looks good, but since its complex it took me awhile to convince myself it was correct, so most of my suggestions are around cleaning up comments http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc File be/src/exec/topn-node-ir.cc: http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@44 PS11, Line 44: priority_queue_.Push(insert_tuple); Might be useful to explicitly mention here that we don't have to worry about ties going into the heap at this point, as we'll handle that with the complicated logic in InsertTupleWithTieHandling later if necessary. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@91 PS11, Line 91: // 'materialized_tuple' needs to be added. Figure out which other tuples, if any, : // need to be removed from the heap. I might move this above the DCHECK, to make it clearer its referring to the entire 'else' branch. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@93 PS11, Line 93: It will compare equal with any overflowed ties at the back of : // 'priority_queue_' Not sure what this is supposed to mean, since my understanding would be that 'overflowed' ties aren't in 'priority_queue_'. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@97 PS11, Line 97: 'top_tuple' Might be worth explicitly saying that 'top_tuple' is what we just popped off the heap. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@100 PS11, Line 100: // Removing all the ties would bring the heap below its capacity. We need to I find this comment a little confusing, since we've already done the one Pop() we need to do, its unclear why we would think about doing another Pop() and bring the heap below capacity just because the new top is tied with the old one. I think it would be better to say something along the lines of "The new top is still tied with the tuples in 'overflowed_ties_' so don't clear it, just add the previous top as another overflowed tuple" http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h File be/src/exec/topn-node.h: http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@86 PS11, Line 86: /// Unpartitioned TopN Implementation Details nit: its a little confusing to call out 'unpartitioned mode' specifically when its the only mode currently supported. I guess it doesn't matter if the partitioned mode patch is going in soon, but you could add a "TODO: implement partitioned mode" or something. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@180 PS11, Line 180: RuntimeProfile::Counter* num_partitions_counter_ = nullptr; The counters here and below aren't used in this patch http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@200 PS11, Line 200: GetNextPartitioned() nit: not in this patch http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@256 PS11, Line 256: /// Helper to insert tuple row into the priority queue with tie handling. If I understand correctly, this function only gets called once the heap has reached capacity. I suppose that's sort of implicit, since that's when tie handling starts to matter, but might be useful to call this out explicitly. http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.cc File be/src/exec/topn-node.cc: http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.cc@306 PS11, Line 306: addition to I think this is a typo? i.e. these two words should be removed -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 11 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 16 Jan 2021 00:12:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 11: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/8000/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 11 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 14 Jan 2021 07:02:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7997/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 10 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 14 Jan 2021 06:53:08 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java@283 PS11, Line 283: (includeTies_ && limitWithTies_ >= 0), "Top-N must have limit", debugString()); line too long (91 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 11 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 14 Jan 2021 06:40:35 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#11). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,285 insertions(+), 353 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/11 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 11 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/16942/10/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: http://gerrit.cloudera.org:8080/#/c/16942/10/fe/src/main/java/org/apache/impala/planner/SortNode.java@283 PS10, Line 283: (includeTies_ && limitWithTies_ >= 0), "Top-N must have limit", debugString()); line too long (91 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 10 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 14 Jan 2021 06:32:36 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#10). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,278 insertions(+), 348 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/10 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 10 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 9: Thanks for fixing this important scenario. I will try and review the planner changes in the next day or so. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 9 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 14 Jan 2021 01:16:00 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7993/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 9 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 14 Jan 2021 01:11:14 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/16942/8/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java: http://gerrit.cloudera.org:8080/#/c/16942/8/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@585 PS8, Line 585: if (!(pred.getOp() == BinaryPredicate.Operator.EQ || > I think we need to adjust the handling of EQ here, I don't think we can pus Done -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 8 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 14 Jan 2021 00:49:04 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#9). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,232 insertions(+), 344 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/9 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 9 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/16942/8/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java: http://gerrit.cloudera.org:8080/#/c/16942/8/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@585 PS8, Line 585: if (!(pred.getOp() == BinaryPredicate.Operator.EQ || I think we need to adjust the handling of EQ here, I don't think we can push when the value is >= 1 since the predicate will filter out an arbitrary number of rows (maybe you can be clever in some cases, but I think that's generally true). E.g. select * from ( select d_date, i_item_id, ss_list_price, rank() over (partition by d_date order by ss_list_price desc) rnk from store_sales ss join item i on ss_item_sk = i_item_sk join date_dim d on ss_sold_date_sk = d_date_sk where ss_list_price is not null) v where rnk = 100 order by d_date, i_item_id limit 50 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 8 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 13 Jan 2021 22:51:10 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 8: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7989/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 8 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 13 Jan 2021 18:19:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#8). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,143 insertions(+), 343 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/8 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 8 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 7: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7988/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 7 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 13 Jan 2021 05:35:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/16942/7/tests/query_test/test_limit_pushdown_analytic.py File tests/query_test/test_limit_pushdown_analytic.py: http://gerrit.cloudera.org:8080/#/c/16942/7/tests/query_test/test_limit_pushdown_analytic.py@25 PS7, Line 25: class TestLimitPushdownAnalyticTpch(ImpalaTestSuite): flake8: E302 expected 2 blank lines, found 1 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 7 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 13 Jan 2021 05:14:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#7). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,143 insertions(+), 344 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/7 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 7 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 6: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7987/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 6 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 13 Jan 2021 03:25:03 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 6: (1 comment) http://gerrit.cloudera.org:8080/#/c/16942/6/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: http://gerrit.cloudera.org:8080/#/c/16942/6/fe/src/main/java/org/apache/impala/planner/SortNode.java@158 PS6, Line 158:* This does the conversion to top-n if the converted node would pass the TOPN_BYTES_LIMIT line too long (92 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 6 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 13 Jan 2021 03:03:50 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#6). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira TODO: * Add some end-to-end tests that stress tie-handling more Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 15 files changed, 993 insertions(+), 341 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/6 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 6 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 5: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7983/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 5 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 13 Jan 2021 02:47:48 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/16942/5/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: http://gerrit.cloudera.org:8080/#/c/16942/5/fe/src/main/java/org/apache/impala/planner/SortNode.java@158 PS5, Line 158:* This does the conversion to top-n if the converted node would pass the TOPN_BYTES_LIMIT line too long (92 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 5 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 13 Jan 2021 02:26:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#5). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira TODO: * Add some end-to-end tests that stress tie-handling more Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. TODO: check for changes in regular top-n performance. Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 15 files changed, 990 insertions(+), 341 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/5 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 5 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7981/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 12 Jan 2021 20:14:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7980/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 12 Jan 2021 20:09:11 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Hello Aman Sinha, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#4). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). TODO: apply top-n threshold to avoid increasing limit enormously Testing: * Add new planner test with negative case where it's rejected * Update other planner tests to reflect new limits + tie handling * Add some end-to-end tests that repro bugs TODO: * Add planner tests for edge cases with limits - =, <, etc * Add planner test for row_number() * Add planner test for very high rank predicate that overflows int * Add some end-to-end tests that stress tie-handling more * test perf - q67 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 15 files changed, 790 insertions(+), 329 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/4 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. Patch Set 3: I finally got around to fixing this - it required grabbing the tie handling logic from https://gerrit.cloudera.org/#/c/16242/ to fix it correctly. I still need to add some additional tests, noted in the commit message, but I thought it was in a good state to get feedback on the approach -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 12 Jan 2021 19:47:39 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present
Tim Armstrong has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/16942 ) Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present .. IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). TODO: apply top-n threshold to avoid increasing limit enormously Testing: * Add new planner test with negative case where it's rejected * Update other planner tests to reflect new limits + tie handling * Add some end-to-end tests that repro bugs TODO: * Add planner tests for edge cases with limits - =, <, etc * Add planner test for row_number() * Add planner test for very high rank predicate that overflows int * Add some end-to-end tests that stress tie-handling more * test perf - q67 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/analysis/AnalyticWindow.java M fe/src/main/java/org/apache/impala/analysis/SlotRef.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 17 files changed, 798 insertions(+), 335 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/3 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong