Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/16942 to look at the new patch set (#11). Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are present ...................................................................... IMPALA-10296: Fix analytic limit pushdown when predicates are present This fixes the analytic push down optimization for the case when where the ORDER BY expressions are compatible with the partitioning of the analytic *and* there is a rank() or row_number() predicate. In this case the rows returned are going to come from the first partitions, i.e. if the limit is 100, if we go through the partitions in order until the row count adds up to 100, then we know that the rows must come from those partitions. The problem is that predicates can discard rows from the partitions, meaning that a limit naively pushed down to the top-n will filter out rows that could be returned from the query. We can avoid the problem in the case where the partition limit >= limit order by limit, however. In this case the relevant set of partitions is the set of partitions that include the first <limit> rows, since the top-level limit generally kicks in before the per-partition limit. The only twist is that the orderings may be different within a partition, so we need to make sure to include all of the rows in the final partition. The solution implemented in this patch is to increase the pushed down limit so that it was always guaranteed to include all of the rows in the final partition to be returned. E.g. if you had a row_number() <= 100 predicate and limit 100, if you pushed down limit 200, then you'd be guaranteed to capture all of the rows in the final partition. One case we need to handle is that, in the case of a rank() predicate, we can have more than that number of rows in the partition because of ties. This patch implements tie handling in the backend (I took most of that implementation from my in-progress partitioned top-n patch, with the intention of rebasing that onto this patch). This also adds a check against TOPN_BYTES_LIMIT so that the limit can't be increased to an arbitarily large value. Testing: * Add new planner test with negative case where it's rejected because the transformation is incorrect. * Update other planner tests to reflect new limit calculation + tie handling required for correctness. * Add planner test for very high rank predicate that overflows int32 * Add planner test that checks TOPN_BYTES_LIMIT handling * Add planner test that checks that dense_rank() can't be pushed. * Existing planner tests already have adequate coverage for predicates : <=, <, = and row_number(). * Add some end-to-end tests that repro bugs that fall under the jira * Add an end-to-end test on TPC-H with more data to exercise the tie-handling logic in the execnode more. Perf: Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was no measurable change in performance as a result of this patchset. Ran TPC-H scale 30 on a single node, no significant perf change. Ran a targeted query to check for regressions in the top-n node. The for this targeted query did not change: use tpch30_parquet; set mt_dop=1; select l_extendedprice from lineitem order by 1 limit 100 Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 --- M be/src/exec/topn-node-ir.cc M be/src/exec/topn-node.cc M be/src/exec/topn-node.h M be/src/util/tuple-row-compare.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test M testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test M testdata/workloads/tpch/queries/limit-pushdown-analytic.test M tests/query_test/test_limit_pushdown_analytic.py 16 files changed, 1,285 insertions(+), 353 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/11 -- To view, visit http://gerrit.cloudera.org:8080/16942 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509 Gerrit-Change-Number: 16942 Gerrit-PatchSet: 11 Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>