[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >=
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it is always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The elapsed time for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Reviewed-on: http://gerrit.cloudera.org:8080/16942
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,275 insertions(+), 355 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 16
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim 

[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 15: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 15
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 22 Jan 2021 05:31:35 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 15: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 15
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 23:55:18 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 15:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6858/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 15
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 23:55:19 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-21 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 14: Code-Review+2

BE changes look good to me


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 21 Jan 2021 23:50:03 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-20 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 14: Code-Review+1

Planner changes lgtm.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 20 Jan 2021 18:13:54 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 14:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8021/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 19 Jan 2021 20:01:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 11:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@9
PS11, Line 9: when
> nit: remove
Done


http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@22
PS11, Line 22: limit
> did you mean partition limit >= order by limit ?
Done


http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@32
PS11, Line 32: was
> nit: 'is'
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@507
PS11, Line 507: upper top-n.
> Since we do a distributed top-n, there are 3 top-n's in the plan and the 'u
Good point - used final/analytic instead of upper/lower.


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@511
PS11, Line 511: doesn't matter
> nit: worth clarifying that it doesn't matter for the purpose of the pushdow
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@530
PS11, Line 530: include all of the rows in the final
> This does not literally mean all rows in the final partition right ? Should
Yup that's true.


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@531
PS11, Line 531: was
> nit: 'we'
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@538
PS11, Line 538:* TODO: this should also return the amount that needs to be 
added to the limit
I already did this, removed TODO


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@603
PS11, Line 603: if (analyticLimit < limit) return falseStatus;
> One special case where this could work is if each partition had a maximum o
I think the partitioned top-n node would let us correctly apply the limit at 
runtime, if we wanted to further optimize this (i.e. I think at any point 
during the partitioned top-n, you could discard any in-memory partitions that 
didn't include the first N rows)


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java@83
PS11, Line 83: row.s
> nit: 'rows'
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test:

http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test@1139
PS11, Line 1139: # rank() predicate is not pushed down because TOPN_BYTES_LIMIT 
prevents conversion
> The plan shows the lower top-n which indicates the rank was pushed down. Pe
Thanks for catching this. The problem was that the table only had 8 rows, so 
the estimate for the top n was 18B * 8 = 144B, under the threshold. I switched 
to a different table with more rows and it now shows the expected behavior



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 19 Jan 2021 19:40:05 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-19 Thread Tim Armstrong (Code Review)
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#14).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >=
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it is always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The elapsed time for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,275 insertions(+), 355 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/14
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-16 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 11:

(12 comments)

Sending comments based on 1st pass of the planner changes.

http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@9
PS11, Line 9: when
nit: remove


http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@22
PS11, Line 22: limit
did you mean partition limit >= order by limit ?


http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@32
PS11, Line 32: was
nit: 'is'


http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@40
PS11, Line 40: This patch implements tie handling in the backend (I took most
I had previously wondered about how the planner and backend  work for this 
functionality would be combined but it now starts falling into place with the 
handling of the duplicates :-)


http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@68
PS11, Line 68: The for
nit: 'The elapsed time for..' ?


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@507
PS11, Line 507: upper top-n.
Since we do a distributed top-n, there are 3 top-n's in the plan and the 'upper 
top-n' may be confusing. Here it refers to the outermost top-n or final top-n 
or something  similar.


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@511
PS11, Line 511: doesn't matter
nit: worth clarifying that it doesn't matter for the purpose of the pushdown 
decision.


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@530
PS11, Line 530: include all of the rows in the final
This does not literally mean all rows in the final partition right ? Should it 
be all eligible rows or all relevant rows ? (based on the P+N value)


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@531
PS11, Line 531: was
nit: 'we'


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@603
PS11, Line 603: if (analyticLimit < limit) return falseStatus;
One special case where this could work is if each partition had a maximum of 
analyticLimit rows.  Then we know we are not excluding rows as we iterate 
through the partitions until we reach the final limit.  Of course, this 
knowledge is not readily available or may not be relied upon due to ndv 
estimates but in practice I suspect this may not be uncommon.
For this patch, it makes sense to be conservative in applying the optimization.


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java@83
PS11, Line 83: row.s
nit: 'rows'


http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test:

http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test@1139
PS11, Line 1139: # rank() predicate is not pushed down because TOPN_BYTES_LIMIT 
prevents conversion
The plan shows the lower top-n which indicates the rank was pushed down. 
Perhaps the top_bytes_limit should be even smaller if this is supposed to be a 
negative test.



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sun, 17 Jan 2021 00:36:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 13:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8011/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 13
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 16 Jan 2021 02:09:23 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 12:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8010/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 12
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 16 Jan 2021 02:09:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16942/12/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/12/fe/src/main/java/org/apache/impala/planner/SortNode.java@283
PS12, Line 283: (includeTies_ && limitWithTies_ >= 0), "Top-N must 
have limit", debugString());
line too long (91 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 12
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 16 Jan 2021 01:48:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 13:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc
File be/src/exec/topn-node-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@44
PS11, Line 44: priority_queue_.Push(insert_tuple);
> Might be useful to explicitly mention here that we don't have to worry abou
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@91
PS11, Line 91: else {
 : // 'materialized_tuple' needs to be
> I might move this above the DCHECK, to make it clearer its referring to the
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@93
PS11, Line 93:  from the heap.
 : DCHECK_LT(cmp_result
> Not sure what this is supposed to mean, since my understanding would be tha
I'm not sure exactly what I was getting at, I think I was just restating the 
invariant that everything in 'overflowed_ties_' was equal with the head of the 
queue, so it didn't add much.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@97
PS11, Line 97:
> Might be worth explicitly saying that 'top_tuple' is what we just popped of
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@100
PS11, Line 100: node->tuple_row_less_than_->Compare(top_tuple, 
priority_queue_.Top()) == 0) {
> I find this comment a little confusing, since we've already done the one Po
Thanks, that's clearer.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h
File be/src/exec/topn-node.h:

http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@86
PS11, Line 86: /// TODO: currently we only support a single top-n heap per 
operator. IMPALA-9979 will
> nit: its a little confusing to call out 'unpartitioned mode' specifically w
Yeah I had to pull this comment out of the partitioned mode patch and didn't 
want to create too many merge conflicts by editing comments. I added a TODO as 
suggested.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@180
PS11, Line 180:   RuntimeProfile::Counter* tuple_pool_reclaim_counter_= nullptr;
> The counters here and below aren't used in this patch
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@200
PS11, Line 200: n data structure use
> nit: not in this patch
Done


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@256
PS11, Line 256:   const int64_t capacity_;
> If I understand correctly, this function only gets called once the heap has
Done. Updated comment and added DCHECK.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.cc
File be/src/exec/topn-node.cc:

http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.cc@306
PS11, Line 306: heap is at
> I think this is a typo? i.e. these two words should be removed
Done



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 13
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 16 Jan 2021 01:48:13 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-15 Thread Tim Armstrong (Code Review)
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#13).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,276 insertions(+), 354 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/13
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 13
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-15 Thread Tim Armstrong (Code Review)
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#12).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,275 insertions(+), 354 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/12
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 12
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-15 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 11:

(10 comments)

I've only looked at the BE part and tests so far, but I can try to wrap my head 
around the FE changes too if its helpful.

I think the BE logic looks good, but since its complex it took me awhile to 
convince myself it was correct, so most of my suggestions are around cleaning 
up comments

http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc
File be/src/exec/topn-node-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@44
PS11, Line 44: priority_queue_.Push(insert_tuple);
Might be useful to explicitly mention here that we don't have to worry about 
ties going into the heap at this point, as we'll handle that with the 
complicated logic in InsertTupleWithTieHandling later if necessary.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@91
PS11, Line 91: // 'materialized_tuple' needs to be added. Figure out which 
other tuples, if any,
 : // need to be removed from the heap.
I might move this above the DCHECK, to make it clearer its referring to the 
entire 'else' branch.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@93
PS11, Line 93: It will compare equal with any overflowed ties at the back of
 : // 'priority_queue_'
Not sure what this is supposed to mean, since my understanding would be that 
'overflowed' ties aren't in 'priority_queue_'.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@97
PS11, Line 97: 'top_tuple'
Might be worth explicitly saying that 'top_tuple' is what we just popped off 
the heap.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node-ir.cc@100
PS11, Line 100:   // Removing all the ties would bring the heap below its 
capacity. We need to
I find this comment a little confusing, since we've already done the one Pop() 
we need to do, its unclear why we would think about doing another Pop() and 
bring the heap below capacity just because the new top is tied with the old one.

I think it would be better to say something along the lines of "The new top is 
still tied with the tuples in 'overflowed_ties_' so don't clear it, just add 
the previous top as another overflowed tuple"


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h
File be/src/exec/topn-node.h:

http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@86
PS11, Line 86: /// Unpartitioned TopN Implementation Details
nit: its a little confusing to call out 'unpartitioned mode' specifically when 
its the only mode currently supported. I guess it doesn't matter if the 
partitioned mode patch is going in soon, but you could add a "TODO: implement 
partitioned mode" or something.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@180
PS11, Line 180:   RuntimeProfile::Counter* num_partitions_counter_ = nullptr;
The counters here and below aren't used in this patch


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@200
PS11, Line 200: GetNextPartitioned()
nit: not in this patch


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.h@256
PS11, Line 256:   /// Helper to insert tuple row into the priority queue with 
tie handling.
If I understand correctly, this function only gets called once the heap has 
reached capacity. I suppose that's sort of implicit, since that's when tie 
handling starts to matter, but might be useful to call this out explicitly.


http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.cc
File be/src/exec/topn-node.cc:

http://gerrit.cloudera.org:8080/#/c/16942/11/be/src/exec/topn-node.cc@306
PS11, Line 306: addition to
I think this is a typo? i.e. these two words should be removed



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 16 Jan 2021 00:12:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 11:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8000/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 14 Jan 2021 07:02:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7997/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 14 Jan 2021 06:53:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java@283
PS11, Line 283: (includeTies_ && limitWithTies_ >= 0), "Top-N must 
have limit", debugString());
line too long (91 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 14 Jan 2021 06:40:35 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Tim Armstrong (Code Review)
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#11).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,285 insertions(+), 353 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/11
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16942/10/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/10/fe/src/main/java/org/apache/impala/planner/SortNode.java@283
PS10, Line 283: (includeTies_ && limitWithTies_ >= 0), "Top-N must 
have limit", debugString());
line too long (91 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 14 Jan 2021 06:32:36 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Tim Armstrong (Code Review)
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#10).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,278 insertions(+), 348 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/10
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 9:

Thanks for fixing this important scenario. I will try and review the planner 
changes in the next day or so.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 9
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 14 Jan 2021 01:16:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7993/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 9
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 14 Jan 2021 01:11:14 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16942/8/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/8/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@585
PS8, Line 585: if (!(pred.getOp() == BinaryPredicate.Operator.EQ ||
> I think we need to adjust the handling of EQ here, I don't think we can pus
Done



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 8
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 14 Jan 2021 00:49:04 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Tim Armstrong (Code Review)
Hello Aman Sinha, Thomas Tauber-Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#9).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,232 insertions(+), 344 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/9
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 9
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16942/8/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/8/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@585
PS8, Line 585: if (!(pred.getOp() == BinaryPredicate.Operator.EQ ||
I think we need to adjust the handling of EQ here, I don't think we can push 
when the value is >= 1 since the predicate will filter out an arbitrary number 
of rows (maybe you can be clever in some cases, but I think that's generally 
true). E.g.

select * from (
  select d_date, i_item_id, ss_list_price,
rank() over (partition by d_date order by ss_list_price desc) rnk
  from store_sales ss
  join item i on ss_item_sk = i_item_sk
  join date_dim d on ss_sold_date_sk = d_date_sk
  where ss_list_price is not null) v
where rnk = 100
order by d_date, i_item_id
limit 50



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 8
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Jan 2021 22:51:10 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 8:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7989/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 8
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Jan 2021 18:19:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-13 Thread Tim Armstrong (Code Review)
Hello Aman Sinha, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#8).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,143 insertions(+), 343 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/8
-- 
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 8
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7988/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 7
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Jan 2021 05:35:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16942/7/tests/query_test/test_limit_pushdown_analytic.py
File tests/query_test/test_limit_pushdown_analytic.py:

http://gerrit.cloudera.org:8080/#/c/16942/7/tests/query_test/test_limit_pushdown_analytic.py@25
PS7, Line 25: class TestLimitPushdownAnalyticTpch(ImpalaTestSuite):
flake8: E302 expected 2 blank lines, found 1



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 7
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Jan 2021 05:14:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Tim Armstrong (Code Review)
Hello Aman Sinha, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#7).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira
* Add an end-to-end test on TPC-H with more data to exercise the
  tie-handling logic in the execnode more.

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M testdata/workloads/tpch/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
16 files changed, 1,143 insertions(+), 344 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/7
-- 
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 7
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 6:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7987/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 6
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Jan 2021 03:25:03 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16942/6/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/6/fe/src/main/java/org/apache/impala/planner/SortNode.java@158
PS6, Line 158:* This does the conversion to top-n if the converted node 
would pass the TOPN_BYTES_LIMIT
line too long (92 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 6
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Jan 2021 03:03:50 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Tim Armstrong (Code Review)
Hello Aman Sinha, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#6).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira

TODO:
* Add some end-to-end tests that stress tie-handling more

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

Ran TPC-H scale 30 on a single node, no significant perf change.

Ran a targeted query to check for regressions in the top-n node.
The for this targeted query did not change:

  use tpch30_parquet;
  set mt_dop=1;
  select l_extendedprice from lineitem
  order by 1 limit 100

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
15 files changed, 993 insertions(+), 341 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/6
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 6
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7983/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Jan 2021 02:47:48 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16942/5/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/5/fe/src/main/java/org/apache/impala/planner/SortNode.java@158
PS5, Line 158:* This does the conversion to top-n if the converted node 
would pass the TOPN_BYTES_LIMIT
line too long (92 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Jan 2021 02:26:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Tim Armstrong (Code Review)
Hello Aman Sinha, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#5).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

This also adds a check against TOPN_BYTES_LIMIT so that
the limit can't be increased to an arbitarily large value.

Testing:
* Add new planner test with negative case where it's rejected
  because the transformation is incorrect.
* Update other planner tests to reflect new limit calculation
  + tie handling required for correctness.
* Add planner test for very high rank predicate that overflows int32
* Add planner test that checks TOPN_BYTES_LIMIT handling
* Add planner test that checks that dense_rank() can't be pushed.
* Existing planner tests already have adequate coverage for predicates
  : <=, <, = and row_number().
* Add some end-to-end tests that repro bugs that fall under the jira

TODO:
* Add some end-to-end tests that stress tie-handling more

Perf:
Ran TPC-DS q67 with mt_dop=1 on a single node, confirmed there was
no measurable change in performance as a result of this patchset.

TODO: check for changes in regular top-n performance.

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
15 files changed, 990 insertions(+), 341 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/5
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7981/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 12 Jan 2021 20:14:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7980/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 12 Jan 2021 20:09:11 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Tim Armstrong (Code Review)
Hello Aman Sinha, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16942

to look at the new patch set (#4).

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

TODO: apply top-n threshold to avoid increasing limit enormously

Testing:
* Add new planner test with negative case where it's rejected
* Update other planner tests to reflect new limits + tie handling
* Add some end-to-end tests that repro bugs
TODO:
* Add planner tests for edge cases with limits - =, <, etc
* Add planner test for row_number()
* Add planner test for very high rank predicate that overflows int
* Add some end-to-end tests that stress tie-handling more
* test perf - q67

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
15 files changed, 790 insertions(+), 329 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/4
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 3:

I finally got around to fixing this - it required grabbing the tie handling 
logic from https://gerrit.cloudera.org/#/c/16242/ to fix it correctly.

I still need to add some additional tests, noted in the commit message, but I 
thought it was in a good state to get feedback on the approach


--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 12 Jan 2021 19:47:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#3). ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..

IMPALA-10296: Fix analytic limit pushdown when predicates are present

This fixes the analytic push down optimization for the case when where
the ORDER BY expressions are compatible with the partitioning of the
analytic *and* there is a rank() or row_number() predicate.

In this case the rows returned are going to come from the first partitions,
i.e. if the limit is 100, if we go through the partitions in order until
the row count adds up to 100, then we know that the rows must come from
those partitions.

The problem is that predicates can discard rows from the partitions,
meaning that a limit naively pushed down to the top-n will filter
out rows that could be returned from the query.

We can avoid the problem in the case where the partition limit >= limit
order by limit, however.

In this case the relevant set of partitions is the set of partitions
that include the first  rows, since the top-level limit
generally kicks in before the per-partition limit. The only twist
is that the orderings may be different within a partition, so we
need to make sure to include all of the rows in the final partition.

The solution implemented in this patch is to increase the pushed
down limit so that it was always guaranteed to include all of the
rows in the final partition to be returned. E.g. if you had a
row_number() <= 100 predicate and limit 100, if you pushed down
limit 200, then you'd be guaranteed to capture all of the rows
in the final partition. One case we need to handle is that,
in the case of a rank() predicate, we can have more than that
number of rows in the partition because of ties.

This patch implements tie handling in the backend (I took most
of that implementation from my in-progress partitioned top-n patch,
with the intention of rebasing that onto this patch).

TODO: apply top-n threshold to avoid increasing limit enormously

Testing:
* Add new planner test with negative case where it's rejected
* Update other planner tests to reflect new limits + tie handling
* Add some end-to-end tests that repro bugs
TODO:
* Add planner tests for edge cases with limits - =, <, etc
* Add planner test for row_number()
* Add planner test for very high rank predicate that overflows int
* Add some end-to-end tests that stress tie-handling more
* test perf - q67

Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
---
M be/src/exec/topn-node-ir.cc
M be/src/exec/topn-node.cc
M be/src/exec/topn-node.h
M be/src/util/tuple-row-compare.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/AnalyticExpr.java
M fe/src/main/java/org/apache/impala/analysis/AnalyticWindow.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test
M testdata/workloads/functional-planner/queries/PlannerTest/tpcds/tpcds-q67.test
A testdata/workloads/functional-query/queries/limit-pushdown-analytic.test
M tests/query_test/test_limit_pushdown_analytic.py
17 files changed, 798 insertions(+), 335 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/42/16942/3
--
To view, visit http://gerrit.cloudera.org:8080/16942
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong