[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 16: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 16
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 21 Apr 2023 15:22:16 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by 
Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on IDENTITY-partition columns then Iceberg
can filter the files and using these filters in Impala wouldn't filter
any more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

Another additional benefit comes with count(*) queries. If all the
predicates are skipped from being pushed to Impala's scanner for a
count(*) query then the Parquet scanner can go to an optimized path
where it uses stats instead of reading actual data to answer the query.

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
for predicates being pushed down to the scanner, but with this
patch not all of them are pushed down. For these tests I added some
extra predicates to achieve that all of the predicates are pushed
down to the scanner.
  - Added a new planner test suite for checking how predicate push down
works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Reviewed-on: http://gerrit.cloudera.org:8080/19534
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 331 insertions(+), 75 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 17
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 16:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9242/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 16
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 21 Apr 2023 10:10:07 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 16: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 16
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 21 Apr 2023 10:10:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 15: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/9238/


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 15
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 20 Apr 2023 18:03:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 15: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 15
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 20 Apr 2023 12:43:20 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 15:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9238/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 15
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 20 Apr 2023 12:43:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-20 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 14: Code-Review+2

LGTM!


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 14
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 20 Apr 2023 12:18:46 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 14:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12827/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 14
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 20 Apr 2023 09:22:20 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-20 Thread Gabor Kaszab (Code Review)
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#14).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by 
Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on IDENTITY-partition columns then Iceberg
can filter the files and using these filters in Impala wouldn't filter
any more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

Another additional benefit comes with count(*) queries. If all the
predicates are skipped from being pushed to Impala's scanner for a
count(*) query then the Parquet scanner can go to an optimized path
where it uses stats instead of reading actual data to answer the query.

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
for predicates being pushed down to the scanner, but with this
patch not all of them are pushed down. For these tests I added some
extra predicates to achieve that all of the predicates are pushed
down to the scanner.
  - Added a new planner test suite for checking how predicate push down
works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 331 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/14
--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 14
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 13:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12815/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 18 Apr 2023 15:40:41 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-18 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 13:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test:

http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@63
PS12, Line 63: #
> nit: did you want to add some comments here?
Done


http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@64
PS12, Line 64: # optimization kicks in for Parquet scanners.
> Are we sure we get Parquet count(*) optimization here?
You're right. When I ran this query from shell it worked as expected and I got 
a similar profile you sent. However, when running in PlannerTests this went 
wrong and I got the profile you see in the tests.

It turned out that this is because initially HdfsScanNOde.fileFormats_ contains 
"PARQUET" and "ICEBERG" and then this is updated to "PARQUET" only. However, 
the PlannerTests don't do this update step and won't do the count(*) 
optimization as it's only for single-FF tables.

I made a small adjustment to not store "ICEBERG" in HdfsScanNode.fileFormats_. 
Apparently, this solved the issue. Hope won't cause any regression but a full 
teat suite will show.



--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 18 Apr 2023 15:27:46 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-18 Thread Gabor Kaszab (Code Review)
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#13).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by 
Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on IDENTITY-partition columns then Iceberg
can filter the files and using these filters in Impala wouldn't filter
any more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

Another additional benefit comes with count(*) queries. If all the
predicates are skipped from being pushed to Impala's scanner for a
count(*) query then the Parquet scanner can go to an optimized path
where it uses stats instead of reading actual data to answer the query.

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
for predicates being pushed down to the scanner, but with this
patch not all of them are pushed down. For these tests I added some
extra predicates to achieve that all of the predicates are pushed
down to the scanner.
  - Added a new planner test suite for checking how predicate push down
works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 328 insertions(+), 72 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/13
--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-14 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 12:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test:

http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@63
PS12, Line 63: #
nit: did you want to add some comments here?


http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@64
PS12, Line 64: SELECT COUNT(*) FROM functional_parquet.iceberg_partitioned 
WHERE action = 'click';
Are we sure we get Parquet count(*) optimization here?

Because I see the following plan in such a case:

  Query: explain SELECT COUNT(*) FROM functional_parquet.iceberg_partitioned 
where true
 
+--+
 | Explain String   
|
 
+--+
 | Max Per-Host Resource Reservation: Memory=8.00KB Threads=2   
|
 | Per-Host Resource Estimates: Memory=10MB 
|
 | Codegen disabled by planner  
|
 |  
|
 | PLAN-ROOT SINK   
|
 | |
|
 | 01:AGGREGATE [FINALIZE]  
|
 | |  output: sum_init_zero(functional_parquet.iceberg_partitioned.stats: 
num_rows) |
 | |  row-size=8B cardinality=1 
|
 | |
|
 | 00:SCAN HDFS [functional_parquet.iceberg_partitioned]
|
 |HDFS partitions=1/1 files=20 size=22.90KB 
|
 |row-size=8B cardinality=20
|
 
+--+

Please note the sum_init_zero aggregate function, plus the row-size of SCAN 
HDFS is not 0.

I get a similar plan to yours when I run a select count(*) query on a plain 
text table that doesn't have this optimization, e.g.: explain SELECT COUNT(*) 
FROM functional.alltypes;



--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 12
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 14 Apr 2023 13:30:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 11:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12798/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 11
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 14 Apr 2023 07:46:15 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-14 Thread Gabor Kaszab (Code Review)
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#12).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by 
Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on IDENTITY-partition columns then Iceberg
can filter the files and using these filters in Impala wouldn't filter
any more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

Another additional benefit comes with count(*) queries. If all the
predicates are skipped from being pushed to Impala's scanner for a
count(*) query then the Parquet scanner can go to an optimized path
where it uses stats instead of reading actual data to answer the query.

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
for predicates being pushed down to the scanner, but with this
patch not all of them are pushed down. For these tests I added some
extra predicates to achieve that all of the predicates are pushed
down to the scanner.
  - Added a new planner test suite for checking how predicate push down
works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 341 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/12
--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 12
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-14 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 9:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@20
PS8, Line 20: materialize less slots
> We don't have the query rewrite yet, but probably we still get the Parquet
You're right, the Parquet scanner goes into the "count * optimisation path" and 
the query is answered from stats if there are no predicates pushed down for a 
count * query. I added a test to cover this.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64: super(id, tblRef.getDesc(), conjuncts,
> Though if we have the following case:
Ahh, good point. No worries, I'll re-retire 'nonIdentityConjuncts_' then :)


http://gerrit.cloudera.org:8080/#/c/19534/9/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/9/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@503
PS9, Line 503: anyMatch
> Nit: instead of negating anyMatch, you could use noneMatch.
Done
Update: Anyway, I restored this function to the original version so this is not 
relevant anymore.



--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 14 Apr 2023 07:26:00 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-04-14 Thread Gabor Kaszab (Code Review)
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#11).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by 
Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on IDENTITY-partition columns then Iceberg
can filter the files and using these filters in Impala wouldn't filter
any more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
for predicates being pushed down to the scanner, but with this
patch not all of them are pushed down. For these tests I added some
extra predicates to achieve that all of the predicates are pushed
down to the scanner.
  - Added a new planner test suite for checking how predicate push down
works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 341 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/11
--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 11
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-31 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64:   private List skippedConjuncts;
> I dropped it then.
Though if we have the following case:

 part_id_col = 42 AND non_part_col < 10

(let's assume table is IDENTITY-partitioned by part_id_col)

Then Iceberg is able to filter out partitions via part_id_col = 42, but we 
cannot skip evaluating part_id_col = 42 in the scanners because there is still 
a residual expression (non_part_col < 10).

And now, if 'part_id_col = 42' will still be used to calculate the cardinality 
estimates, and it will probably cause very bad estimations.

So maybe we can only retire 'nonIdentityConjuncts_' when have a better logic to 
determine which predicate is part of the Iceberg residual and which isn't.

Sorry for the inconvenience.



--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 31 Mar 2023 09:54:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-31 Thread Daniel Becker (Code Review)
Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 9: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@175
PS8, Line 175: h(ge
> Unrelated to this patch and most probably would require fixing tests as wel
Ok, I probably looked at it when I compared different versions with a rebase 
between them.


http://gerrit.cloudera.org:8080/#/c/19534/9/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/9/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@503
PS9, Line 503: anyMatch
Nit: instead of negating anyMatch, you could use noneMatch.



--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 31 Mar 2023 09:18:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-30 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 9: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@20
PS8, Line 20: materialize less slots
> That count * optimization to get rid of the SCAN nodes when we eliminate al
We don't have the query rewrite yet, but probably we still get the Parquet 
count-star optimization, i.e. a count-star slot is created and the scanners 
only read the footer, they create a single tuple per Parquet row group which 
holds the row count, i.e. no need to read the Parquet def/rep-levels, etc.

Or what does the EXPLAIN say? Can we add a planner test for count(*)?



--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 30 Mar 2023 14:50:48 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-30 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 9:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@14
PS8, Line 14: IDENTITY-partitio
> nit: IDENTITY-partition columns?
Done


http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@20
PS8, Line 20: materialize less slots
> And in this example we not just materialize less slots, but we also get cou
That count * optimization to get rid of the SCAN nodes when we eliminate all 
the predicates by this change is coming in "Part2"


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64: super(id, tblRef.getDesc(), conjuncts,
> Yeah, I think 'nonIdentityConjuncts_' can retire with this change, and it's
I dropped it then.


http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@175
PS8, Line 175: h(ge
> We could write "file(s)".
Unrelated to this patch and most probably would require fixing tests as well. 
We can do this in a separate ticket.


http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@371
PS8, Line 371: ses;
> See L175.
see my above comment.



--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 30 Mar 2023 14:36:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12716/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 30 Mar 2023 14:15:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-30 Thread Gabor Kaszab (Code Review)
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#9).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by 
Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on IDENTITY-partition columns then Iceberg
can filter the files and using these filters in Impala wouldn't filter
any more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
for predicates being pushed down to the scanner, but with this
patch not all of them are pushed down. For these tests I added some
extra predicates to achieve that all of the predicates are pushed
down to the scanner.
  - Added a new planner test suite for checking how predicate push down
works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 313 insertions(+), 113 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/9
--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-29 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 8: Code-Review+1

(4 comments)

LGTM!

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@14
PS8, Line 14: partition columns
nit: IDENTITY-partition columns?


http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@20
PS8, Line 20: materialize less slots
And in this example we not just materialize less slots, but we also get 
count-star optimization.


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64:
> 'nonIdentityConjuncts_' is in fact a subset of 'conjuncts_' that doesn't in
Yeah, I think 'nonIdentityConjuncts_' can retire with this change, and it's 
more precise to use 'conjuncts_', as it is already filtered by 
'IcebergScanPlanner.filterConjuncts()'.


http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test:

http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@990
PS3, Line 990: event_time='2020-01-01 11:00:00';
> I had to check, but apparently we push down predicates of an HOUR() partiti
I see, thanks for checking.



--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 8
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 29 Mar 2023 13:08:46 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-29 Thread Daniel Becker (Code Review)
Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 8: Code-Review+1

(2 comments)

Looks good to me.

http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@175
PS8, Line 175: file
We could write "file(s)".


http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@371
PS8, Line 371: file
See L175.



--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 8
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 29 Mar 2023 13:01:09 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12704/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 29 Mar 2023 07:29:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-29 Thread Gabor Kaszab (Code Review)
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#7).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by 
Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on partition columns then Iceberg can
filter the files and using these filters in Impala wouldn't filter any
more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
for predicates being pushed down to the scanner, but with this
patch not all of them are pushed down. For these tests I added some
extra predicates to achieve that all of the predicates are pushed
down to the scanner.
  - Added a new planner test suite for checking how predicate push down
works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 309 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/7
--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 6:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12701/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 28 Mar 2023 16:09:18 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12700/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 28 Mar 2023 15:57:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-28 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19534 )

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..


Patch Set 6:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@16
PS1, Line 16: t (ass
> Nit: filters.
Done


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@17
PS1, Line 17: ).
> Wouldn't "assuming" be better?
Done


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@17
PS1, Line 17:
> Nit: closing parenthesis should come at the end of the sentence.
Done


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@20
PS1, Line 20: scanner is that we can potentially materialize les
> Am I right that here we mean pushing down predicates to the scanner, as opp
Done


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@20
PS1, Line 20: tentiall
> not pushing down
Done


http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@28
PS1, Line 28:
> We could include a Testing section describing what tests have been added/mo
Done


http://gerrit.cloudera.org:8080/#/c/19534/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19534/3//COMMIT_MSG@20
PS3, Line 20: iall
> nit: pushing down?
Done


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1390
PS3, Line 1390: to
> 'is' is not needed here?
Done


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2027
PS3, Line 2027: indentPrefix
> 'explainLevel' may not be the best name as it could be confused with the ve
Done


http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java:

http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64
PS1, Line 64:
> The comment on 'nonIdentityConjuncts_' says that it is a subset of 'conjunc
'nonIdentityConjuncts_' is in fact a subset of 'conjuncts_' that doesn't 
include conjuncts on identity partitioned columns.
'skippedConjuncts_' indeed contains identity conjuncts but not necessarily all 
of them just the ones that won't filter any further rows.

OTOH As I see 'nonIdentityConjuncts_' is used for cardinality estimates. 
However, I wonder if from this point on can we drop use 'conjuncts_' - 
'skippedConjuncts_' for this purpose.
@Zoltan, do you have a thought on this?


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@115
PS3, Line 115: has
> Nit: has
Done


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@241
PS3, Line 241:
> I don't think we need this as this is the DELETE SCAN node which wouldn't e
Done


http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@388
PS3, Line 388: : %s", getIceTable().
> nit: could be 'Collections.emptyList()'
Done


http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/datasets/functional/functional_schema_template.sql@3265
PS3, Line 3265:  DEPENDENT_LOAD
  : # We can use 'date_string_col' as it is once IMPALA-11954 is 
done.
  : INSERT INTO {db_name}{db_suffix}.iceberg_partition_evolution
> Since we are writing the table via Impala we probably don't need to set the
Indeed, done


http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test:

http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@49
PS3, Line 49: could no
> Could not be?
Indeed :)


http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test:

http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@990
PS3, Line 990: event_time='2020-01-01 11:00:00';
> iceberg_partitioned is partitioned by HOUR(event_time), how c

[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-28 Thread Gabor Kaszab (Code Review)
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#6).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by 
Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on partition columns then Iceberg can
filter the files and using these filters in Impala wouldn't filter any
more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
for predicates being pushed down to the scanner, but with this
patch not all of them are pushed down. For these tests I added some
extra predicates to achieve that all of the predicates are pushed
down to the scanner.
  - Added a new planner test suite for checking how predicate push down
works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 311 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/6
--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg

2023-03-28 Thread Gabor Kaszab (Code Review)
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/19534

to look at the new patch set (#5).

Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if 
already applied by Iceberg
..

IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by 
Iceberg

We push down predicates to Iceberg that uses them to filter out files
when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out if we have to use
the predicates to further filter the rows of the given files or if
Iceberg has already performed all the filtering.
Basically if we only filter on partition columns then Iceberg can
filter the files and using these filters in Impala wouldn't filter any
more rows from the output (assuming that no partition evolution was
performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize less slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column and then there won't be any need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Note, this is an all or nothing approach, meaning that assuming N
number of predicates we either push down all predicates to the scanner
or none of them. There is a room for improvement to identify a subset
of the predicates that we still have to push down to the scanner.
However, for this we'd need a mapping between Impala predicates and the
predicates returned by Iceberg's FileScanTask.residual() function that
would significantly increase the complexity of the relevant code.

Testing:
  - Some existing tests needed some extra care as they were checking
for predicates being pushed down to the scanner, but with this
patch not all of them are pushed down. For these tests I added some
extra predicates to achieve that all of the predicates are pushed
down to the scanner.
  - Added a new planner test suite for checking how predicate push down
works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
---
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test
12 files changed, 311 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/5
--
To view, visit http://gerrit.cloudera.org:8080/19534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Gerrit-Change-Number: 19534
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy