[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 16: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 16 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 21 Apr 2023 15:22:16 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg We push down predicates to Iceberg that uses them to filter out files when getting the results of planFiles(). Using the FileScanTask.residual() function we can find out if we have to use the predicates to further filter the rows of the given files or if Iceberg has already performed all the filtering. Basically if we only filter on IDENTITY-partition columns then Iceberg can filter the files and using these filters in Impala wouldn't filter any more rows from the output (assuming that no partition evolution was performed on the table). An additional benefit of not pushing down no-op predicates to the scanner is that we can potentially materialize less slots. For example: SELECT count(1) from iceberg_tbl where part_col = 10; Another additional benefit comes with count(*) queries. If all the predicates are skipped from being pushed to Impala's scanner for a count(*) query then the Parquet scanner can go to an optimized path where it uses stats instead of reading actual data to answer the query. In the above query Iceberg filters the files using the predicate on a partition column and then there won't be any need to materialize 'part_col' in Impala, nor to push down the 'part_col = 10' predicate. Note, this is an all or nothing approach, meaning that assuming N number of predicates we either push down all predicates to the scanner or none of them. There is a room for improvement to identify a subset of the predicates that we still have to push down to the scanner. However, for this we'd need a mapping between Impala predicates and the predicates returned by Iceberg's FileScanTask.residual() function that would significantly increase the complexity of the relevant code. Testing: - Some existing tests needed some extra care as they were checking for predicates being pushed down to the scanner, but with this patch not all of them are pushed down. For these tests I added some extra predicates to achieve that all of the predicates are pushed down to the scanner. - Added a new planner test suite for checking how predicate push down works with Iceberg tables. Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Reviewed-on: http://gerrit.cloudera.org:8080/19534 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins --- M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test 12 files changed, 331 insertions(+), 75 deletions(-) Approvals: Impala Public Jenkins: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 17 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 16: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9242/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 16 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 21 Apr 2023 10:10:07 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 16: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 16 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 21 Apr 2023 10:10:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 15: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/9238/ -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 15 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 20 Apr 2023 18:03:53 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 15: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 15 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 20 Apr 2023 12:43:20 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 15: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9238/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 15 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 20 Apr 2023 12:43:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 14: Code-Review+2 LGTM! -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 14 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 20 Apr 2023 12:18:46 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 14: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12827/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 14 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 20 Apr 2023 09:22:20 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19534 to look at the new patch set (#14). Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg We push down predicates to Iceberg that uses them to filter out files when getting the results of planFiles(). Using the FileScanTask.residual() function we can find out if we have to use the predicates to further filter the rows of the given files or if Iceberg has already performed all the filtering. Basically if we only filter on IDENTITY-partition columns then Iceberg can filter the files and using these filters in Impala wouldn't filter any more rows from the output (assuming that no partition evolution was performed on the table). An additional benefit of not pushing down no-op predicates to the scanner is that we can potentially materialize less slots. For example: SELECT count(1) from iceberg_tbl where part_col = 10; Another additional benefit comes with count(*) queries. If all the predicates are skipped from being pushed to Impala's scanner for a count(*) query then the Parquet scanner can go to an optimized path where it uses stats instead of reading actual data to answer the query. In the above query Iceberg filters the files using the predicate on a partition column and then there won't be any need to materialize 'part_col' in Impala, nor to push down the 'part_col = 10' predicate. Note, this is an all or nothing approach, meaning that assuming N number of predicates we either push down all predicates to the scanner or none of them. There is a room for improvement to identify a subset of the predicates that we still have to push down to the scanner. However, for this we'd need a mapping between Impala predicates and the predicates returned by Iceberg's FileScanTask.residual() function that would significantly increase the complexity of the relevant code. Testing: - Some existing tests needed some extra care as they were checking for predicates being pushed down to the scanner, but with this patch not all of them are pushed down. For these tests I added some extra predicates to achieve that all of the predicates are pushed down to the scanner. - Added a new planner test suite for checking how predicate push down works with Iceberg tables. Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b --- M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test 12 files changed, 331 insertions(+), 75 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/14 -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 14 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 13: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12815/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 13 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 18 Apr 2023 15:40:41 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 13: (2 comments) http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test: http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@63 PS12, Line 63: # > nit: did you want to add some comments here? Done http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@64 PS12, Line 64: # optimization kicks in for Parquet scanners. > Are we sure we get Parquet count(*) optimization here? You're right. When I ran this query from shell it worked as expected and I got a similar profile you sent. However, when running in PlannerTests this went wrong and I got the profile you see in the tests. It turned out that this is because initially HdfsScanNOde.fileFormats_ contains "PARQUET" and "ICEBERG" and then this is updated to "PARQUET" only. However, the PlannerTests don't do this update step and won't do the count(*) optimization as it's only for single-FF tables. I made a small adjustment to not store "ICEBERG" in HdfsScanNode.fileFormats_. Apparently, this solved the issue. Hope won't cause any regression but a full teat suite will show. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 13 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 18 Apr 2023 15:27:46 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19534 to look at the new patch set (#13). Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg We push down predicates to Iceberg that uses them to filter out files when getting the results of planFiles(). Using the FileScanTask.residual() function we can find out if we have to use the predicates to further filter the rows of the given files or if Iceberg has already performed all the filtering. Basically if we only filter on IDENTITY-partition columns then Iceberg can filter the files and using these filters in Impala wouldn't filter any more rows from the output (assuming that no partition evolution was performed on the table). An additional benefit of not pushing down no-op predicates to the scanner is that we can potentially materialize less slots. For example: SELECT count(1) from iceberg_tbl where part_col = 10; Another additional benefit comes with count(*) queries. If all the predicates are skipped from being pushed to Impala's scanner for a count(*) query then the Parquet scanner can go to an optimized path where it uses stats instead of reading actual data to answer the query. In the above query Iceberg filters the files using the predicate on a partition column and then there won't be any need to materialize 'part_col' in Impala, nor to push down the 'part_col = 10' predicate. Note, this is an all or nothing approach, meaning that assuming N number of predicates we either push down all predicates to the scanner or none of them. There is a room for improvement to identify a subset of the predicates that we still have to push down to the scanner. However, for this we'd need a mapping between Impala predicates and the predicates returned by Iceberg's FileScanTask.residual() function that would significantly increase the complexity of the relevant code. Testing: - Some existing tests needed some extra care as they were checking for predicates being pushed down to the scanner, but with this patch not all of them are pushed down. For these tests I added some extra predicates to achieve that all of the predicates are pushed down to the scanner. - Added a new planner test suite for checking how predicate push down works with Iceberg tables. Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b --- M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test 12 files changed, 328 insertions(+), 72 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/13 -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 13 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 12: (2 comments) http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test: http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@63 PS12, Line 63: # nit: did you want to add some comments here? http://gerrit.cloudera.org:8080/#/c/19534/12/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@64 PS12, Line 64: SELECT COUNT(*) FROM functional_parquet.iceberg_partitioned WHERE action = 'click'; Are we sure we get Parquet count(*) optimization here? Because I see the following plan in such a case: Query: explain SELECT COUNT(*) FROM functional_parquet.iceberg_partitioned where true +--+ | Explain String | +--+ | Max Per-Host Resource Reservation: Memory=8.00KB Threads=2 | | Per-Host Resource Estimates: Memory=10MB | | Codegen disabled by planner | | | | PLAN-ROOT SINK | | | | | 01:AGGREGATE [FINALIZE] | | | output: sum_init_zero(functional_parquet.iceberg_partitioned.stats: num_rows) | | | row-size=8B cardinality=1 | | | | | 00:SCAN HDFS [functional_parquet.iceberg_partitioned] | |HDFS partitions=1/1 files=20 size=22.90KB | |row-size=8B cardinality=20 | +--+ Please note the sum_init_zero aggregate function, plus the row-size of SCAN HDFS is not 0. I get a similar plan to yours when I run a select count(*) query on a plain text table that doesn't have this optimization, e.g.: explain SELECT COUNT(*) FROM functional.alltypes; -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 12 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 14 Apr 2023 13:30:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 11: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12798/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 11 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 14 Apr 2023 07:46:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19534 to look at the new patch set (#12). Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg We push down predicates to Iceberg that uses them to filter out files when getting the results of planFiles(). Using the FileScanTask.residual() function we can find out if we have to use the predicates to further filter the rows of the given files or if Iceberg has already performed all the filtering. Basically if we only filter on IDENTITY-partition columns then Iceberg can filter the files and using these filters in Impala wouldn't filter any more rows from the output (assuming that no partition evolution was performed on the table). An additional benefit of not pushing down no-op predicates to the scanner is that we can potentially materialize less slots. For example: SELECT count(1) from iceberg_tbl where part_col = 10; Another additional benefit comes with count(*) queries. If all the predicates are skipped from being pushed to Impala's scanner for a count(*) query then the Parquet scanner can go to an optimized path where it uses stats instead of reading actual data to answer the query. In the above query Iceberg filters the files using the predicate on a partition column and then there won't be any need to materialize 'part_col' in Impala, nor to push down the 'part_col = 10' predicate. Note, this is an all or nothing approach, meaning that assuming N number of predicates we either push down all predicates to the scanner or none of them. There is a room for improvement to identify a subset of the predicates that we still have to push down to the scanner. However, for this we'd need a mapping between Impala predicates and the predicates returned by Iceberg's FileScanTask.residual() function that would significantly increase the complexity of the relevant code. Testing: - Some existing tests needed some extra care as they were checking for predicates being pushed down to the scanner, but with this patch not all of them are pushed down. For these tests I added some extra predicates to achieve that all of the predicates are pushed down to the scanner. - Added a new planner test suite for checking how predicate push down works with Iceberg tables. Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b --- M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test 12 files changed, 341 insertions(+), 71 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/12 -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 12 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 9: (3 comments) http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@20 PS8, Line 20: materialize less slots > We don't have the query rewrite yet, but probably we still get the Parquet You're right, the Parquet scanner goes into the "count * optimisation path" and the query is answered from stats if there are no predicates pushed down for a count * query. I added a test to cover this. http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java: http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64 PS1, Line 64: super(id, tblRef.getDesc(), conjuncts, > Though if we have the following case: Ahh, good point. No worries, I'll re-retire 'nonIdentityConjuncts_' then :) http://gerrit.cloudera.org:8080/#/c/19534/9/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java: http://gerrit.cloudera.org:8080/#/c/19534/9/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@503 PS9, Line 503: anyMatch > Nit: instead of negating anyMatch, you could use noneMatch. Done Update: Anyway, I restored this function to the original version so this is not relevant anymore. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 14 Apr 2023 07:26:00 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19534 to look at the new patch set (#11). Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg We push down predicates to Iceberg that uses them to filter out files when getting the results of planFiles(). Using the FileScanTask.residual() function we can find out if we have to use the predicates to further filter the rows of the given files or if Iceberg has already performed all the filtering. Basically if we only filter on IDENTITY-partition columns then Iceberg can filter the files and using these filters in Impala wouldn't filter any more rows from the output (assuming that no partition evolution was performed on the table). An additional benefit of not pushing down no-op predicates to the scanner is that we can potentially materialize less slots. For example: SELECT count(1) from iceberg_tbl where part_col = 10; In the above query Iceberg filters the files using the predicate on a partition column and then there won't be any need to materialize 'part_col' in Impala, nor to push down the 'part_col = 10' predicate. Note, this is an all or nothing approach, meaning that assuming N number of predicates we either push down all predicates to the scanner or none of them. There is a room for improvement to identify a subset of the predicates that we still have to push down to the scanner. However, for this we'd need a mapping between Impala predicates and the predicates returned by Iceberg's FileScanTask.residual() function that would significantly increase the complexity of the relevant code. Testing: - Some existing tests needed some extra care as they were checking for predicates being pushed down to the scanner, but with this patch not all of them are pushed down. For these tests I added some extra predicates to achieve that all of the predicates are pushed down to the scanner. - Added a new planner test suite for checking how predicate push down works with Iceberg tables. Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b --- M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test 12 files changed, 341 insertions(+), 71 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/11 -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 11 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java: http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64 PS1, Line 64: private List skippedConjuncts; > I dropped it then. Though if we have the following case: part_id_col = 42 AND non_part_col < 10 (let's assume table is IDENTITY-partitioned by part_id_col) Then Iceberg is able to filter out partitions via part_id_col = 42, but we cannot skip evaluating part_id_col = 42 in the scanners because there is still a residual expression (non_part_col < 10). And now, if 'part_id_col = 42' will still be used to calculate the cardinality estimates, and it will probably cause very bad estimations. So maybe we can only retire 'nonIdentityConjuncts_' when have a better logic to determine which predicate is part of the Iceberg residual and which isn't. Sorry for the inconvenience. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 1 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 31 Mar 2023 09:54:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 9: Code-Review+1 (2 comments) http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java: http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@175 PS8, Line 175: h(ge > Unrelated to this patch and most probably would require fixing tests as wel Ok, I probably looked at it when I compared different versions with a rebase between them. http://gerrit.cloudera.org:8080/#/c/19534/9/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java: http://gerrit.cloudera.org:8080/#/c/19534/9/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@503 PS9, Line 503: anyMatch Nit: instead of negating anyMatch, you could use noneMatch. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 31 Mar 2023 09:18:08 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 9: Code-Review+1 (1 comment) http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@20 PS8, Line 20: materialize less slots > That count * optimization to get rid of the SCAN nodes when we eliminate al We don't have the query rewrite yet, but probably we still get the Parquet count-star optimization, i.e. a count-star slot is created and the scanners only read the footer, they create a single tuple per Parquet row group which holds the row count, i.e. no need to read the Parquet def/rep-levels, etc. Or what does the EXPLAIN say? Can we add a planner test for count(*)? -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 30 Mar 2023 14:50:48 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 9: (5 comments) http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@14 PS8, Line 14: IDENTITY-partitio > nit: IDENTITY-partition columns? Done http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@20 PS8, Line 20: materialize less slots > And in this example we not just materialize less slots, but we also get cou That count * optimization to get rid of the SCAN nodes when we eliminate all the predicates by this change is coming in "Part2" http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java: http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64 PS1, Line 64: super(id, tblRef.getDesc(), conjuncts, > Yeah, I think 'nonIdentityConjuncts_' can retire with this change, and it's I dropped it then. http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java: http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@175 PS8, Line 175: h(ge > We could write "file(s)". Unrelated to this patch and most probably would require fixing tests as well. We can do this in a separate ticket. http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@371 PS8, Line 371: ses; > See L175. see my above comment. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 30 Mar 2023 14:36:15 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12716/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 30 Mar 2023 14:15:13 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19534 to look at the new patch set (#9). Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg We push down predicates to Iceberg that uses them to filter out files when getting the results of planFiles(). Using the FileScanTask.residual() function we can find out if we have to use the predicates to further filter the rows of the given files or if Iceberg has already performed all the filtering. Basically if we only filter on IDENTITY-partition columns then Iceberg can filter the files and using these filters in Impala wouldn't filter any more rows from the output (assuming that no partition evolution was performed on the table). An additional benefit of not pushing down no-op predicates to the scanner is that we can potentially materialize less slots. For example: SELECT count(1) from iceberg_tbl where part_col = 10; In the above query Iceberg filters the files using the predicate on a partition column and then there won't be any need to materialize 'part_col' in Impala, nor to push down the 'part_col = 10' predicate. Note, this is an all or nothing approach, meaning that assuming N number of predicates we either push down all predicates to the scanner or none of them. There is a room for improvement to identify a subset of the predicates that we still have to push down to the scanner. However, for this we'd need a mapping between Impala predicates and the predicates returned by Iceberg's FileScanTask.residual() function that would significantly increase the complexity of the relevant code. Testing: - Some existing tests needed some extra care as they were checking for predicates being pushed down to the scanner, but with this patch not all of them are pushed down. For these tests I added some extra predicates to achieve that all of the predicates are pushed down to the scanner. - Added a new planner test suite for checking how predicate push down works with Iceberg tables. Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b --- M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test 12 files changed, 313 insertions(+), 113 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/9 -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 8: Code-Review+1 (4 comments) LGTM! http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@14 PS8, Line 14: partition columns nit: IDENTITY-partition columns? http://gerrit.cloudera.org:8080/#/c/19534/8//COMMIT_MSG@20 PS8, Line 20: materialize less slots And in this example we not just materialize less slots, but we also get count-star optimization. http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java: http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64 PS1, Line 64: > 'nonIdentityConjuncts_' is in fact a subset of 'conjuncts_' that doesn't in Yeah, I think 'nonIdentityConjuncts_' can retire with this change, and it's more precise to use 'conjuncts_', as it is already filtered by 'IcebergScanPlanner.filterConjuncts()'. http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test: http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@990 PS3, Line 990: event_time='2020-01-01 11:00:00'; > I had to check, but apparently we push down predicates of an HOUR() partiti I see, thanks for checking. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 8 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 29 Mar 2023 13:08:46 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 8: Code-Review+1 (2 comments) Looks good to me. http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java: http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@175 PS8, Line 175: file We could write "file(s)". http://gerrit.cloudera.org:8080/#/c/19534/8/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@371 PS8, Line 371: file See L175. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 8 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 29 Mar 2023 13:01:09 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 7: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12704/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 7 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 29 Mar 2023 07:29:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19534 to look at the new patch set (#7). Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg We push down predicates to Iceberg that uses them to filter out files when getting the results of planFiles(). Using the FileScanTask.residual() function we can find out if we have to use the predicates to further filter the rows of the given files or if Iceberg has already performed all the filtering. Basically if we only filter on partition columns then Iceberg can filter the files and using these filters in Impala wouldn't filter any more rows from the output (assuming that no partition evolution was performed on the table). An additional benefit of not pushing down no-op predicates to the scanner is that we can potentially materialize less slots. For example: SELECT count(1) from iceberg_tbl where part_col = 10; In the above query Iceberg filters the files using the predicate on a partition column and then there won't be any need to materialize 'part_col' in Impala, nor to push down the 'part_col = 10' predicate. Note, this is an all or nothing approach, meaning that assuming N number of predicates we either push down all predicates to the scanner or none of them. There is a room for improvement to identify a subset of the predicates that we still have to push down to the scanner. However, for this we'd need a mapping between Impala predicates and the predicates returned by Iceberg's FileScanTask.residual() function that would significantly increase the complexity of the relevant code. Testing: - Some existing tests needed some extra care as they were checking for predicates being pushed down to the scanner, but with this patch not all of them are pushed down. For these tests I added some extra predicates to achieve that all of the predicates are pushed down to the scanner. - Added a new planner test suite for checking how predicate push down works with Iceberg tables. Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b --- M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test 12 files changed, 309 insertions(+), 71 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/7 -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 7 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 6: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12701/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 6 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 28 Mar 2023 16:09:18 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 5: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/12700/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 5 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 28 Mar 2023 15:57:02 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/19534 ) Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. Patch Set 6: (17 comments) http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@16 PS1, Line 16: t (ass > Nit: filters. Done http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@17 PS1, Line 17: ). > Wouldn't "assuming" be better? Done http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@17 PS1, Line 17: > Nit: closing parenthesis should come at the end of the sentence. Done http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@20 PS1, Line 20: scanner is that we can potentially materialize les > Am I right that here we mean pushing down predicates to the scanner, as opp Done http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@20 PS1, Line 20: tentiall > not pushing down Done http://gerrit.cloudera.org:8080/#/c/19534/1//COMMIT_MSG@28 PS1, Line 28: > We could include a Testing section describing what tests have been added/mo Done http://gerrit.cloudera.org:8080/#/c/19534/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19534/3//COMMIT_MSG@20 PS3, Line 20: iall > nit: pushing down? Done http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1390 PS3, Line 1390: to > 'is' is not needed here? Done http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2027 PS3, Line 2027: indentPrefix > 'explainLevel' may not be the best name as it could be confused with the ve Done http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java File fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java: http://gerrit.cloudera.org:8080/#/c/19534/1/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java@64 PS1, Line 64: > The comment on 'nonIdentityConjuncts_' says that it is a subset of 'conjunc 'nonIdentityConjuncts_' is in fact a subset of 'conjuncts_' that doesn't include conjuncts on identity partitioned columns. 'skippedConjuncts_' indeed contains identity conjuncts but not necessarily all of them just the ones that won't filter any further rows. OTOH As I see 'nonIdentityConjuncts_' is used for cardinality estimates. However, I wonder if from this point on can we drop use 'conjuncts_' - 'skippedConjuncts_' for this purpose. @Zoltan, do you have a thought on this? http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java: http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@115 PS3, Line 115: has > Nit: has Done http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@241 PS3, Line 241: > I don't think we need this as this is the DELETE SCAN node which wouldn't e Done http://gerrit.cloudera.org:8080/#/c/19534/3/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@388 PS3, Line 388: : %s", getIceTable(). > nit: could be 'Collections.emptyList()' Done http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/datasets/functional/functional_schema_template.sql@3265 PS3, Line 3265: DEPENDENT_LOAD : # We can use 'date_string_col' as it is once IMPALA-11954 is done. : INSERT INTO {db_name}{db_suffix}.iceberg_partition_evolution > Since we are writing the table via Impala we probably don't need to set the Indeed, done http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test: http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test@49 PS3, Line 49: could no > Could not be? Indeed :) http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test: http://gerrit.cloudera.org:8080/#/c/19534/3/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@990 PS3, Line 990: event_time='2020-01-01 11:00:00'; > iceberg_partitioned is partitioned by HOUR(event_time), how c
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19534 to look at the new patch set (#6). Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg We push down predicates to Iceberg that uses them to filter out files when getting the results of planFiles(). Using the FileScanTask.residual() function we can find out if we have to use the predicates to further filter the rows of the given files or if Iceberg has already performed all the filtering. Basically if we only filter on partition columns then Iceberg can filter the files and using these filters in Impala wouldn't filter any more rows from the output (assuming that no partition evolution was performed on the table). An additional benefit of not pushing down no-op predicates to the scanner is that we can potentially materialize less slots. For example: SELECT count(1) from iceberg_tbl where part_col = 10; In the above query Iceberg filters the files using the predicate on a partition column and then there won't be any need to materialize 'part_col' in Impala, nor to push down the 'part_col = 10' predicate. Note, this is an all or nothing approach, meaning that assuming N number of predicates we either push down all predicates to the scanner or none of them. There is a room for improvement to identify a subset of the predicates that we still have to push down to the scanner. However, for this we'd need a mapping between Impala predicates and the predicates returned by Iceberg's FileScanTask.residual() function that would significantly increase the complexity of the relevant code. Testing: - Some existing tests needed some extra care as they were checking for predicates being pushed down to the scanner, but with this patch not all of them are pushed down. For these tests I added some extra predicates to achieve that all of the predicates are pushed down to the scanner. - Added a new planner test suite for checking how predicate push down works with Iceberg tables. Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b --- M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test 12 files changed, 311 insertions(+), 71 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/6 -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 6 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/19534 to look at the new patch set (#5). Change subject: IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg .. IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg We push down predicates to Iceberg that uses them to filter out files when getting the results of planFiles(). Using the FileScanTask.residual() function we can find out if we have to use the predicates to further filter the rows of the given files or if Iceberg has already performed all the filtering. Basically if we only filter on partition columns then Iceberg can filter the files and using these filters in Impala wouldn't filter any more rows from the output (assuming that no partition evolution was performed on the table). An additional benefit of not pushing down no-op predicates to the scanner is that we can potentially materialize less slots. For example: SELECT count(1) from iceberg_tbl where part_col = 10; In the above query Iceberg filters the files using the predicate on a partition column and then there won't be any need to materialize 'part_col' in Impala, nor to push down the 'part_col = 10' predicate. Note, this is an all or nothing approach, meaning that assuming N number of predicates we either push down all predicates to the scanner or none of them. There is a room for improvement to identify a subset of the predicates that we still have to push down to the scanner. However, for this we'd need a mapping between Impala predicates and the predicates returned by Iceberg's FileScanTask.residual() function that would significantly increase the complexity of the relevant code. Testing: - Some existing tests needed some extra care as they were checking for predicates being pushed down to the scanner, but with this patch not all of them are pushed down. For these tests I added some extra predicates to achieve that all of the predicates are pushed down to the scanner. - Added a new planner test suite for checking how predicate push down works with Iceberg tables. Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b --- M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/iceberg-predicates.test M testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test M testdata/workloads/functional-planner/queries/PlannerTest/tablesample.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-in-predicate-push-down.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test M testdata/workloads/functional-query/queries/QueryTest/iceberg-plain-count-star-optimization.test 12 files changed, 311 insertions(+), 71 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/19534/5 -- To view, visit http://gerrit.cloudera.org:8080/19534 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b Gerrit-Change-Number: 19534 Gerrit-PatchSet: 5 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy