[Impala-ASF-CR] IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

2024-03-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21190 )

Change subject: IMPALA-12894: Optimized count(*) for Iceberg gives wrong 
results after a Spark rewrite_data_files
..


Patch Set 3: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/21190
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
Gerrit-Change-Number: 21190
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 25 Mar 2024 19:15:44 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

2024-03-25 Thread Daniel Becker (Code Review)
Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21190 )

Change subject: IMPALA-12894: Optimized count(*) for Iceberg gives wrong 
results after a Spark rewrite_data_files
..


Patch Set 3:

(2 comments)

I've only gone through the non-test files so far.

http://gerrit.cloudera.org:8080/#/c/21190/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21190/3//COMMIT_MSG@10
PS3, Line 10: files. During analysis we check the existence of delete files
Could you describe the cause of the bug in more detail?


http://gerrit.cloudera.org:8080/#/c/21190/3/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
File fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java:

http://gerrit.cloudera.org:8080/#/c/21190/3/fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java@983
PS3, Line 983: use
Nit: superfluous "use".



--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
Gerrit-Change-Number: 21190
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 25 Mar 2024 14:58:13 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

2024-03-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21190 )

Change subject: IMPALA-12894: Optimized count(*) for Iceberg gives wrong 
results after a Spark rewrite_data_files
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/15650/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
Gerrit-Change-Number: 21190
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 25 Mar 2024 14:39:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

2024-03-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21190 )

Change subject: IMPALA-12894: Optimized count(*) for Iceberg gives wrong 
results after a Spark rewrite_data_files
..


Patch Set 3:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/10422/ 
DRY_RUN=true


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
Gerrit-Change-Number: 21190
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 25 Mar 2024 14:19:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

2024-03-25 Thread Zoltan Borok-Nagy (Code Review)
Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/21190

to look at the new patch set (#3).

Change subject: IMPALA-12894: Optimized count(*) for Iceberg gives wrong 
results after a Spark rewrite_data_files
..

IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark 
rewrite_data_files

Impala can return incorrect results if a table has dangling delete
files, i.e. delete files that no longer apply to any live data file.
During analysis we check for the existence of delete files based on
the snapshot summary, which still counts dangling delete files. During
planning, however, IcebergScanPlanner checks it based on planFiles(),
which does not return them. Because of this mismatch Impala can create
incorrect plans for the count(*) optimization.

This patch fixes the FeIcebergTable.hasDeleteFiles() method so that it
ignores dangling delete files. It also introduces a new query option,
"iceberg_disable_count_star_optimization", so users can completely
disable the statistics-based count(*) optimization if necessary.

Testing:
 * e2e tests
 * planner tests

Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables-hash-join.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes-orc.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes.test
11 files changed, 336 insertions(+), 433 deletions(-)
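
The mismatch described in the commit message above can be illustrated
with a small toy model. This is only a sketch under stated assumptions,
not Impala's or Iceberg's code: ToyIcebergTable, DeleteFile and all
function names below are hypothetical, and the frontend code touched by
the patch is Java. The model assumes that a summary-style counter sees
every delete file still tracked by the snapshot, while a
planFiles()-style scan only sees delete files that still apply to a
live data file.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DeleteFile:
    """Hypothetical stand-in for a position delete file."""
    path: str
    # Paths of the data files this delete file was written against.
    referenced_data_files: frozenset


@dataclass
class ToyIcebergTable:
    """Hypothetical stand-in for a table snapshot; not an Impala class."""
    live_data_files: set
    delete_files: list = field(default_factory=list)

    def summary_total_delete_files(self) -> int:
        # Models a snapshot-summary counter: every tracked delete file
        # is counted, whether or not it still applies to anything.
        return len(self.delete_files)

    def planned_delete_files(self) -> list:
        # Models a planFiles()-style scan: only delete files that still
        # reference at least one live data file are returned.
        return [d for d in self.delete_files
                if d.referenced_data_files & self.live_data_files]


def has_delete_files_from_summary(table: ToyIcebergTable) -> bool:
    # Check as described for the analysis phase before the fix.
    return table.summary_total_delete_files() > 0


def has_delete_files_ignoring_dangling(table: ToyIcebergTable) -> bool:
    # Check that agrees with planning: dangling delete files are ignored.
    return len(table.planned_delete_files()) > 0


def can_answer_count_from_stats(table: ToyIcebergTable,
                                disable_option: bool = False) -> bool:
    # Assumed gate: the option switches the shortcut off entirely;
    # otherwise it is only safe when no delete file applies.
    if disable_option:
        return False
    return not has_delete_files_ignoring_dangling(table)


# A table after compaction: the old data file is gone, but the delete
# file written against it is still tracked by the snapshot.
compacted = ToyIcebergTable(
    live_data_files={"data/compacted-0.parquet"},
    delete_files=[DeleteFile("data/delete-0.parquet",
                             frozenset({"data/old-0.parquet"}))],
)

print(has_delete_files_from_summary(compacted))
print(has_delete_files_ignoring_dangling(compacted))
```

In this model the two checks disagree for the compacted table, which is
the kind of inconsistency between analysis and planning that the commit
message attributes the wrong count(*) plans to. Passing
disable_option=True mimics the effect of setting
iceberg_disable_count_star_optimization.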


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/90/21190/3
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
Gerrit-Change-Number: 21190
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

2024-03-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21190 )

Change subject: IMPALA-12894: Optimized count(*) for Iceberg gives wrong 
results after a Spark rewrite_data_files
..


Patch Set 2: Verified+1


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
Gerrit-Change-Number: 21190
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Fri, 22 Mar 2024 23:36:59 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

2024-03-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21190 )

Change subject: IMPALA-12894: Optimized count(*) for Iceberg gives wrong 
results after a Spark rewrite_data_files
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/15635/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
Gerrit-Change-Number: 21190
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Fri, 22 Mar 2024 19:19:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

2024-03-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21190 )

Change subject: IMPALA-12894: Optimized count(*) for Iceberg gives wrong 
results after a Spark rewrite_data_files
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/15634/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
Gerrit-Change-Number: 21190
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Fri, 22 Mar 2024 19:10:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

2024-03-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21190 )

Change subject: IMPALA-12894: Optimized count(*) for Iceberg gives wrong 
results after a Spark rewrite_data_files
..


Patch Set 2:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/10416/ 
DRY_RUN=true


--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
Gerrit-Change-Number: 21190
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Fri, 22 Mar 2024 18:18:30 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

2024-03-22 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has uploaded a new patch set (#2). ( 
http://gerrit.cloudera.org:8080/21190 )

Change subject: IMPALA-12894: Optimized count(*) for Iceberg gives wrong 
results after a Spark rewrite_data_files
..

IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark 
rewrite_data_files

Impala can return incorrect results if a table has dangling delete
files, i.e. delete files that no longer apply to any live data file.
During analysis we check for the existence of delete files based on
the snapshot summary, which still counts dangling delete files. During
planning, however, IcebergScanPlanner checks it based on planFiles(),
which does not return them. Because of this mismatch Impala can create
incorrect plans for the count(*) optimization.

This patch fixes the FeIcebergTable.hasDeleteFiles() method so that it
ignores dangling delete files.

TODO:
 * introduce a query option so we can completely disable the count(*)
   optimization

Testing:
 * e2e tests
 * planner tests

Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
---
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables-hash-join.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes-orc.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes.test
7 files changed, 307 insertions(+), 431 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/90/21190/2
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
Gerrit-Change-Number: 21190
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files

2024-03-22 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/21190 )


Change subject: IMPALA-12894: Optimized count(*) for Iceberg gives wrong 
results after a Spark rewrite_data_files
..

IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark 
rewrite_data_files

Impala can return incorrect results if a table has dangling delete
files, i.e. delete files that no longer apply to any live data file.
During analysis we check for the existence of delete files based on
the snapshot summary, which still counts dangling delete files. During
planning, however, IcebergScanPlanner checks it based on planFiles(),
which does not return them. Because of this mismatch Impala can create
incorrect plans for the count(*) optimization.

This patch fixes the FeIcebergTable.hasDeleteFiles() method so that it
ignores dangling delete files.

TODO:
 * introduce a query option so we can completely disable the count(*)
   optimization

Testing:
 * e2e tests
 * planner tests

Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
---
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables-hash-join.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes-orc.test
M 
testdata/workloads/functional-query/queries/QueryTest/iceberg-v2-read-position-deletes.test
7 files changed, 307 insertions(+), 430 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/90/21190/1
--

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie3aca0b0a104f9ca4589cde9643f3f341d4ff99f
Gerrit-Change-Number: 21190
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy