Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20753 )

Change subject: IMPALA-12597: Basic Equality delete read support for Iceberg 
tables
......................................................................


Patch Set 3:

(22 comments)

Thanks for taking a look! I hope I addressed all the comments. I still have to 
add new test tables to this patch, though.

http://gerrit.cloudera.org:8080/#/c/20753/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20753/1//COMMIT_MSG@41
PS1, Line 41:
> Is there a Jira for these items? It would help linking this commit with oth
I added the epic jira here


http://gerrit.cloudera.org:8080/#/c/20753/1/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

http://gerrit.cloudera.org:8080/#/c/20753/1/be/src/exec/partitioned-hash-join-builder.h@87
PS1, Line 87: files.
> IS NOT DISTINCT FROM is a well-known SQL term, I think it would be better t
Done


http://gerrit.cloudera.org:8080/#/c/20753/1/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

http://gerrit.cloudera.org:8080/#/c/20753/1/common/thrift/CatalogObjects.thrift@625
PS1, Line 625: 8-bit Murmur
> This might have a better place in TIcebergTable. Though I see this is proba
Yes, for part1 I found it easier to put this into the ContentFileStore, while 
for the more general part this might have to be on the file descriptor level. 
I'd keep this here, next steps will remove this anyway.


http://gerrit.cloudera.org:8080/#/c/20753/2/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

http://gerrit.cloudera.org:8080/#/c/20753/2/common/thrift/CatalogObjects.thrift@635
PS2, Line 635: }
> nit: missing new line
Done


http://gerrit.cloudera.org:8080/#/c/20753/1/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/20753/1/common/thrift/PlanNodes.thrift@404
PS1, Line 404:  when we join Ice
> Again, I think we should keep the SQL terminology here, but keep the additi
I think I over complicated this part. I figured 'IS NOT DISTINCT FROM' (INDF?) 
is completely different than what I wanted to do with the Iceberg join 
conjuncts, but in facinstead of EQ conjuncts I can create INDF conjuncts and 
keep the BE code intact. Thanks


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
File fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java:

http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java@58
PS1, Line 58: except it returns True if both sides are
> Can we update this comment: except it returns True of both sides are NULLs.
Done


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java
File fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java:

http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@106
PS1, Line 106: TODO
> Could you please add IMPALA-12598?
Done


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@108
PS1, Line 108:  by th
> nit: columns
Done


http://gerrit.cloudera.org:8080/#/c/20753/2/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java
File fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java:

http://gerrit.cloudera.org:8080/#/c/20753/2/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@106
PS2, Line 106: TODO
> TODO without Jira ID.
Done


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java
File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java:

http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java@160
PS1, Line 160:
> This wouldn't be needed if we passed Operator.NOT_DISTINCT in the equality
Done


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java
File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java:

http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@137
PS1, Line 137: er_
> Is this right as equalityIds_ is not currently a List, but an (unsorted?) S
Ohh, that was the initial implementation, but then I changed it to a HashSet.


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@197
PS1, Line 197:
> nit: for readability, it might be worth to extract this condition to a 'noD
Done


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@258
PS1, Line 258:   Collections.emptyList()); /*skip
> nit: this is always needed for equality deletes, so this method call could
Done


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@323
PS1, Line 323:     BinaryPredicate filePathEq = null;
             :     BinaryPredicate posEq = null;
             :     for (SlotDescriptor deleteSlotDesc : 
deleteTupleDesc.getSlots()) {
             :       boolean foundM
> Maybe SingleNodePlanner.addSlotRefToDesc() could return he slot desc.
Makes sense. Changed addDataVirtualPositionSlots() above accordingly.


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@405
PS1, Line 405:
> This must be always true for non-virtual columns, right?
For non-virtual columns yes, this would always be true. However, if there are 
position delete files too in the table, then we'd have 2 more extra virtual 
columns here that I would like to simply skip with this if.


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@424
PS1, Line 424:
> table?
Done


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@429
PS1, Line 429:  leftSideOf
> Could we have Operator.NOT_DISTINCT here?
Done


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@464
PS1, Line 464:         Collections.emptyList(), /*nonIdentityConjuncts*/
             :         Collections.emptyList()); /*skippedConjuncts*/
             :     deleteScanNode.init(analyzer_);
             :
> Why do we have this restriction? We throw an error if there are files with
Those are the 2 cases I'm aware of. I can still loosen this a bit to keep the 
other 2 error cases but we could allow schema evolution for fields not used by 
equality deletes (basically dropping this check). I figured Flink can't do 
schema evolution anyway so throwing an error in this case would save us some 
testing efforts.


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@486
PS1, Line 486:       long dataFilesCacheMisses = 0;
             :       for (FileScanTask fileScanTask : fileScanTasks) {
             :         Expression residualExpr = fileScanTask.residual();
             :         if (residualExpr != null && !(residualExpr instanceof 
True)) {
             :           residualExpressions_.add(residualExpr);
             :         }
             :         Pair<FileDescriptor, Boolean> fileDesc = 
getFileDescriptor(fileScanTask.file());
             :         if (!fileDesc.second) ++dataFilesCacheMisses;
             :         if (fileScanTask.deletes().isEmpty()) {
             :           dataFilesWithoutDeletes_.add(file
> nit: This is similar to what we have for position delete tables at L249. Ca
Looks similar bun in fact L486-L489 creates a different type of table using 
different parameters so that part won't be that convenient to use both purposes.
L492 could be reused (with the tblRef name as param) but then adding the slots 
also are different. So I'd say that it's more readable this way than having a 
common function.


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@501
PS1, Line 501:     if (delFile.content(
> Maybe we could have a TODO+Jira about adding conjuncts that could be applie
I haven't thought about this but good point.


http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@517
PS1, Line 517:   }
> If we passed Operator.NOT_DISTINCT, we wouldn't need to set this, so the ha
Done


http://gerrit.cloudera.org:8080/#/c/20753/2/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test:

http://gerrit.cloudera.org:8080/#/c/20753/2/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@1179
PS2, Line 1179: |
> nit: trailing spaces
Done



--
To view, visit http://gerrit.cloudera.org:8080/20753
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2053e6f321c69f1c82059a84a5d99aeaa9814cad
Gerrit-Change-Number: 20753
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Dec 2023 14:36:04 +0000
Gerrit-HasComments: Yes

Reply via email to