Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/20753 )
Change subject: IMPALA-12597: Basic Equality delete read support for Iceberg tables ...................................................................... Patch Set 3: (22 comments) Thanks for taking a look! I hope I addressed all the comments. I still have to add new test tables to this patch, though. http://gerrit.cloudera.org:8080/#/c/20753/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20753/1//COMMIT_MSG@41 PS1, Line 41: > Is there a Jira for these items? It would help linking this commit with oth I added the epic jira here http://gerrit.cloudera.org:8080/#/c/20753/1/be/src/exec/partitioned-hash-join-builder.h File be/src/exec/partitioned-hash-join-builder.h: http://gerrit.cloudera.org:8080/#/c/20753/1/be/src/exec/partitioned-hash-join-builder.h@87 PS1, Line 87: files. > IS NOT DISTINCT FROM is a well-known SQL term, I think it would be better t Done http://gerrit.cloudera.org:8080/#/c/20753/1/common/thrift/CatalogObjects.thrift File common/thrift/CatalogObjects.thrift: http://gerrit.cloudera.org:8080/#/c/20753/1/common/thrift/CatalogObjects.thrift@625 PS1, Line 625: 8-bit Murmur > This might have a better place in TIcebergTable. Though I see this is proba Yes, for part1 I found it easier to put this into the ContentFileStore, while for the more general part this might have to be on the file descriptor level. I'd keep this here, next steps will remove this anyway. http://gerrit.cloudera.org:8080/#/c/20753/2/common/thrift/CatalogObjects.thrift File common/thrift/CatalogObjects.thrift: http://gerrit.cloudera.org:8080/#/c/20753/2/common/thrift/CatalogObjects.thrift@635 PS2, Line 635: } > nit: missing new line Done http://gerrit.cloudera.org:8080/#/c/20753/1/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/20753/1/common/thrift/PlanNodes.thrift@404 PS1, Line 404: when we join Ice > Again, I think we should keep the SQL terminology here, but keep the additi I think I over complicated this part. I figured 'IS NOT DISTINCT FROM' (INDF?) is completely different than what I wanted to do with the Iceberg join conjuncts, but in facinstead of EQ conjuncts I can create INDF conjuncts and keep the BE code intact. Thanks http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java File fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java: http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java@58 PS1, Line 58: except it returns True if both sides are > Can we update this comment: except it returns True of both sides are NULLs. Done http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java File fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java: http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@106 PS1, Line 106: TODO > Could you please add IMPALA-12598? Done http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@108 PS1, Line 108: by th > nit: columns Done http://gerrit.cloudera.org:8080/#/c/20753/2/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java File fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java: http://gerrit.cloudera.org:8080/#/c/20753/2/fe/src/main/java/org/apache/impala/catalog/IcebergContentFileStore.java@106 PS2, Line 106: TODO > TODO without Jira ID. Done http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java: http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java@160 PS1, Line 160: > This wouldn't be needed if we passed Operator.NOT_DISTINCT in the equality Done http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java File fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java: http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@137 PS1, Line 137: er_ > Is this right as equalityIds_ is not currently a List, but an (unsorted?) S Ohh, that was the initial implementation, but then I changed it to a HashSet. http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@197 PS1, Line 197: > nit: for readability, it might be worth to extract this condition to a 'noD Done http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@258 PS1, Line 258: Collections.emptyList()); /*skip > nit: this is always needed for equality deletes, so this method call could Done http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@323 PS1, Line 323: BinaryPredicate filePathEq = null; : BinaryPredicate posEq = null; : for (SlotDescriptor deleteSlotDesc : deleteTupleDesc.getSlots()) { : boolean foundM > Maybe SingleNodePlanner.addSlotRefToDesc() could return he slot desc. Makes sense. Changed addDataVirtualPositionSlots() above accordingly. http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@405 PS1, Line 405: > This must be always true for non-virtual columns, right? For non-virtual columns yes, this would always be true. However, if there are position delete files too in the table, then we'd have 2 more extra virtual columns here that I would like to simply skip with this if. http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@424 PS1, Line 424: > table? Done http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@429 PS1, Line 429: leftSideOf > Could we have Operator.NOT_DISTINCT here? Done http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@464 PS1, Line 464: Collections.emptyList(), /*nonIdentityConjuncts*/ : Collections.emptyList()); /*skippedConjuncts*/ : deleteScanNode.init(analyzer_); : > Why do we have this restriction? We throw an error if there are files with Those are the 2 cases I'm aware of. I can still loosen this a bit to keep the other 2 error cases but we could allow schema evolution for fields not used by equality deletes (basically dropping this check). I figured Flink can't do schema evolution anyway so throwing an error in this case would save us some testing efforts. http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@486 PS1, Line 486: long dataFilesCacheMisses = 0; : for (FileScanTask fileScanTask : fileScanTasks) { : Expression residualExpr = fileScanTask.residual(); : if (residualExpr != null && !(residualExpr instanceof True)) { : residualExpressions_.add(residualExpr); : } : Pair<FileDescriptor, Boolean> fileDesc = getFileDescriptor(fileScanTask.file()); : if (!fileDesc.second) ++dataFilesCacheMisses; : if (fileScanTask.deletes().isEmpty()) { : dataFilesWithoutDeletes_.add(file > nit: This is similar to what we have for position delete tables at L249. Ca Looks similar bun in fact L486-L489 creates a different type of table using different parameters so that part won't be that convenient to use both purposes. L492 could be reused (with the tblRef name as param) but then adding the slots also are different. So I'd say that it's more readable this way than having a common function. http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@501 PS1, Line 501: if (delFile.content( > Maybe we could have a TODO+Jira about adding conjuncts that could be applie I haven't thought about this but good point. http://gerrit.cloudera.org:8080/#/c/20753/1/fe/src/main/java/org/apache/impala/planner/IcebergScanPlanner.java@517 PS1, Line 517: } > If we passed Operator.NOT_DISTINCT, we wouldn't need to set this, so the ha Done http://gerrit.cloudera.org:8080/#/c/20753/2/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test: http://gerrit.cloudera.org:8080/#/c/20753/2/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-tables.test@1179 PS2, Line 1179: | > nit: trailing spaces Done -- To view, visit http://gerrit.cloudera.org:8080/20753 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2053e6f321c69f1c82059a84a5d99aeaa9814cad Gerrit-Change-Number: 20753 Gerrit-PatchSet: 3 Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Tamas Mate <tma...@apache.org> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Fri, 08 Dec 2023 14:36:04 +0000 Gerrit-HasComments: Yes