Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/20760 )
Change subject: IMPALA-12313: (part 3) Add UPDATE support for Iceberg tables. ...................................................................... Patch Set 7: (14 comments) Finally gone through all of the files. http://gerrit.cloudera.org:8080/#/c/20760/6/be/src/exec/table-sink-base.h File be/src/exec/table-sink-base.h: http://gerrit.cloudera.org:8080/#/c/20760/6/be/src/exec/table-sink-base.h@90 PS6, Line 90: The partition key is derived from 'row'. We don't have 'row' anymore. We could add that the partition key is now supposed to be available in 'output_partition' (if this is correct). http://gerrit.cloudera.org:8080/#/c/20760/6/be/src/exec/table-sink-base.cc File be/src/exec/table-sink-base.cc: http://gerrit.cloudera.org:8080/#/c/20760/6/be/src/exec/table-sink-base.cc@365 PS6, Line 365: OutputPartition Is there a reason for using separate OutputPartition and vector parameters instead of PartitionPair? And not clearing the vector at the end of this function? http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java File fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java: http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@115 PS6, Line 115: If there are duplicates in the JOIN operator Strictly speaking, we cannot check for duplicates even if there are none, so this clause is not needed, though it is useful to mention that the situation described below happens if there is a JOIN. http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@126 PS6, Line 126: via UPDATE FROM 'modifyStmt_.fromClause_.size() > 1' suggests if there is only one table (or view) in the FROM clause then it should be ok. Is that viable? http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java File fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java: http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java@101 PS6, Line 101: public TSortingOrder getSortingOrder() { Wouldn't it be better to not have 'getSortingOrder()' in DmlStatementBase but at a lower lever in the class hierarchy? Or would it make sense for OptimizeStmt too, it's just not implemented yet? http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java File fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java: http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java@34 PS6, Line 34: TableSink Would it make sense to have an IcebergDeleteSinkBase in the Java code too? Some functions and fields seem to be the same or very similar here and in IcebergDeleteSink. http://gerrit.cloudera.org:8080/#/c/20760/7/fe/src/main/java/org/apache/impala/planner/IcebergDeleteSink.java File fe/src/main/java/org/apache/impala/planner/IcebergDeleteSink.java: http://gerrit.cloudera.org:8080/#/c/20760/7/fe/src/main/java/org/apache/impala/planner/IcebergDeleteSink.java@76 PS7, Line 76: 1024L * 1024L We could use ByteUnits.MEGABYTE, but see comment at IcebergBufferedDeleteSink.java:34. http://gerrit.cloudera.org:8080/#/c/20760/7/fe/src/main/java/org/apache/impala/planner/Planner.java File fe/src/main/java/org/apache/impala/planner/Planner.java: http://gerrit.cloudera.org:8080/#/c/20760/7/fe/src/main/java/org/apache/impala/planner/Planner.java@926 PS7, Line 926: get 'partitionKeyExprs'? http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/datasets/functional/functional_schema_template.sql@3407 PS7, Line 3407: INTO Shouldn't this be an INSERT OVERWRITE TABLE, like for example on L3469? If we re-load the table without first deleting it then it will add extra rows. http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/datasets/functional/functional_schema_template.sql@3407 PS7, Line 3407: iceberg_partition_transforms_zorder Shouldn't this be '{table_name}'? http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-update.test File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-update.test: http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-update.test@340 PS7, Line 340: from iceberg_partition_transforms_zorder ice_zorder, iceberg_partitioned source Just curious: can we also use INNER JOIN syntax here? http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test: http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@252 PS7, Line 252: FROM clause Is it important that it is a FROM clause with multiple tables? http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-partitions.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-update-partitions.test: http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-partitions.test@1 PS7, Line 1: ==== Could also add a table with "sort by". http://gerrit.cloudera.org:8080/#/c/20760/7/tests/stress/test_update_stress.py File tests/stress/test_update_stress.py: http://gerrit.cloudera.org:8080/#/c/20760/7/tests/stress/test_update_stress.py@65 PS7, Line 65: "ice_store_sales" Nit: this could be included in the literal of 'stmt'. -- To view, visit http://gerrit.cloudera.org:8080/20760 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bb97b4454165a292975d88dc9c23adb22ff7315 Gerrit-Change-Number: 20760 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Tamas Mate <tma...@apache.org> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Tue, 12 Dec 2023 16:54:58 +0000 Gerrit-HasComments: Yes