[Impala-ASF-CR] IMPALA-12313: (part 3) Add UPDATE support for Iceberg tables.

Daniel Becker (Code Review) Tue, 12 Dec 2023 08:59:14 -0800

Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20760 )


Change subject: IMPALA-12313: (part 3) Add UPDATE support for Iceberg tables.
......................................................................


Patch Set 7:

(14 comments)

Finally gone through all of the files.

http://gerrit.cloudera.org:8080/#/c/20760/6/be/src/exec/table-sink-base.h
File be/src/exec/table-sink-base.h:

http://gerrit.cloudera.org:8080/#/c/20760/6/be/src/exec/table-sink-base.h@90
PS6, Line 90: The partition key is derived from 'row'.
We don't have 'row' anymore. We could add that the partition key is now 
supposed to be available in 'output_partition' (if this is correct).


http://gerrit.cloudera.org:8080/#/c/20760/6/be/src/exec/table-sink-base.cc
File be/src/exec/table-sink-base.cc:

http://gerrit.cloudera.org:8080/#/c/20760/6/be/src/exec/table-sink-base.cc@365
PS6, Line 365: OutputPartition
Is there a reason for using separate OutputPartition and vector parameters 
instead of PartitionPair? And not clearing the vector at the end of this 
function?


http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
File fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java:

http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@115
PS6, Line 115: If there are duplicates in the JOIN operator
Strictly speaking, we cannot check for duplicates even if there are none, so 
this clause is not needed, though it is useful to mention that the situation 
described below happens if there is a JOIN.


http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@126
PS6, Line 126: via UPDATE FROM
'modifyStmt_.fromClause_.size() > 1' suggests if there is only one table (or 
view) in the FROM clause then it should be ok. Is that viable?


http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java
File fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java:

http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java@101
PS6, Line 101:   public TSortingOrder getSortingOrder() {
Wouldn't it be better to not have 'getSortingOrder()' in DmlStatementBase but 
at a lower lever in the class hierarchy? Or would it make sense for 
OptimizeStmt too, it's just not implemented yet?


http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java
File fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java:

http://gerrit.cloudera.org:8080/#/c/20760/6/fe/src/main/java/org/apache/impala/planner/IcebergBufferedDeleteSink.java@34
PS6, Line 34: TableSink
Would it make sense to have an IcebergDeleteSinkBase in the Java code too? Some 
functions and fields seem to be the same or very similar here and in 
IcebergDeleteSink.


http://gerrit.cloudera.org:8080/#/c/20760/7/fe/src/main/java/org/apache/impala/planner/IcebergDeleteSink.java
File fe/src/main/java/org/apache/impala/planner/IcebergDeleteSink.java:

http://gerrit.cloudera.org:8080/#/c/20760/7/fe/src/main/java/org/apache/impala/planner/IcebergDeleteSink.java@76
PS7, Line 76: 1024L * 1024L
We could use ByteUnits.MEGABYTE, but see comment at 
IcebergBufferedDeleteSink.java:34.


http://gerrit.cloudera.org:8080/#/c/20760/7/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

http://gerrit.cloudera.org:8080/#/c/20760/7/fe/src/main/java/org/apache/impala/planner/Planner.java@926
PS7, Line 926: get
'partitionKeyExprs'?


http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/datasets/functional/functional_schema_template.sql@3407
PS7, Line 3407: INTO
Shouldn't this be an INSERT OVERWRITE TABLE, like for example on L3469? If we 
re-load the table without first deleting it then it will add extra rows.


http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/datasets/functional/functional_schema_template.sql@3407
PS7, Line 3407: iceberg_partition_transforms_zorder
Shouldn't this be '{table_name}'?


http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-update.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-update.test:

http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-v2-update.test@340
PS7, Line 340: from iceberg_partition_transforms_zorder ice_zorder, 
iceberg_partitioned source
Just curious: can we also use INNER JOIN syntax here?


http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test:

http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@252
PS7, Line 252: FROM clause
Is it important that it is a FROM clause with multiple tables?


http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-partitions.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-update-partitions.test:

http://gerrit.cloudera.org:8080/#/c/20760/7/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-partitions.test@1
PS7, Line 1: ====
Could also add a table with "sort by".


http://gerrit.cloudera.org:8080/#/c/20760/7/tests/stress/test_update_stress.py
File tests/stress/test_update_stress.py:

http://gerrit.cloudera.org:8080/#/c/20760/7/tests/stress/test_update_stress.py@65
PS7, Line 65: "ice_store_sales"
Nit: this could be included in the literal of 'stmt'.



--
To view, visit http://gerrit.cloudera.org:8080/20760
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bb97b4454165a292975d88dc9c23adb22ff7315
Gerrit-Change-Number: 20760
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Dec 2023 16:54:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-12313: (part 3) Add UPDATE support for Iceberg tables.

Reply via email to