[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-07 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/20677


Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..

IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

This patch adds limited UPDATE support for Iceberg tables. The
limitations mean users cannot update Iceberg tables if any of
the followings are true:
 * UPDATE value partition column
 * UPDATE table that went through partition evolution
 * Table has SORT BY properties

The above limitations will be resolved by part 3.

This patch implements UPDATEs with the merge-on-read technique. This
means the UPDATE statement writes both data files and delete files.
Data files contain the updated records, delete files contain the
position delete records of the old data records that have been
touched.

To achieve the above this patch introduces a new sink: MultiDataSink.
We can configure multiple TableSinks for a single MultiDataSink object.
During execution, the row batches sent to the MultiDataSink will be
forwarded to all the TableSinks that have been registered.

The UPDATE statement for an Iceberg table creates a source select
statement with all table columns and virtual columns INPUT__FILE__NAME
and FILE__POSITION. E.g. imagine we have a table 'tbl' with schema
(i int, s string, k int), and we update the table with:

  UPDATE tbl SET k = 5 WHERE i % 100 = 11;

 The generated source statement will be ==>

  SELECT i, s, 5, INPUT__FILE__NAME, FILE__POSITION
  FROM tbl WHERE i % 100 = 11;

Then we create two table sinks that refer to expressions from the above
source statement:

  Insert sink (i, s, 5)
  Delete sink (INPUT__FILE__NAME, FILE__POSITION)

The tuples in the rowbatch of MultiDataSink contain slots for all the
above expressions (i, s, 5, INPUT__FILE__NAME, FILE__POSITION).
MultiDataSink forwards each row batch to each registered TableSink.
They will pick their relevant expressions from the tuple and write
data/delete files. The tuples are sorted by INPUTE__FILE__NAME and
FILE__POSITION because we need to write the delete records in this
order.

For partitioned tables we need to shuffle and sort the input tuples.
In this case we also add virtual columns "PARTITION__SPEC__ID" and
"ICEBERG__PARTITION__SERIALIZED" to the source statement and shuffle
and sort the rows based on them.

Data files and delete files are now separated in the DmlExecState, so
at the end of the operation we'll have two sets of files. We use these
two sets to create a new Iceberg snapshot.

Why this patch has the limitations?
 - Because we are shuffling and sorting rows based on the delete
   records and their partitions. This means that the new data files
   might not get written in an efficient way, e.g. there will be
   too many of them, or we will need to keep too many open file
   handles during writing.
   Also, if the table has SORT BY properties, we cannot respect
   it as the input rows are ordered in a way to favor the position
   deletes.
   Patch 3 will introduce a buffering writer for position delete
   files. This means we will shuffle and sort records based on
   the data records' partitions and SORT BY properties while
   delete records get buffered and written out at the end (sorted
   by file_path and position). In some edge cases the delete records
   might not get written efficiently, but it is a smaller problem
   then inefficient data files.

Testing:
 * negative tests
 * Basice e2e testing with all supported data types

Testing TODO:
 * partitioned tables
 * authz
 * planner test
 * Impala/Hive interop tests
 * concurrent tests

Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
---
M be/src/exec/CMakeLists.txt
M be/src/exec/data-sink.cc
M be/src/exec/iceberg-delete-sink.cc
M be/src/exec/iceberg-delete-sink.h
A be/src/exec/multi-table-sink.cc
A be/src/exec/multi-table-sink.h
M be/src/exec/table-sink-base.cc
M be/src/exec/table-sink-base.h
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/service/client-request-state.cc
M common/protobuf/control_service.proto
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/DmlStatementBase.java
M fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java
M fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
A fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java
M fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java
M fe/src/main/java/org/apache/impala/analysis/ModifyStmt.java
M fe/src/main/java/org/apache/impala/analysis/OptimizeStmt.java
M fe/src/mai

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/14359/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 07 Nov 2023 17:46:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-09 Thread Andrew Sherman (Code Review)
Andrew Sherman has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 1:

(11 comments)

I read through once and had some quick comments.
This not a proper review :-(

http://gerrit.cloudera.org:8080/#/c/20677/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20677/1//COMMIT_MSG@14
PS1, Line 14:  * Table has SORT BY properties
Is there also a restriction that TBLPROPERTIES must specify merge-on-read?
Edit: I see Impala does set this, but there may be other tables that don't have 
it?
Also only on iceberg v2 tables?


http://gerrit.cloudera.org:8080/#/c/20677/1//COMMIT_MSG@83
PS1, Line 83: Basice
Spelling: "Basic"


http://gerrit.cloudera.org:8080/#/c/20677/1/be/src/exec/multi-table-sink.h
File be/src/exec/multi-table-sink.h:

http://gerrit.cloudera.org:8080/#/c/20677/1/be/src/exec/multi-table-sink.h@32
PS1, Line 32:   DataSink* CreateSink(RuntimeState* state) const override;
It would be good at some point to have descriptions for all the methods.


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
File fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java:

http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java@65
PS1, Line 65: additinal
Spelling: additional


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
File fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@52
PS1, Line 52: protected
This and next line could be private I think


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@68
PS1, Line 68: spesc
Spelling: "specs"


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@160
PS1, Line 160: cast
Nit: "Cast"


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@161
PS1, Line 161: table
Nit: "table."


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java@42
PS1, Line 42:   abstract void buildAndValidateSelectExprs(Analyzer analyzer,
Add description


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java
File fe/src/main/java/org/apache/impala/planner/MultiDataSink.java:

http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@30
PS1, Line 30: public class MultiDataSink extends DataSink {
Add description


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/planner/Planner.java@276
PS1, Line 276: repartition
Nit: "Repartition"



--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Thu, 09 Nov 2023 17:45:10 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-13 Thread Noemi Pap-Takacs (Code Review)
Noemi Pap-Takacs has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 1:

(1 comment)

I could not read it all through, just one question.

http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java@167
PS1, Line 167: sortExprs_.add(ref);
'sortExprs_' is defined in ModifyImpl.java and has the following comment:

'For every column of the target table that is referenced in the optional 
'sort.columns' table property, this list will contain the corresponding result 
expr from 'resultExprs_'. Before insertion, all rows will be sorted by these 
exprs. If the list is empty, no additional sorting by non-partitioning columns 
will be performed.The column list must not contain partition columns and must 
be empty for non-Hdfs tables.'

This comment does not align with line 167.
As far as I understand, the comment suggests, that partitioning columns should 
not be added to 'sortExprs_'.

Should 'sortExprs_' be empty in KuduModifyImpl since Kudu is not a HDFS table? 
Does anything rely on this variable? Is it shown in the plan? Or maybe the 
comment needs to be updated?



--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Comment-Date: Mon, 13 Nov 2023 10:43:34 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-17 Thread Zoltan Borok-Nagy (Code Review)
Hello Andrew Sherman, Daniel Becker, Gabor Kaszab, Noemi Pap-Takacs, Impala 
Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20677

to look at the new patch set (#2).

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..

IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

This patch adds limited UPDATE support for Iceberg tables. The
limitations mean users cannot update Iceberg tables if any of
the followings are true:
 * UPDATE value of partitioning column
 * UPDATE table that went through partition evolution
 * Table has SORT BY properties

The above limitations will be resolved by part 3. The usual limitations
like writing non-Parquet files, using copy-on-write, modifying V1 tables
are out of scope of IMPALA-12313.

This patch implements UPDATEs with the merge-on-read technique. This
means the UPDATE statement writes both data files and delete files.
Data files contain the updated records, delete files contain the
position delete records of the old data records that have been
touched.

To achieve the above this patch introduces a new sink: MultiDataSink.
We can configure multiple TableSinks for a single MultiDataSink object.
During execution, the row batches sent to the MultiDataSink will be
forwarded to all the TableSinks that have been registered.

The UPDATE statement for an Iceberg table creates a source select
statement with all table columns and virtual columns INPUT__FILE__NAME
and FILE__POSITION. E.g. imagine we have a table 'tbl' with schema
(i int, s string, k int), and we update the table with:

  UPDATE tbl SET k = 5 WHERE i % 100 = 11;

 The generated source statement will be ==>

  SELECT i, s, 5, INPUT__FILE__NAME, FILE__POSITION
  FROM tbl WHERE i % 100 = 11;

Then we create two table sinks that refer to expressions from the above
source statement:

  Insert sink (i, s, 5)
  Delete sink (INPUT__FILE__NAME, FILE__POSITION)

The tuples in the rowbatch of MultiDataSink contain slots for all the
above expressions (i, s, 5, INPUT__FILE__NAME, FILE__POSITION).
MultiDataSink forwards each row batch to each registered TableSink.
They will pick their relevant expressions from the tuple and write
data/delete files. The tuples are sorted by INPUTE__FILE__NAME and
FILE__POSITION because we need to write the delete records in this
order.

For partitioned tables we need to shuffle and sort the input tuples.
In this case we also add virtual columns "PARTITION__SPEC__ID" and
"ICEBERG__PARTITION__SERIALIZED" to the source statement and shuffle
and sort the rows based on them.

Data files and delete files are now separated in the DmlExecState, so
at the end of the operation we'll have two sets of files. We use these
two sets to create a new Iceberg snapshot.

Why this patch has the limitations?
 - Because we are shuffling and sorting rows based on the delete
   records and their partitions. This means that the new data files
   might not get written in an efficient way, e.g. there will be
   too many of them, or we will need to keep too many open file
   handles during writing.
   Also, if the table has SORT BY properties, we cannot respect
   it as the input rows are ordered in a way to favor the position
   deletes.
   Patch 3 will introduce a buffering writer for position delete
   files. This means we will shuffle and sort records based on
   the data records' partitions and SORT BY properties while
   delete records get buffered and written out at the end (sorted
   by file_path and position). In some edge cases the delete records
   might not get written efficiently, but it is a smaller problem
   then inefficient data files.

Testing:
 * negative tests
 * planner tests
 * Basic e2e testing with all supported data types

Testing TODO:
 * partitioned tables
 * authz
 * Impala/Hive interop tests
 * concurrent tests

Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
---
M be/src/exec/CMakeLists.txt
M be/src/exec/data-sink.cc
M be/src/exec/iceberg-delete-sink.cc
M be/src/exec/iceberg-delete-sink.h
A be/src/exec/multi-table-sink.cc
A be/src/exec/multi-table-sink.h
M be/src/exec/table-sink-base.cc
M be/src/exec/table-sink-base.h
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/service/client-request-state.cc
M common/protobuf/control_service.proto
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/DmlStatementBase.java
M fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java
M fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
A fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/s

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/20677/2/be/src/exec/multi-table-sink.h
File be/src/exec/multi-table-sink.h:

http://gerrit.cloudera.org:8080/#/c/20677/2/be/src/exec/multi-table-sink.h@51
PS2, Line 51: /// MultiTableSink has multiple child table sink objects. It 
sends the received row batches
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/20677/2/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
File fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java:

http://gerrit.cloudera.org:8080/#/c/20677/2/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java@196
PS2, Line 196: for (Map.Entry tableIdEntry : 
additionalTargetTableIds_.entrySet()) {
line too long (91 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Comment-Date: Fri, 17 Nov 2023 15:10:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-17 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 2:

(12 comments)

Thanks for the comments. I'll add the remaining tests next week.

http://gerrit.cloudera.org:8080/#/c/20677/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20677/1//COMMIT_MSG@14
PS1, Line 14:  * Table has SORT BY properties
> Is there also a restriction that TBLPROPERTIES must specify merge-on-read?
I listed the limitations that I'm planning to resolve in the scope of 
IMPALA-12313. I added an extra sentence to make it clear.


http://gerrit.cloudera.org:8080/#/c/20677/1//COMMIT_MSG@83
PS1, Line 83: ting:
> Spelling: "Basic"
Done


http://gerrit.cloudera.org:8080/#/c/20677/1/be/src/exec/multi-table-sink.h
File be/src/exec/multi-table-sink.h:

http://gerrit.cloudera.org:8080/#/c/20677/1/be/src/exec/multi-table-sink.h@32
PS1, Line 32: class MultiTableSinkConfig : public DataSinkConfig {
> It would be good at some point to have descriptions for all the methods.
Added comments.


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
File fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java:

http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java@65
PS1, Line 65: additiona
> Spelling: additional
Done


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
File fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@52
PS1, Line 52: private L
> This and next line could be private I think
Done


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@68
PS1, Line 68: specs
> Spelling: "specs"
Done


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@160
PS1, Line 160: targ
> Nit: "Cast"
Done


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java@161
PS1, Line 161: n> co
> Nit: "table."
Done


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/KuduModifyImpl.java@167
PS1, Line 167: selectList.add(new S
> 'sortExprs_' is defined in ModifyImpl.java and has the following comment:
Thanks for catching this. It didn't cause a problem because

 * addPartitioningColumn() was actually unused => removed it
 * addKeyColumn was always invoked with isSortingColumn=false => removed this 
parameter.
 * Planner.createPreDmlSort() ignored the partition exprs, only looked at sort 
exprs => now it uses both to behave like the commit message suggests, and in 
IcebergDeleteImpl and IcebergUpdateImpl I don't add the partition exprs to 
'sortExprs_' anymore.


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java@42
PS1, Line 42:   /**
> Add description
Done


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java
File fe/src/main/java/org/apache/impala/planner/MultiDataSink.java:

http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@30
PS1, Line 30: /**
> Add description
Done


http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

http://gerrit.cloudera.org:8080/#/c/20677/1/fe/src/main/java/org/apache/impala/planner/Planner.java@276
PS1, Line 276: Repartition
> Nit: "Repartition"
Done



--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 17 Nov 2023 15:12:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/14473/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 17 Nov 2023 15:36:57 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-24 Thread Zoltan Borok-Nagy (Code Review)
Hello Andrew Sherman, Daniel Becker, Gabor Kaszab, Noemi Pap-Takacs, Impala 
Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20677

to look at the new patch set (#3).

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..

IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

This patch adds limited UPDATE support for Iceberg tables. The
limitations mean users cannot update Iceberg tables if any of
the followings are true:
 * UPDATE value of partitioning column
 * UPDATE table that went through partition evolution
 * Table has SORT BY properties

The above limitations will be resolved by part 3. The usual limitations
like writing non-Parquet files, using copy-on-write, modifying V1 tables
are out of scope of IMPALA-12313.

This patch implements UPDATEs with the merge-on-read technique. This
means the UPDATE statement writes both data files and delete files.
Data files contain the updated records, delete files contain the
position delete records of the old data records that have been
touched.

To achieve the above this patch introduces a new sink: MultiDataSink.
We can configure multiple TableSinks for a single MultiDataSink object.
During execution, the row batches sent to the MultiDataSink will be
forwarded to all the TableSinks that have been registered.

The UPDATE statement for an Iceberg table creates a source select
statement with all table columns and virtual columns INPUT__FILE__NAME
and FILE__POSITION. E.g. imagine we have a table 'tbl' with schema
(i int, s string, k int), and we update the table with:

  UPDATE tbl SET k = 5 WHERE i % 100 = 11;

 The generated source statement will be ==>

  SELECT i, s, 5, INPUT__FILE__NAME, FILE__POSITION
  FROM tbl WHERE i % 100 = 11;

Then we create two table sinks that refer to expressions from the above
source statement:

  Insert sink (i, s, 5)
  Delete sink (INPUT__FILE__NAME, FILE__POSITION)

The tuples in the rowbatch of MultiDataSink contain slots for all the
above expressions (i, s, 5, INPUT__FILE__NAME, FILE__POSITION).
MultiDataSink forwards each row batch to each registered TableSink.
They will pick their relevant expressions from the tuple and write
data/delete files. The tuples are sorted by INPUTE__FILE__NAME and
FILE__POSITION because we need to write the delete records in this
order.

For partitioned tables we need to shuffle and sort the input tuples.
In this case we also add virtual columns "PARTITION__SPEC__ID" and
"ICEBERG__PARTITION__SERIALIZED" to the source statement and shuffle
and sort the rows based on them.

Data files and delete files are now separated in the DmlExecState, so
at the end of the operation we'll have two sets of files. We use these
two sets to create a new Iceberg snapshot.

Why this patch has the limitations?
 - Because we are shuffling and sorting rows based on the delete
   records and their partitions. This means that the new data files
   might not get written in an efficient way, e.g. there will be
   too many of them, or we will need to keep too many open file
   handles during writing.
   Also, if the table has SORT BY properties, we cannot respect
   it as the input rows are ordered in a way to favor the position
   deletes.
   Patch 3 will introduce a buffering writer for position delete
   files. This means we will shuffle and sort records based on
   the data records' partitions and SORT BY properties while
   delete records get buffered and written out at the end (sorted
   by file_path and position). In some edge cases the delete records
   might not get written efficiently, but it is a smaller problem
   then inefficient data files.

Testing:
 * negative tests
 * planner tests
 * update all supported data types
 * partitioned tables
 * Impala/Hive interop tests
 * authz tests
 * concurrent tests

Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
---
M be/src/exec/CMakeLists.txt
M be/src/exec/data-sink.cc
M be/src/exec/iceberg-delete-sink.cc
M be/src/exec/iceberg-delete-sink.h
A be/src/exec/multi-table-sink.cc
A be/src/exec/multi-table-sink.h
M be/src/exec/table-sink-base.cc
M be/src/exec/table-sink-base.h
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/service/client-request-state.cc
M common/protobuf/control_service.proto
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/DmlStatementBase.java
M fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java
M fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
A fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/i

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/14517/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 24 Nov 2023 20:13:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 3:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9958/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 24 Nov 2023 20:14:12 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 3: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/9958/


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 25 Nov 2023 00:41:11 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-30 Thread Daniel Becker (Code Review)
Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 3:

(50 comments)

http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@10
PS3, Line 10: any of
: the followings are
Nit: "any of the following is".


http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@46
PS3, Line 46:   Insert sink (i, s, 5)
Just an idea: How difficult / expensive would it be to check the rows we update 
whether the new row value differs from the old one? We could skip rows that 
already have the desired value.

Even if it's worth the effort it should probably be in a different change.


http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@66
PS3, Line 66: Why this patch has
Nit: "Why does this patch have..." would be better.


http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@80
PS3, Line 80: smaller problem
:then inefficient data files
Why is it a smaller problem? Is it because we expect that there is more data in 
data files compared to delete files?


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/iceberg-delete-sink.cc
File be/src/exec/iceberg-delete-sink.cc:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/iceberg-delete-sink.cc@117
PS3, Line 117: existing_partition
Not modified in this change, but does this variable do anything?


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/iceberg-delete-sink.cc@137
PS3, Line 137: dml_exec_state_
Does 'dml_exec_state_' need to be the same as 'state->dml_exec_state()', which 
was used before this patch? If so, we could add a DCHECK.


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.h
File be/src/exec/multi-table-sink.h:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.h@69
PS3, Line 69:   /// END.
I'd consider writing the explanatory comment of BEGIN also here ("Following 
methods just delegate...") because in case other methods are added it could 
become difficult to understand what END refers to.


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.cc
File be/src/exec/multi-table-sink.cc:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.cc@52
PS3, Line 52: (TableSinkBase*)
Instead of a C-style cast we could use static_cast, maybe with a DCHECK 
containing a dynamic cast.


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.cc@94
PS3, Line 94:   closed_ = true;
Shouldn't DataSink::Close() already set 'closed_' to true? We could also 
replace it with DCHECK(closed_).


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/table-sink-base.cc
File be/src/exec/table-sink-base.cc:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/table-sink-base.cc@419
PS3, Line 419:   if (dml_exec_state == nullptr) dml_exec_state = 
state->dml_exec_state();
When is it possible that 'dml_exec_state' is not null and is not the same as 
'state->dml_exec_state()'? For example, is it possible if 'is_delete' is false?


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/table-sink-base.cc@430
PS3, Line 430: IsIceberg()
If 'is_delete' is true, shouldn't this always be true also? We could add a 
DCHECK for it.


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/runtime/dml-exec-state.cc
File be/src/runtime/dml-exec-state.cc:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/runtime/dml-exec-state.cc@483
PS3, Line 483: ASDF
What does ASDF mean?


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/runtime/dml-exec-state.cc@565
PS3, Line 565: is_iceberg
Shouldn't 'is_iceberg' always be true for delete files?


http://gerrit.cloudera.org:8080/#/c/20677/3/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/20677/3/common/protobuf/control_service.proto@95
PS3, Line 95: 8
Did you intentionally skip 7?


http://gerrit.cloudera.org:8080/#/c/20677/3/common/thrift/DataSinks.thrift
File common/thrift/DataSinks.thrift:

http://gerrit.cloudera.org:8080/#/c/20677/3/common/thrift/DataSinks.thrift@187
PS3, Line 187:   10: optional list data_sinks
Maybe 'child_data_sinks' would be better.
Could also add a comment that it is used for the children of MULTI_DATA_SINKs.


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
File fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java:

http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java@65
PS3, Line 65:   private final Map additionalTargetTableIds_ = 
new HashMap<>();
Shouldn't this come directly after 'targetTable_'?


http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java
Fi

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-11-30 Thread Daniel Becker (Code Review)
Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/util/IcebergUtil.java
File fe/src/main/java/org/apache/impala/util/IcebergUtil.java:

http://gerrit.cloudera.org:8080/#/c/20677/3/fe/src/main/java/org/apache/impala/util/IcebergUtil.java@1155
PS3, Line 1155: getIcebergPartitionTransformExpr
The two 'getIcebergPartitionTransformExpr()' functions are only used here. Do 
they need to be public?



--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 30 Nov 2023 13:36:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/20677/4/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/20677/4/tests/query_test/test_iceberg.py@1321
PS4, Line 1321: d
flake8: E999 SyntaxError: invalid syntax


http://gerrit.cloudera.org:8080/#/c/20677/4/tests/stress/test_update_stress.py
File tests/stress/test_update_stress.py:

http://gerrit.cloudera.org:8080/#/c/20677/4/tests/stress/test_update_stress.py@100
PS4, Line 100: u
flake8: F821 undefined name 'updates'


http://gerrit.cloudera.org:8080/#/c/20677/4/tests/stress/test_update_stress.py@101
PS4, Line 101: u
flake8: F821 undefined name 'updates'


http://gerrit.cloudera.org:8080/#/c/20677/4/tests/stress/test_update_stress.py@102
PS4, Line 102: u
flake8: F821 undefined name 'updates'



-- 
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 01 Dec 2023 14:49:48 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-01 Thread Zoltan Borok-Nagy (Code Review)
Hello Andrew Sherman, Tamas Mate, Daniel Becker, Gabor Kaszab, Noemi 
Pap-Takacs, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20677

to look at the new patch set (#4).

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..

IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

This patch adds limited UPDATE support for Iceberg tables. The
limitations mean users cannot update Iceberg tables if any of
the following is true:
 * UPDATE value of partitioning column
 * UPDATE table that went through partition evolution
 * Table has SORT BY properties

The above limitations will be resolved by part 3. The usual limitations
like writing non-Parquet files, using copy-on-write, modifying V1 tables
are out of scope of IMPALA-12313.

This patch implements UPDATEs with the merge-on-read technique. This
means the UPDATE statement writes both data files and delete files.
Data files contain the updated records, delete files contain the
position delete records of the old data records that have been
touched.

To achieve the above this patch introduces a new sink: MultiDataSink.
We can configure multiple TableSinks for a single MultiDataSink object.
During execution, the row batches sent to the MultiDataSink will be
forwarded to all the TableSinks that have been registered.

The UPDATE statement for an Iceberg table creates a source select
statement with all table columns and virtual columns INPUT__FILE__NAME
and FILE__POSITION. E.g. imagine we have a table 'tbl' with schema
(i int, s string, k int), and we update the table with:

  UPDATE tbl SET k = 5 WHERE i % 100 = 11;

 The generated source statement will be ==>

  SELECT i, s, 5, INPUT__FILE__NAME, FILE__POSITION
  FROM tbl WHERE i % 100 = 11;

Then we create two table sinks that refer to expressions from the above
source statement:

  Insert sink (i, s, 5)
  Delete sink (INPUT__FILE__NAME, FILE__POSITION)

The tuples in the rowbatch of MultiDataSink contain slots for all the
above expressions (i, s, 5, INPUT__FILE__NAME, FILE__POSITION).
MultiDataSink forwards each row batch to each registered TableSink.
They will pick their relevant expressions from the tuple and write
data/delete files. The tuples are sorted by INPUTE__FILE__NAME and
FILE__POSITION because we need to write the delete records in this
order.

For partitioned tables we need to shuffle and sort the input tuples.
In this case we also add virtual columns "PARTITION__SPEC__ID" and
"ICEBERG__PARTITION__SERIALIZED" to the source statement and shuffle
and sort the rows based on them.

Data files and delete files are now separated in the DmlExecState, so
at the end of the operation we'll have two sets of files. We use these
two sets to create a new Iceberg snapshot.

Why does this patch has the limitations?
 - Because we are shuffling and sorting rows based on the delete
   records and their partitions. This means that the new data files
   might not get written in an efficient way, e.g. there will be
   too many of them, or we will need to keep too many open file
   handles during writing.
   Also, if the table has SORT BY properties, we cannot respect
   it as the input rows are ordered in a way to favor the position
   deletes.
   Patch 3 will introduce a buffering writer for position delete
   files. This means we will shuffle and sort records based on
   the data records' partitions and SORT BY properties while
   delete records get buffered and written out at the end (sorted
   by file_path and position). In some edge cases the delete records
   might not get written efficiently, but it is a smaller problem
   then inefficient data files.

Testing:
 * negative tests
 * planner tests
 * update all supported data types
 * partitioned tables
 * Impala/Hive interop tests
 * authz tests
 * concurrent tests

Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
---
M be/src/exec/CMakeLists.txt
M be/src/exec/data-sink.cc
M be/src/exec/iceberg-delete-sink.cc
M be/src/exec/iceberg-delete-sink.h
A be/src/exec/multi-table-sink.cc
A be/src/exec/multi-table-sink.h
M be/src/exec/table-sink-base.cc
M be/src/exec/table-sink-base.h
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/service/client-request-state.cc
M common/protobuf/control_service.proto
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/DmlStatementBase.java
M fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java
M fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
A fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/ja

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9977/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 01 Dec 2023 14:56:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-01 Thread Zoltan Borok-Nagy (Code Review)
Hello Andrew Sherman, Tamas Mate, Daniel Becker, Gabor Kaszab, Noemi 
Pap-Takacs, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20677

to look at the new patch set (#5).

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..

IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

This patch adds limited UPDATE support for Iceberg tables. The
limitations mean users cannot update Iceberg tables if any of
the following is true:
 * UPDATE value of partitioning column
 * UPDATE table that went through partition evolution
 * Table has SORT BY properties

The above limitations will be resolved by part 3. The usual limitations
like writing non-Parquet files, using copy-on-write, modifying V1 tables
are out of scope of IMPALA-12313.

This patch implements UPDATEs with the merge-on-read technique. This
means the UPDATE statement writes both data files and delete files.
Data files contain the updated records, delete files contain the
position delete records of the old data records that have been
touched.

To achieve the above this patch introduces a new sink: MultiDataSink.
We can configure multiple TableSinks for a single MultiDataSink object.
During execution, the row batches sent to the MultiDataSink will be
forwarded to all the TableSinks that have been registered.

The UPDATE statement for an Iceberg table creates a source select
statement with all table columns and virtual columns INPUT__FILE__NAME
and FILE__POSITION. E.g. imagine we have a table 'tbl' with schema
(i int, s string, k int), and we update the table with:

  UPDATE tbl SET k = 5 WHERE i % 100 = 11;

 The generated source statement will be ==>

  SELECT i, s, 5, INPUT__FILE__NAME, FILE__POSITION
  FROM tbl WHERE i % 100 = 11;

Then we create two table sinks that refer to expressions from the above
source statement:

  Insert sink (i, s, 5)
  Delete sink (INPUT__FILE__NAME, FILE__POSITION)

The tuples in the rowbatch of MultiDataSink contain slots for all the
above expressions (i, s, 5, INPUT__FILE__NAME, FILE__POSITION).
MultiDataSink forwards each row batch to each registered TableSink.
They will pick their relevant expressions from the tuple and write
data/delete files. The tuples are sorted by INPUTE__FILE__NAME and
FILE__POSITION because we need to write the delete records in this
order.

For partitioned tables we need to shuffle and sort the input tuples.
In this case we also add virtual columns "PARTITION__SPEC__ID" and
"ICEBERG__PARTITION__SERIALIZED" to the source statement and shuffle
and sort the rows based on them.

Data files and delete files are now separated in the DmlExecState, so
at the end of the operation we'll have two sets of files. We use these
two sets to create a new Iceberg snapshot.

Why does this patch has the limitations?
 - Because we are shuffling and sorting rows based on the delete
   records and their partitions. This means that the new data files
   might not get written in an efficient way, e.g. there will be
   too many of them, or we will need to keep too many open file
   handles during writing.
   Also, if the table has SORT BY properties, we cannot respect
   it as the input rows are ordered in a way to favor the position
   deletes.
   Patch 3 will introduce a buffering writer for position delete
   files. This means we will shuffle and sort records based on
   the data records' partitions and SORT BY properties while
   delete records get buffered and written out at the end (sorted
   by file_path and position). In some edge cases the delete records
   might not get written efficiently, but it is a smaller problem
   then inefficient data files.

Testing:
 * negative tests
 * planner tests
 * update all supported data types
 * partitioned tables
 * Impala/Hive interop tests
 * authz tests
 * concurrent tests

Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
---
M be/src/exec/CMakeLists.txt
M be/src/exec/data-sink.cc
M be/src/exec/iceberg-delete-sink.cc
M be/src/exec/iceberg-delete-sink.h
A be/src/exec/multi-table-sink.cc
A be/src/exec/multi-table-sink.h
M be/src/exec/table-sink-base.cc
M be/src/exec/table-sink-base.h
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/service/client-request-state.cc
M common/protobuf/control_service.proto
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/DmlStatementBase.java
M fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java
M fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
A fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/ja

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-01 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 4:

(55 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@10
PS3, Line 10: any of
: the following is t
> Nit: "any of the following is".
Done


http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@46
PS3, Line 46:   Insert sink (i, s, 5)
> Just an idea: How difficult / expensive would it be to check the rows we up
That's a good idea. In simple cases this might be doable by adding extra 
predicates:

E.g.:
  UPDATE tbl SET k = 3 WHERE i > 4;

==>

  UPDATE tbl SET k = 3 WHERE i > 4 AND k != 3;

Other than these planner-side extra predicates, I wouldn't complicate the 
backend executors. But yeah, let's think more about this and do it separately. 
Filed IMPALA-12588.


http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@66
PS3, Line 66: Why does this patc
> Nit: "Why does this patch have..." would be better.
Done


http://gerrit.cloudera.org:8080/#/c/20677/3//COMMIT_MSG@80
PS3, Line 80: smaller problem
:then inefficient data files
> Why is it a smaller problem? Is it because we expect that there is more dat
Yes, I was thinking about having as few data files as possible and having the 
data in a good order, so you get good encoding and page filtering capabilities.
I don't think the other way around, i.e. as few delete files as possible but 
more data files than necessary, would be beneficial.


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/iceberg-delete-sink.cc
File be/src/exec/iceberg-delete-sink.cc:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/iceberg-delete-sink.cc@117
PS3, Line 117: .size(), 2);
> Not modified in this change, but does this variable do anything?
Removed.


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/iceberg-delete-sink.cc@137
PS3, Line 137: "If thi
> Does 'dml_exec_state_' need to be the same as 'state->dml_exec_state()', wh
No, dml_exec_state_ is the delete sink's own object. When we are doing the 
UPDATE and having two sinks (insert sink, delete sink), it's not possible to 
use the same dml_exec_state object simultaneously without violating some 
preconditions. So IcebergDeleteSink now has its own which is merged into 
state->dml_exec_state() in Close().
Added a comment about this to make it clear.


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.h
File be/src/exec/multi-table-sink.h:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.h@69
PS3, Line 69:   /// END: Methods above just delegate calls to the child sinks 
in 'table_sinks_'.
> I'd consider writing the explanatory comment of BEGIN also here ("Following
Done


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.cc
File be/src/exec/multi-table-sink.cc:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.cc@52
PS3, Line 52:
> Instead of a C-style cast we could use static_cast, maybe with a DCHECK con
Since this is not hot code I'm using dynamic_cast and DCHECK_NOTNULL.


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/multi-table-sink.cc@94
PS3, Line 94:   DataSink::Close(state);
> Shouldn't DataSink::Close() already set 'closed_' to true? We could also re
Done


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/table-sink-base.cc
File be/src/exec/table-sink-base.cc:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/table-sink-base.cc@419
PS3, Line 419:   if (dml_exec_state == nullptr) dml_exec_state = 
state->dml_exec_state();
> When is it possible that 'dml_exec_state' is not null and is not the same a
By default we are using state->dml_exec_state(), unless the caller provides an 
explicit dml_exec_state.
Currently the Iceberg Delete Sink provides explicit dml exec state because it 
has its own. But I don't want to make assumptions here like 'is_delete' <==> 
'dml_exec_state != nullptr', because they are independent parameters, the 
connection between them might be just temporary. In fact, in part 3 the 
IcebergDeleteSink might lose its own dml_exec_state as UPDATEs will use a 
different delete sink.


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/exec/table-sink-base.cc@430
PS3, Line 430:
> If 'is_delete' is true, shouldn't this always be true also? We could add a
Done


http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/runtime/dml-exec-state.cc
File be/src/runtime/dml-exec-state.cc:

http://gerrit.cloudera.org:8080/#/c/20677/3/be/src/runtime/dml-exec-state.cc@483
PS3, Line 483: et_n
> What does ASDF mean?
ASDF does not mean anything, it's just easy to type and search for in log 
messages during debugging. Removed.


http://gerrit.cloudera.org:8080/#/c/20677/3

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 4:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/14566/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 01 Dec 2023 15:16:43 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/14567/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 01 Dec 2023 15:23:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 5: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/9977/


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 01 Dec 2023 19:22:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9979/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 04 Dec 2023 11:58:54 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-04 Thread Zoltan Borok-Nagy (Code Review)
Hello Andrew Sherman, Tamas Mate, Daniel Becker, Gabor Kaszab, Noemi 
Pap-Takacs, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20677

to look at the new patch set (#7).

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..

IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

This patch adds limited UPDATE support for Iceberg tables. The
limitations mean users cannot update Iceberg tables if any of
the following is true:
 * UPDATE value of partitioning column
 * UPDATE table that went through partition evolution
 * Table has SORT BY properties

The above limitations will be resolved by part 3. The usual limitations
like writing non-Parquet files, using copy-on-write, modifying V1 tables
are out of scope of IMPALA-12313.

This patch implements UPDATEs with the merge-on-read technique. This
means the UPDATE statement writes both data files and delete files.
Data files contain the updated records, delete files contain the
position delete records of the old data records that have been
touched.

To achieve the above this patch introduces a new sink: MultiDataSink.
We can configure multiple TableSinks for a single MultiDataSink object.
During execution, the row batches sent to the MultiDataSink will be
forwarded to all the TableSinks that have been registered.

The UPDATE statement for an Iceberg table creates a source select
statement with all table columns and virtual columns INPUT__FILE__NAME
and FILE__POSITION. E.g. imagine we have a table 'tbl' with schema
(i int, s string, k int), and we update the table with:

  UPDATE tbl SET k = 5 WHERE i % 100 = 11;

 The generated source statement will be ==>

  SELECT i, s, 5, INPUT__FILE__NAME, FILE__POSITION
  FROM tbl WHERE i % 100 = 11;

Then we create two table sinks that refer to expressions from the above
source statement:

  Insert sink (i, s, 5)
  Delete sink (INPUT__FILE__NAME, FILE__POSITION)

The tuples in the rowbatch of MultiDataSink contain slots for all the
above expressions (i, s, 5, INPUT__FILE__NAME, FILE__POSITION).
MultiDataSink forwards each row batch to each registered TableSink.
They will pick their relevant expressions from the tuple and write
data/delete files. The tuples are sorted by INPUTE__FILE__NAME and
FILE__POSITION because we need to write the delete records in this
order.

For partitioned tables we need to shuffle and sort the input tuples.
In this case we also add virtual columns "PARTITION__SPEC__ID" and
"ICEBERG__PARTITION__SERIALIZED" to the source statement and shuffle
and sort the rows based on them.

Data files and delete files are now separated in the DmlExecState, so
at the end of the operation we'll have two sets of files. We use these
two sets to create a new Iceberg snapshot.

Why does this patch has the limitations?
 - Because we are shuffling and sorting rows based on the delete
   records and their partitions. This means that the new data files
   might not get written in an efficient way, e.g. there will be
   too many of them, or we will need to keep too many open file
   handles during writing.
   Also, if the table has SORT BY properties, we cannot respect
   it as the input rows are ordered in a way to favor the position
   deletes.
   Patch 3 will introduce a buffering writer for position delete
   files. This means we will shuffle and sort records based on
   the data records' partitions and SORT BY properties while
   delete records get buffered and written out at the end (sorted
   by file_path and position). In some edge cases the delete records
   might not get written efficiently, but it is a smaller problem
   then inefficient data files.

Testing:
 * negative tests
 * planner tests
 * update all supported data types
 * partitioned tables
 * Impala/Hive interop tests
 * authz tests
 * concurrent tests

Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
---
M be/src/exec/CMakeLists.txt
M be/src/exec/data-sink.cc
M be/src/exec/iceberg-delete-sink.cc
M be/src/exec/iceberg-delete-sink.h
A be/src/exec/multi-table-sink.cc
A be/src/exec/multi-table-sink.h
M be/src/exec/table-sink-base.cc
M be/src/exec/table-sink-base.h
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/service/client-request-state.cc
M common/protobuf/control_service.proto
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/DmlStatementBase.java
M fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java
M fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
A fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/ja

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/14580/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 04 Dec 2023 12:25:30 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 7: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 04 Dec 2023 16:39:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-04 Thread Daniel Becker (Code Review)
Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 7:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG@49
PS7, Line 49: The tuples in the rowbatch of MultiDataSink contain slots for all 
the
Idea for the future: in case of a data and a delete sink, couldn't we only send 
the relevant fields to each sink, i.e. (i, s, 5) to the data sink and 
(INPUT__FILE__NAME, FILE__POSITION) to the delete sink? Or are we just passing 
pointers to them without copying the data?


http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG@66
PS7, Line 66: has
Nit: have.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc
File be/src/exec/iceberg-delete-sink.cc:

http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@120
PS7, Line 120: auto
I think the concrete type (ScalarExpr*) would be more readable. Applies to the 
next line too.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@125
PS7, Line 125:   auto filepath_eval = output_expr_evals_[0];
I think the concrete type (ScalarExprEvaluator*) would be more readable. 
Applies to the next line too.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@131
PS7, Line 131: val
Why do we take the 'val' field? It is an int64_t which is then implicitly 
converted (back) to BigIntVal. Even if the original BigIntVal was NULL, the 
int64_t will be converted to a non-NULL BigIntVal.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@133
PS7, Line 133: string filepath(reinterpret_cast(filepath_sv.ptr), 
filepath_sv.len);
Couldn't we store the StringVal instance in 'prev_file_path_' instead of an 
std::string? The buffer should be valid during this function, and then we 
wouldn't have to copy the string.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@137
PS7, Line 137: UPDATE statement
We could write "UPDATE statement with a JOIN".


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@140
PS7, Line 140: prev_file_path_ = filepath;
'prev_file_path_' and 'prev_position_' are only used in this function, they 
could be local variables. Or are you concerned about the re-allocation between 
row batches?


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/table-sink-base.cc
File be/src/exec/table-sink-base.cc:

http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/table-sink-base.cc@430
PS7, Line 430:   dml_exec_state->AddCreatedDeleteFile(*partition,
Optional: we could still add DCHECK(IsIceberg()) here.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/runtime/dml-exec-state.h
File be/src/runtime/dml-exec-state.h:

http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/runtime/dml-exec-state.h@82
PS7, Line 82:   /// 'insert_stats' contains stats for the Iceberg delete file.
Could mention that it has to be Iceberg.


http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
File fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java:

http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java@59
PS7, Line 59:
Nit: unneeded spaces.


http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java@74
PS7, Line 74:   public List getSortExprs() {
Could add @Override.


http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java
File fe/src/main/java/org/apache/impala/planner/MultiDataSink.java:

http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@132
PS7, Line 132: MAX_FS_WRITERS
I think we should also mention that 'maxNumberOfMultiSinks_' is set to 
'MAX_FS_WRITERS'.


http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test:

http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@176
PS3, Line 176: update ice_alltypes set bigint_col = 33, int_col = 3, string_col 
= 'three';
> It wasn't intentional, but I'm thinking about keeping it anyway, especially
Great, this query being duplicate was the thing that gave me the idea for the 
optimisation :)


http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-06 Thread Tamas Mate (Code Review)
Tamas Mate has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 7:

(1 comment)

Great change Zoltan!
Started to review and tested it a bit.

The MultiTableSink counters in the profile seem to be off, the counters are not 
updated based on the sum of "child" sinks.

http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG@75
PS7, Line 75: Patch
nit: Part



--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 06 Dec 2023 14:16:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-06 Thread Zoltan Borok-Nagy (Code Review)
Hello Andrew Sherman, Tamas Mate, Daniel Becker, Gabor Kaszab, Noemi 
Pap-Takacs, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20677

to look at the new patch set (#8).

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..

IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

This patch adds limited UPDATE support for Iceberg tables. The
limitations mean users cannot update Iceberg tables if any of
the following is true:
 * UPDATE value of partitioning column
 * UPDATE table that went through partition evolution
 * Table has SORT BY properties

The above limitations will be resolved by part 3. The usual limitations
like writing non-Parquet files, using copy-on-write, modifying V1 tables
are out of scope of IMPALA-12313.

This patch implements UPDATEs with the merge-on-read technique. This
means the UPDATE statement writes both data files and delete files.
Data files contain the updated records, delete files contain the
position delete records of the old data records that have been
touched.

To achieve the above this patch introduces a new sink: MultiDataSink.
We can configure multiple TableSinks for a single MultiDataSink object.
During execution, the row batches sent to the MultiDataSink will be
forwarded to all the TableSinks that have been registered.

The UPDATE statement for an Iceberg table creates a source select
statement with all table columns and virtual columns INPUT__FILE__NAME
and FILE__POSITION. E.g. imagine we have a table 'tbl' with schema
(i int, s string, k int), and we update the table with:

  UPDATE tbl SET k = 5 WHERE i % 100 = 11;

 The generated source statement will be ==>

  SELECT i, s, 5, INPUT__FILE__NAME, FILE__POSITION
  FROM tbl WHERE i % 100 = 11;

Then we create two table sinks that refer to expressions from the above
source statement:

  Insert sink (i, s, 5)
  Delete sink (INPUT__FILE__NAME, FILE__POSITION)

The tuples in the rowbatch of MultiDataSink contain slots for all the
above expressions (i, s, 5, INPUT__FILE__NAME, FILE__POSITION).
MultiDataSink forwards each row batch to each registered TableSink.
They will pick their relevant expressions from the tuple and write
data/delete files. The tuples are sorted by INPUTE__FILE__NAME and
FILE__POSITION because we need to write the delete records in this
order.

For partitioned tables we need to shuffle and sort the input tuples.
In this case we also add virtual columns "PARTITION__SPEC__ID" and
"ICEBERG__PARTITION__SERIALIZED" to the source statement and shuffle
and sort the rows based on them.

Data files and delete files are now separated in the DmlExecState, so
at the end of the operation we'll have two sets of files. We use these
two sets to create a new Iceberg snapshot.

Why does this patch have the limitations?
 - Because we are shuffling and sorting rows based on the delete
   records and their partitions. This means that the new data files
   might not get written in an efficient way, e.g. there will be
   too many of them, or we will need to keep too many open file
   handles during writing.
   Also, if the table has SORT BY properties, we cannot respect
   it as the input rows are ordered in a way to favor the position
   deletes.
   Patch 3 will introduce a buffering writer for position delete
   files. This means we will shuffle and sort records based on
   the data records' partitions and SORT BY properties while
   delete records get buffered and written out at the end (sorted
   by file_path and position). In some edge cases the delete records
   might not get written efficiently, but it is a smaller problem
   then inefficient data files.

Testing:
 * negative tests
 * planner tests
 * update all supported data types
 * partitioned tables
 * Impala/Hive interop tests
 * authz tests
 * concurrent tests

Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
---
M be/src/exec/CMakeLists.txt
M be/src/exec/data-sink.cc
M be/src/exec/iceberg-delete-sink.cc
M be/src/exec/iceberg-delete-sink.h
A be/src/exec/multi-table-sink.cc
A be/src/exec/multi-table-sink.h
M be/src/exec/table-sink-base.cc
M be/src/exec/table-sink-base.h
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/service/client-request-state.cc
M common/protobuf/control_service.proto
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/DmlStatementBase.java
M fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java
M fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
A fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/j

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-06 Thread Zoltan Borok-Nagy (Code Review)
Hello Andrew Sherman, Tamas Mate, Daniel Becker, Gabor Kaszab, Noemi 
Pap-Takacs, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20677

to look at the new patch set (#9).

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..

IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

This patch adds limited UPDATE support for Iceberg tables. The
limitations mean users cannot update Iceberg tables if any of
the following is true:
 * UPDATE value of partitioning column
 * UPDATE table that went through partition evolution
 * Table has SORT BY properties

The above limitations will be resolved by part 3. The usual limitations
like writing non-Parquet files, using copy-on-write, modifying V1 tables
are out of scope of IMPALA-12313.

This patch implements UPDATEs with the merge-on-read technique. This
means the UPDATE statement writes both data files and delete files.
Data files contain the updated records, delete files contain the
position delete records of the old data records that have been
touched.

To achieve the above this patch introduces a new sink: MultiDataSink.
We can configure multiple TableSinks for a single MultiDataSink object.
During execution, the row batches sent to the MultiDataSink will be
forwarded to all the TableSinks that have been registered.

The UPDATE statement for an Iceberg table creates a source select
statement with all table columns and virtual columns INPUT__FILE__NAME
and FILE__POSITION. E.g. imagine we have a table 'tbl' with schema
(i int, s string, k int), and we update the table with:

  UPDATE tbl SET k = 5 WHERE i % 100 = 11;

 The generated source statement will be ==>

  SELECT i, s, 5, INPUT__FILE__NAME, FILE__POSITION
  FROM tbl WHERE i % 100 = 11;

Then we create two table sinks that refer to expressions from the above
source statement:

  Insert sink (i, s, 5)
  Delete sink (INPUT__FILE__NAME, FILE__POSITION)

The tuples in the rowbatch of MultiDataSink contain slots for all the
above expressions (i, s, 5, INPUT__FILE__NAME, FILE__POSITION).
MultiDataSink forwards each row batch to each registered TableSink.
They will pick their relevant expressions from the tuple and write
data/delete files. The tuples are sorted by INPUTE__FILE__NAME and
FILE__POSITION because we need to write the delete records in this
order.

For partitioned tables we need to shuffle and sort the input tuples.
In this case we also add virtual columns "PARTITION__SPEC__ID" and
"ICEBERG__PARTITION__SERIALIZED" to the source statement and shuffle
and sort the rows based on them.

Data files and delete files are now separated in the DmlExecState, so
at the end of the operation we'll have two sets of files. We use these
two sets to create a new Iceberg snapshot.

Why does this patch have the limitations?
 - Because we are shuffling and sorting rows based on the delete
   records and their partitions. This means that the new data files
   might not get written in an efficient way, e.g. there will be
   too many of them, or we will need to keep too many open file
   handles during writing.
   Also, if the table has SORT BY properties, we cannot respect
   it as the input rows are ordered in a way to favor the position
   deletes.
   Part 3 will introduce a buffering writer for position delete
   files. This means we will shuffle and sort records based on
   the data records' partitions and SORT BY properties while
   delete records get buffered and written out at the end (sorted
   by file_path and position). In some edge cases the delete records
   might not get written efficiently, but it is a smaller problem
   then inefficient data files.

Testing:
 * negative tests
 * planner tests
 * update all supported data types
 * partitioned tables
 * Impala/Hive interop tests
 * authz tests
 * concurrent tests

Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
---
M be/src/exec/CMakeLists.txt
M be/src/exec/data-sink.cc
M be/src/exec/iceberg-delete-sink.cc
M be/src/exec/iceberg-delete-sink.h
A be/src/exec/multi-table-sink.cc
A be/src/exec/multi-table-sink.h
M be/src/exec/table-sink-base.cc
M be/src/exec/table-sink-base.h
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/service/client-request-state.cc
M common/protobuf/control_service.proto
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/DmlStatementBase.java
M fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java
M fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
A fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/ja

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-06 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 9:

(15 comments)

Thanks for the comments.
I'll look at the profile counters. Maybe we can fix them in a separete patch.

http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG@49
PS7, Line 49: The tuples in the rowbatch of MultiDataSink contain slots for all 
the
> Idea for the future: in case of a data and a delete sink, couldn't we only
Inside fragments we are just passing pointers.
Across fragments we are serializing and shuffling the row batches. To send only 
the releveant fields, we would need a fork exchange operator.
In the design doc I had a note about this (see "Fork the transmission..."): 
https://docs.google.com/document/d/1GuRiJ3jjqkwINsSCKYaWwcfXHzbMrsd3WEMDOB11Xqw/edit#heading=h.5bmfhbmb4qdk


http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG@66
PS7, Line 66: hav
> Nit: have.
Done


http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG@75
PS7, Line 75: Part
> nit: Part
Done


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc
File be/src/exec/iceberg-delete-sink.cc:

http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@120
PS7, Line 120: Scal
> I think the concrete type (ScalarExpr*) would be more readable. Applies to
Done


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@125
PS7, Line 125:   ScalarExprEvaluator* filepath_eval = output_expr_evals_[0];
> I think the concrete type (ScalarExprEvaluator*) would be more readable. Ap
Done


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@131
PS7, Line 131:
> Why do we take the 'val' field? It is an int64_t which is then implicitly c
Done


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@133
PS7, Line 133: string filepath(reinterpret_cast(filepath_sv.ptr), 
filepath_sv.len);
> Couldn't we store the StringVal instance in 'prev_file_path_' instead of an
We need prev_file_path_ to be valid across invocations, e.g. when there is a 
duplicate between row batches.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@137
PS7, Line 137: UPDATE statement
> We could write "UPDATE statement with a JOIN".
Done


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@140
PS7, Line 140: prev_file_path_ = filepath;
> 'prev_file_path_' and 'prev_position_' are only used in this function, they
We need them to keep their states across invocations, e.g. when the duplicates 
are at row batch boundaries.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/table-sink-base.cc
File be/src/exec/table-sink-base.cc:

http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/table-sink-base.cc@430
PS7, Line 430:   DCHECK(IsIceberg());
> Optional: we could still add DCHECK(IsIceberg()) here.
Done


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/runtime/dml-exec-state.h
File be/src/runtime/dml-exec-state.h:

http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/runtime/dml-exec-state.h@82
PS7, Line 82:   /// can only be called for Iceberg tables.
> Could mention that it has to be Iceberg.
Done


http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
File fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java:

http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java@59
PS7, Line 59: //
> Nit: unneeded spaces.
Done


http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java@74
PS7, Line 74:   @Override
> Could add @Override.
Done


http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java
File fe/src/main/java/org/apache/impala/planner/MultiDataSink.java:

http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@132
PS7, Line 132: MAX_FS_WRITERS
> I think we should also mention that 'maxNumberOfMultiSinks_' is set to 'MAX
I added a comment to the variable.


http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test:

http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@200
PS3, Line 200: insert into ref_table values (0, 111, 'IMPALA', '2023-11-07'), 
(3, 222, 'ICEBER

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 8:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/14598/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 06 Dec 2023 14:44:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-06 Thread Daniel Becker (Code Review)
Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 9: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 06 Dec 2023 15:23:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-07 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 9:

(4 comments)

I went through this patch mostly for my own education. Left some minor 
questions or comments, nothing serious. Great work Zoli!

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java@54
PS9, Line 54:   abstract void substituteResultExprs(ExprSubstitutionMap smap, 
Analyzer analyzer);
nit: could you add a comment for this function


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/util/IcebergUtil.java
File fe/src/main/java/org/apache/impala/util/IcebergUtil.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/util/IcebergUtil.java@1214
PS9, Line 1214:   private static Expr 
getIcebergPartitionTransformExpr(IcebergPartitionField partField,
Peter's DROP PARTITION patch added some new functionality for converting 
strings into partition values (?) if I'm not mistaken. Just mentioning it here 
to see if any of those could be re-used.


http://gerrit.cloudera.org:8080/#/c/20677/9/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test
File 
testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test:

http://gerrit.cloudera.org:8080/#/c/20677/9/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test@753
PS9, Line 753: AuthorizationException: User '$USER' does not have privileges to 
access: functional_parquet.iceberg_v2_delete_positional
Isn't the error msg misleading. It says the user doesn't have permissions on 
the table, but in fact the issue is that there is column masking on the table.

This use case anyway made me think: wouldn't it make sense to allow the user to 
update these tables if they don't change the masked columns? I can understand 
that from implementation pow this could be difficult.


http://gerrit.cloudera.org:8080/#/c/20677/9/testdata/workloads/functional-query/queries/QueryTest/ranger_row_filtering.test
File 
testdata/workloads/functional-query/queries/QueryTest/ranger_row_filtering.test:

http://gerrit.cloudera.org:8080/#/c/20677/9/testdata/workloads/functional-query/queries/QueryTest/ranger_row_filtering.test@647
PS9, Line 647: masked tables
nit: I guess these are not masked tables, but tables with row filtering.



--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 07 Dec 2023 13:36:52 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-07 Thread Tamas Mate (Code Review)
Tamas Mate has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 9:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/exec/iceberg-delete-sink.h
File be/src/exec/iceberg-delete-sink.h:

http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/exec/iceberg-delete-sink.h@81
PS9, Line 81:   /// Verifies that the row batch does not contain duplicated 
rows.
Could you add an explanation of intent, it is not clear for me why this is 
needed.


http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/exec/iceberg-delete-sink.h@82
PS9, Line 82: VerifyRowBatch
Nit: For readability could you rename to something like: 
VerifyRowsNotDuplicated/VerifyRowDistinctiveness


http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/exec/multi-table-sink.cc
File be/src/exec/multi-table-sink.cc:

http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/exec/multi-table-sink.cc@52
PS9, Line 52: tsinkBase
nit: tsink_base


http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/runtime/dml-exec-state.h
File be/src/runtime/dml-exec-state.h:

http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/runtime/dml-exec-state.h@72
PS9, Line 72:   void UpdatePartition(const std::string& partition_name,
:   int64_t num_rows_delta, const DmlStatsPB* insert_stats,
:   bool is_delete = false);
nit: should fit into 2 lines


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
File fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java@120
PS9, Line 120: id
nit: ++nextTableId_ and return nextTableId_


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java@87
PS9, Line 87:   public List getDeletePartitionExprs(Analyzer analyzer)
:   throws AnalysisException {
nit: should fit into 1 line


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java@94
PS9, Line 94:   public List getDeleteResultExprs(Analyzer analyzer)
:   throws AnalysisException {
nit: should fit into 1 line


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/planner/Planner.java@185
PS9, Line 185:
nit: indentation



--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 07 Dec 2023 17:55:35 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-08 Thread Tamas Mate (Code Review)
Tamas Mate has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 9:

(1 comment)

Looks good to me!
I can plus two after these comments.

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
File fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java@63
PS9, Line 63: // Currently only Kudu tables are supported.
Not anymore :)



--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 08 Dec 2023 11:32:54 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-08 Thread Noemi Pap-Takacs (Code Review)
Noemi Pap-Takacs has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java
File fe/src/main/java/org/apache/impala/planner/MultiDataSink.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@137
PS9, Line 137:   // If there are more nodes than instances where the 
fragment was initially
This comment is a bit hard to understand. It seems to me that the two parts of 
the sentence are about two different cases: 1. more and 2. less nodes than 
instances.

Maybe it was supposed to mean the following:
If there are less nodes than instances where the fragment was initially planned 
to run on, then the instances will be distributed evenly across them.

(If there are more nodes, then I guess some nodes get exactly one instance and 
some get none, so the number of instances limit the number of nodes that run 
the fragment.)

Do I not understand it correctly? It would be nice to add a clearer explanation.


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@138
PS9, Line 138: then
nit: on



--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 08 Dec 2023 16:24:30 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-08 Thread Zoltan Borok-Nagy (Code Review)
Hello Andrew Sherman, Tamas Mate, Daniel Becker, Gabor Kaszab, Noemi 
Pap-Takacs, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/20677

to look at the new patch set (#10).

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..

IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

This patch adds limited UPDATE support for Iceberg tables. The
limitations mean users cannot update Iceberg tables if any of
the following is true:
 * UPDATE value of partitioning column
 * UPDATE table that went through partition evolution
 * Table has SORT BY properties

The above limitations will be resolved by part 3. The usual limitations
like writing non-Parquet files, using copy-on-write, modifying V1 tables
are out of scope of IMPALA-12313.

This patch implements UPDATEs with the merge-on-read technique. This
means the UPDATE statement writes both data files and delete files.
Data files contain the updated records, delete files contain the
position delete records of the old data records that have been
touched.

To achieve the above this patch introduces a new sink: MultiDataSink.
We can configure multiple TableSinks for a single MultiDataSink object.
During execution, the row batches sent to the MultiDataSink will be
forwarded to all the TableSinks that have been registered.

The UPDATE statement for an Iceberg table creates a source select
statement with all table columns and virtual columns INPUT__FILE__NAME
and FILE__POSITION. E.g. imagine we have a table 'tbl' with schema
(i int, s string, k int), and we update the table with:

  UPDATE tbl SET k = 5 WHERE i % 100 = 11;

 The generated source statement will be ==>

  SELECT i, s, 5, INPUT__FILE__NAME, FILE__POSITION
  FROM tbl WHERE i % 100 = 11;

Then we create two table sinks that refer to expressions from the above
source statement:

  Insert sink (i, s, 5)
  Delete sink (INPUT__FILE__NAME, FILE__POSITION)

The tuples in the rowbatch of MultiDataSink contain slots for all the
above expressions (i, s, 5, INPUT__FILE__NAME, FILE__POSITION).
MultiDataSink forwards each row batch to each registered TableSink.
They will pick their relevant expressions from the tuple and write
data/delete files. The tuples are sorted by INPUTE__FILE__NAME and
FILE__POSITION because we need to write the delete records in this
order.

For partitioned tables we need to shuffle and sort the input tuples.
In this case we also add virtual columns "PARTITION__SPEC__ID" and
"ICEBERG__PARTITION__SERIALIZED" to the source statement and shuffle
and sort the rows based on them.

Data files and delete files are now separated in the DmlExecState, so
at the end of the operation we'll have two sets of files. We use these
two sets to create a new Iceberg snapshot.

Why does this patch have the limitations?
 - Because we are shuffling and sorting rows based on the delete
   records and their partitions. This means that the new data files
   might not get written in an efficient way, e.g. there will be
   too many of them, or we will need to keep too many open file
   handles during writing.
   Also, if the table has SORT BY properties, we cannot respect
   it as the input rows are ordered in a way to favor the position
   deletes.
   Part 3 will introduce a buffering writer for position delete
   files. This means we will shuffle and sort records based on
   the data records' partitions and SORT BY properties while
   delete records get buffered and written out at the end (sorted
   by file_path and position). In some edge cases the delete records
   might not get written efficiently, but it is a smaller problem
   then inefficient data files.

Testing:
 * negative tests
 * planner tests
 * update all supported data types
 * partitioned tables
 * Impala/Hive interop tests
 * authz tests
 * concurrent tests

Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
---
M be/src/exec/CMakeLists.txt
M be/src/exec/data-sink.cc
M be/src/exec/iceberg-delete-sink.cc
M be/src/exec/iceberg-delete-sink.h
A be/src/exec/multi-table-sink.cc
A be/src/exec/multi-table-sink.h
M be/src/exec/table-sink-base.cc
M be/src/exec/table-sink-base.h
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/service/client-request-state.cc
M common/protobuf/control_service.proto
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/DmlStatementBase.java
M fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java
M fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
A fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/j

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-08 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 10:

(15 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/exec/iceberg-delete-sink.h
File be/src/exec/iceberg-delete-sink.h:

http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/exec/iceberg-delete-sink.h@81
PS9, Line 81:   /// Verifies that the row batch does not contain duplicated 
rows.
> Could you add an explanation of intent, it is not clear for me why this is
Added explanation.


http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/exec/iceberg-delete-sink.h@82
PS9, Line 82: the context of
> Nit: For readability could you rename to something like: VerifyRowsNotDupli
Done


http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/exec/multi-table-sink.cc
File be/src/exec/multi-table-sink.cc:

http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/exec/multi-table-sink.cc@52
PS9, Line 52: tsink_bas
> nit: tsink_base
Done


http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/runtime/dml-exec-state.h
File be/src/runtime/dml-exec-state.h:

http://gerrit.cloudera.org:8080/#/c/20677/9/be/src/runtime/dml-exec-state.h@72
PS9, Line 72:   void UpdatePartition(const std::string& partition_name, int64_t 
num_rows_delta,
:   const DmlStatsPB* insert_stats, bool is_delete = false);
:
> nit: should fit into 2 lines
Done


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
File fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java@120
PS9, Line 120: id
> nit: ++nextTableId_ and return nextTableId_
We want to use the original value (before the increment) in the put() and 
return stmt.


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java@87
PS9, Line 87:   public List getDeletePartitionExprs(Analyzer analyzer) 
throws AnalysisException {
: if (!originalTargetTable_.is
> nit: should fit into 1 line
Done


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java@94
PS9, Line 94: return getSlotRefs(analyzer, Lists.newArrayList(
: "INPUT__FILE__NAME", "FI
> nit: should fit into 1 line
Done


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/ModifyImpl.java@54
PS9, Line 54:   /**
> nit: could you add a comment for this function
Sure, added comment.


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
File fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java@63
PS9, Line 63: // Currently Kudu and Iceberg tables are sup
> Not anymore :)
Done :)


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java
File fe/src/main/java/org/apache/impala/planner/MultiDataSink.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@137
PS9, Line 137:
> This comment is a bit hard to understand. It seems to me that the two parts
Sorry about the confusion. This was copy-pasted from HdfsTableSink, and 
unfortunately not even being used by the planner. Filed IMPALA-12587 to fix 
that, until that I think it's best to remove these unused methods to avoid 
further confusion.


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@138
PS9, Line 138:
> nit: on
Done


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/planner/Planner.java@185
PS9, Line 185:   root
> nit: indentation
Done


http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/util/IcebergUtil.java
File fe/src/main/java/org/apache/impala/util/IcebergUtil.java:

http://gerrit.cloudera.org:8080/#/c/20677/9/fe/src/main/java/org/apache/impala/util/IcebergUtil.java@1214
PS9, Line 1214:   private static Expr 
getIcebergPartitionTransformExpr(IcebergPartitionField partField,
> Peter's DROP PARTITION patch added some new fun

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-08 Thread Noemi Pap-Takacs (Code Review)
Noemi Pap-Takacs has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 10: Code-Review+1

Thanks, LGTM


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 08 Dec 2023 16:56:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/14624/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 08 Dec 2023 17:08:41 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 11: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 08 Dec 2023 17:17:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-08 Thread Daniel Becker (Code Review)
Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 10: Code-Review+2

Fantastic change, thanks!


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 08 Dec 2023 17:16:49 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9987/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 08 Dec 2023 17:17:54 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 11: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/9987/


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 08 Dec 2023 21:52:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9989/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 08 Dec 2023 22:38:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..

IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

This patch adds limited UPDATE support for Iceberg tables. The
limitations mean users cannot update Iceberg tables if any of
the following is true:
 * UPDATE value of partitioning column
 * UPDATE table that went through partition evolution
 * Table has SORT BY properties

The above limitations will be resolved by part 3. The usual limitations
like writing non-Parquet files, using copy-on-write, modifying V1 tables
are out of scope of IMPALA-12313.

This patch implements UPDATEs with the merge-on-read technique. This
means the UPDATE statement writes both data files and delete files.
Data files contain the updated records, delete files contain the
position delete records of the old data records that have been
touched.

To achieve the above this patch introduces a new sink: MultiDataSink.
We can configure multiple TableSinks for a single MultiDataSink object.
During execution, the row batches sent to the MultiDataSink will be
forwarded to all the TableSinks that have been registered.

The UPDATE statement for an Iceberg table creates a source select
statement with all table columns and virtual columns INPUT__FILE__NAME
and FILE__POSITION. E.g. imagine we have a table 'tbl' with schema
(i int, s string, k int), and we update the table with:

  UPDATE tbl SET k = 5 WHERE i % 100 = 11;

 The generated source statement will be ==>

  SELECT i, s, 5, INPUT__FILE__NAME, FILE__POSITION
  FROM tbl WHERE i % 100 = 11;

Then we create two table sinks that refer to expressions from the above
source statement:

  Insert sink (i, s, 5)
  Delete sink (INPUT__FILE__NAME, FILE__POSITION)

The tuples in the rowbatch of MultiDataSink contain slots for all the
above expressions (i, s, 5, INPUT__FILE__NAME, FILE__POSITION).
MultiDataSink forwards each row batch to each registered TableSink.
They will pick their relevant expressions from the tuple and write
data/delete files. The tuples are sorted by INPUTE__FILE__NAME and
FILE__POSITION because we need to write the delete records in this
order.

For partitioned tables we need to shuffle and sort the input tuples.
In this case we also add virtual columns "PARTITION__SPEC__ID" and
"ICEBERG__PARTITION__SERIALIZED" to the source statement and shuffle
and sort the rows based on them.

Data files and delete files are now separated in the DmlExecState, so
at the end of the operation we'll have two sets of files. We use these
two sets to create a new Iceberg snapshot.

Why does this patch have the limitations?
 - Because we are shuffling and sorting rows based on the delete
   records and their partitions. This means that the new data files
   might not get written in an efficient way, e.g. there will be
   too many of them, or we will need to keep too many open file
   handles during writing.
   Also, if the table has SORT BY properties, we cannot respect
   it as the input rows are ordered in a way to favor the position
   deletes.
   Part 3 will introduce a buffering writer for position delete
   files. This means we will shuffle and sort records based on
   the data records' partitions and SORT BY properties while
   delete records get buffered and written out at the end (sorted
   by file_path and position). In some edge cases the delete records
   might not get written efficiently, but it is a smaller problem
   then inefficient data files.

Testing:
 * negative tests
 * planner tests
 * update all supported data types
 * partitioned tables
 * Impala/Hive interop tests
 * authz tests
 * concurrent tests

Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Reviewed-on: http://gerrit.cloudera.org:8080/20677
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/exec/CMakeLists.txt
M be/src/exec/data-sink.cc
M be/src/exec/iceberg-delete-sink.cc
M be/src/exec/iceberg-delete-sink.h
A be/src/exec/multi-table-sink.cc
A be/src/exec/multi-table-sink.h
M be/src/exec/table-sink-base.cc
M be/src/exec/table-sink-base.h
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/service/client-request-state.cc
M common/protobuf/control_service.proto
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
M fe/src/main/java/org/apache/impala/analysis/DmlStatementBase.java
M fe/src/main/java/org/apache/impala/analysis/IcebergDeleteImpl.java
M fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
A fe/src/main/java/org/apache/impala/analysis/IcebergUpdateImpl.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

2023-12-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )

Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
..


Patch Set 11: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Andrew Sherman 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Noemi Pap-Takacs 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 09 Dec 2023 03:04:04 +
Gerrit-HasComments: No