[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

Daniel Becker (Code Review) Mon, 04 Dec 2023 08:57:37 -0800

Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20677 )


Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables
......................................................................


Patch Set 7:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG@49
PS7, Line 49: The tuples in the rowbatch of MultiDataSink contain slots for all 
the
Idea for the future: in case of a data and a delete sink, couldn't we only send 
the relevant fields to each sink, i.e. (i, s, 5) to the data sink and 
(INPUT__FILE__NAME, FILE__POSITION) to the delete sink? Or are we just passing 
pointers to them without copying the data?


http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG@66
PS7, Line 66: has
Nit: have.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc
File be/src/exec/iceberg-delete-sink.cc:

http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@120
PS7, Line 120: auto
I think the concrete type (ScalarExpr*) would be more readable. Applies to the 
next line too.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@125
PS7, Line 125:   auto filepath_eval = output_expr_evals_[0];
I think the concrete type (ScalarExprEvaluator*) would be more readable. 
Applies to the next line too.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@131
PS7, Line 131: val
Why do we take the 'val' field? It is an int64_t which is then implicitly 
converted (back) to BigIntVal. Even if the original BigIntVal was NULL, the 
int64_t will be converted to a non-NULL BigIntVal.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@133
PS7, Line 133:     string filepath(reinterpret_cast<char*>(filepath_sv.ptr), 
filepath_sv.len);
Couldn't we store the StringVal instance in 'prev_file_path_' instead of an 
std::string? The buffer should be valid during this function, and then we 
wouldn't have to copy the string.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@137
PS7, Line 137: UPDATE statement
We could write "UPDATE statement with a JOIN".


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@140
PS7, Line 140:     prev_file_path_ = filepath;
'prev_file_path_' and 'prev_position_' are only used in this function, they 
could be local variables. Or are you concerned about the re-allocation between 
row batches?


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/table-sink-base.cc
File be/src/exec/table-sink-base.cc:

http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/table-sink-base.cc@430
PS7, Line 430:       dml_exec_state->AddCreatedDeleteFile(*partition,
Optional: we could still add DCHECK(IsIceberg()) here.


http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/runtime/dml-exec-state.h
File be/src/runtime/dml-exec-state.h:

http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/runtime/dml-exec-state.h@82
PS7, Line 82:   /// 'insert_stats' contains stats for the Iceberg delete file.
Could mention that it has to be Iceberg.


http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java
File fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java:

http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java@59
PS7, Line 59:
Nit: unneeded spaces.


http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java
File fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java:

http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java@74
PS7, Line 74:   public List<Expr> getSortExprs() {
Could add @Override.


http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java
File fe/src/main/java/org/apache/impala/planner/MultiDataSink.java:

http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@132
PS7, Line 132: MAX_FS_WRITERS
I think we should also mention that 'maxNumberOfMultiSinks_' is set to 
'MAX_FS_WRITERS'.


http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test
File 
testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test:

http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@176
PS3, Line 176: update ice_alltypes set bigint_col = 33, int_col = 3, string_col 
= 'three';
> It wasn't intentional, but I'm thinking about keeping it anyway, especially
Great, this query being duplicate was the thing that gave me the idea for the 
optimisation :)


http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@200
PS3, Line 200: update ice_alltypes set bigint_col=bi, string_col=s, date_col=d 
from ice_alltypes, ref_table where int_col = i;
> Good catch. It is currently problematic with both Kudu/Iceberg. Kudu random
Optional: we could insert reminder TODOs about that for part 3.
Also, we could already introduce a test where the query would produce a wrong 
result after part 3 (if we didn't disable this case).


http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test
File 
testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test:

http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test@750
PS3, Line 750: Deletes
> I followed the above pattern (see INSERT/TRUNCATE/COMPUTE STATS). Changed t
Ok, I'm ok with staying with "update". In this case we could say "update" means 
any modification, not only an "update" statement, although it is a bit 
confusing. But if the other tests use it in this sense, we can decide to leave 
it as it was.


http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test
File 
testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test:

http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test@189
PS3, Line 189:
> I followed the above pattern, anyway, changed the new comments.
See comment on ranger_column_masking.test:750.


http://gerrit.cloudera.org:8080/#/c/20677/3/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/20677/3/tests/query_test/test_iceberg.py@1193
PS3, Line 1193:
> It might be shorter, but the current version is more readable IMO, and more
Ok.



--
To view, visit http://gerrit.cloudera.org:8080/20677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Gerrit-Change-Number: 20677
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Dec 2023 16:57:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables

Reply via email to