Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/20677 )
Change subject: IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables ...................................................................... Patch Set 7: (18 comments) http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG@49 PS7, Line 49: The tuples in the rowbatch of MultiDataSink contain slots for all the Idea for the future: in case of a data and a delete sink, couldn't we only send the relevant fields to each sink, i.e. (i, s, 5) to the data sink and (INPUT__FILE__NAME, FILE__POSITION) to the delete sink? Or are we just passing pointers to them without copying the data? http://gerrit.cloudera.org:8080/#/c/20677/7//COMMIT_MSG@66 PS7, Line 66: has Nit: have. http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc File be/src/exec/iceberg-delete-sink.cc: http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@120 PS7, Line 120: auto I think the concrete type (ScalarExpr*) would be more readable. Applies to the next line too. http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@125 PS7, Line 125: auto filepath_eval = output_expr_evals_[0]; I think the concrete type (ScalarExprEvaluator*) would be more readable. Applies to the next line too. http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@131 PS7, Line 131: val Why do we take the 'val' field? It is an int64_t which is then implicitly converted (back) to BigIntVal. Even if the original BigIntVal was NULL, the int64_t will be converted to a non-NULL BigIntVal. http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@133 PS7, Line 133: string filepath(reinterpret_cast<char*>(filepath_sv.ptr), filepath_sv.len); Couldn't we store the StringVal instance in 'prev_file_path_' instead of an std::string? The buffer should be valid during this function, and then we wouldn't have to copy the string. http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@137 PS7, Line 137: UPDATE statement We could write "UPDATE statement with a JOIN". http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/iceberg-delete-sink.cc@140 PS7, Line 140: prev_file_path_ = filepath; 'prev_file_path_' and 'prev_position_' are only used in this function, they could be local variables. Or are you concerned about the re-allocation between row batches? http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/table-sink-base.cc File be/src/exec/table-sink-base.cc: http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/exec/table-sink-base.cc@430 PS7, Line 430: dml_exec_state->AddCreatedDeleteFile(*partition, Optional: we could still add DCHECK(IsIceberg()) here. http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/runtime/dml-exec-state.h File be/src/runtime/dml-exec-state.h: http://gerrit.cloudera.org:8080/#/c/20677/7/be/src/runtime/dml-exec-state.h@82 PS7, Line 82: /// 'insert_stats' contains stats for the Iceberg delete file. Could mention that it has to be Iceberg. http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java File fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java: http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/DescriptorTable.java@59 PS7, Line 59: Nit: unneeded spaces. http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java File fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java: http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/analysis/IcebergModifyImpl.java@74 PS7, Line 74: public List<Expr> getSortExprs() { Could add @Override. http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java File fe/src/main/java/org/apache/impala/planner/MultiDataSink.java: http://gerrit.cloudera.org:8080/#/c/20677/7/fe/src/main/java/org/apache/impala/planner/MultiDataSink.java@132 PS7, Line 132: MAX_FS_WRITERS I think we should also mention that 'maxNumberOfMultiSinks_' is set to 'MAX_FS_WRITERS'. http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test: http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@176 PS3, Line 176: update ice_alltypes set bigint_col = 33, int_col = 3, string_col = 'three'; > It wasn't intentional, but I'm thinking about keeping it anyway, especially Great, this query being duplicate was the thing that gave me the idea for the optimisation :) http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/iceberg-update-basic.test@200 PS3, Line 200: update ice_alltypes set bigint_col=bi, string_col=s, date_col=d from ice_alltypes, ref_table where int_col = i; > Good catch. It is currently problematic with both Kudu/Iceberg. Kudu random Optional: we could insert reminder TODOs about that for part 3. Also, we could already introduce a test where the query would produce a wrong result after part 3 (if we didn't disable this case). http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test File testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test: http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test@750 PS3, Line 750: Deletes > I followed the above pattern (see INSERT/TRUNCATE/COMPUTE STATS). Changed t Ok, I'm ok with staying with "update". In this case we could say "update" means any modification, not only an "update" statement, although it is a bit confusing. But if the other tests use it in this sense, we can decide to leave it as it was. http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test File testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test: http://gerrit.cloudera.org:8080/#/c/20677/3/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking_and_row_filtering.test@189 PS3, Line 189: > I followed the above pattern, anyway, changed the new comments. See comment on ranger_column_masking.test:750. http://gerrit.cloudera.org:8080/#/c/20677/3/tests/query_test/test_iceberg.py File tests/query_test/test_iceberg.py: http://gerrit.cloudera.org:8080/#/c/20677/3/tests/query_test/test_iceberg.py@1193 PS3, Line 1193: > It might be shorter, but the current version is more readable IMO, and more Ok. -- To view, visit http://gerrit.cloudera.org:8080/20677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08 Gerrit-Change-Number: 20677 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Noemi Pap-Takacs <npaptak...@cloudera.com> Gerrit-Reviewer: Tamas Mate <tma...@apache.org> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Mon, 04 Dec 2023 16:57:25 +0000 Gerrit-HasComments: Yes