Gergely Fürnstáhl has posted comments on this change. ( http://gerrit.cloudera.org:8080/19850 )
Change subject: IMPALA-11619: Improve Iceberg V2 reads with a custom Iceberg Position Delete operator ...................................................................... Patch Set 20: (31 comments) http://gerrit.cloudera.org:8080/#/c/19850/20//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19850/20//COMMIT_MSG@9 PS20, Line 9: Implemented new exec node to optimize iceberg v2 read performance. > Please elaborate on how this new operator works. Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h File be/src/exec/iceberg-delete-builder.h: http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@50 PS20, Line 50: Since it is expected to only be : /// created and used by IcebergDeletePlanNode only, the DataSinkConfig::Init() : /// and DataSinkConfig::CreateSink() are not implemented for it. > I see this comment is also present at PhjBuilderConfig, but I'm not sure if Good catch, skimmed through git history, the comment is obsolate now for PHJ too http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@57 PS20, Line 57: a > nit: an Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@95 PS20, Line 95: /// > Please write a short summary about how the builder works. Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@100 PS20, Line 100: > nit: unnecessary empty line. Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@158 PS20, Line 158: spilling partitions > This builder doesn't partition its data and doesn't spill. Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@180 PS20, Line 180: const > Could be 'static const', or 'static constexpr' Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@180 PS20, Line 180: _ > If it will be a static member then we'll not need the last '_'. Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc File be/src/exec/iceberg-delete-builder.cc: http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@21 PS20, Line 21: #include <iomanip> > Unused? Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@156 PS20, Line 156: // if(delete_scan_node == nullptr){} > Code line commented out Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@175 PS20, Line 175: auto tuple_id_ = delete_scan_node->tnode_->hdfs_scan_node.tuple_id; : auto tuple_desc_ = runtime_state_->desc_tbl().GetTupleDescriptor(tuple_id_); : DCHECK(tuple_desc_->table_desc() != NULL); : auto hdfs_table_ = : static_cast<const HdfsTableDescriptor*>(tuple_desc_->table_desc()); : HdfsPartitionDescriptor* partition_desc = : hdfs_table_->GetPartition(split.partition_id()); > These should be the same for all the splits as there's a single HMS partiti Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@184 PS20, Line 184: hdfs_table_->IsIcebergTable() > This should be a DCHECK I think. Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@200 PS20, Line 200: DCHECK(retval.second == true); > We wn't hit this in RELEASE build, so the next statement would probably cra IIRC this is not even an issue, as we can read multiple splits of the same file. http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@289 PS20, Line 289: inline bool IcebergDeleteBuilder::AppendRow( : BufferedTupleStream* stream, TupleRow* row, Status* status) { : if (LIKELY(stream->AddRow(row, status))) return true; : return false; : } > Unused? Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@331 PS20, Line 331: auto it = deleted_rows_.find(*file_path); > nit: for readability, can we move this out of the if stmt? Done, maybe in a few years... :) http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h File be/src/exec/iceberg-delete-node.h: http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@61 PS20, Line 61: s > Isn't it only one hash table? Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@68 PS20, Line 68: > + 'that'? Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@81 PS20, Line 81: > Redundant empty line Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@88 PS20, Line 88: virtual > 'virtual' keywords are redundant here Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@106 PS20, Line 106: , > nit: .? Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@119 PS20, Line 119: > Unnecessary empty line Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@141 PS20, Line 141: the > multiple 'the' Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@212 PS20, Line 212: const > Can be 'static const' or 'constexpr'. Also the last '_' is not needed in th Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc File be/src/exec/iceberg-delete-node.cc: http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@25 PS20, Line 25: #include "codegen/llvm-codegen.h" > Unused Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@163 PS20, Line 163: } : v > missing empty line Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@166 PS20, Line 166: attached attached > multiple 'attached' Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@218 PS20, Line 218: // Putting SCOPED_TIMER in the IR version of ProcessProbeBatch() causes weird exception : // handling IR in the xcompiled function, so call it here instead. > This operator doesn't do codegen. Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@333 PS20, Line 333: namespace impala { > This line could be moved to L46 where we have 'using namespace impala' Done http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@347 PS20, Line 347: current_probe_pos_ = *probe_pos; > There could be a fast path when file_path == previous_file_path_ As we discussed. that would needs a full string comparison, which would be somewhat faster on the fast path (we spare a hashing and possible (but unlikely) collision resolutions), but a costly on the slow path. It still worth it if we assume there are bunch of deletes per file http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@354 PS20, Line 354: current_file_path_ != previous_file_path_ > Why do we need both 'current_file_path_' when we also have the parameter 'f If we use the objects from the unordered map, we can compare the pointers, file_path coming from the Tuple http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@380 PS20, Line 380: current_deleted_pos_row_id_ > To me it seems like that the code has the assumption that the file position Done -- To view, visit http://gerrit.cloudera.org:8080/19850 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I024a61573c83bda5584f243c879d9ff39dd2dcfa Gerrit-Change-Number: 19850 Gerrit-PatchSet: 20 Gerrit-Owner: Gergely Fürnstáhl <[email protected]> Gerrit-Reviewer: Anonymous Coward <[email protected]> Gerrit-Reviewer: Gergely Fürnstáhl <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Wed, 28 Jun 2023 13:21:57 +0000 Gerrit-HasComments: Yes
