Peter Rozsa has posted comments on this change. ( http://gerrit.cloudera.org:8080/24071 )

Change subject: IMPALA-14755:(part 1) Implement Puffin Blob reader and File writer
......................................................................


Patch Set 6:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/output-partition.h
File be/src/exec/output-partition.h:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/output-partition.h@132
PS5, Line 132:
:
> nit: Since you always create an object, can this be a normal field? I.e.:
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/blob.h
File be/src/exec/puffin/blob.h:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/blob.h@25
PS5, Line 25:
> nit: add empty line before namespace impala.
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.h
File be/src/exec/puffin/puffin-writer.h:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.h@107
PS5, Line 107:
> nit: we probably don't want to mention this implementation detail in the fu
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.h@119
PS5, Line 119:
> nit: this seems unnecessary
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc
File be/src/exec/puffin/puffin-writer.cc:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@107
PS5, Line 107: Status PuffinWriter::AppendRows(
:     RowBatch* batch, const std::vector<int32_t>& row_group_i
> The comment fits L111-113 better.
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@122
PS5, Line 122: // flush via AddElements when the file changes or the batch ends.
: std::vector<uint64_t> pending_positions;
: pending_positions.reserve(POSITIONS_BUFFER_INITIAL_CAPACITY);
:
: auto flush_pending = [&]() {
:   if (pending_positions.empty()) return;
:   last_bitmap_it_->second.AddElements(pending_positions);
:   output_->current_file_rows += pending_positions.size();
:   pending_positions.clear();
: };
> Do we know how efficient it is for large-scale deletes? I'd assume we can s
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@148
PS5, Line 148: // flush the accumulated positions for the outgoing file, then look up (or insert)
> line too long (92 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@148
PS5, Line 148: // flush the accumulated positions for the outgoing file, then look up (or insert)
: // the entry in the map and refresh the cache.
: if (last_bitmap_it_ == deletion_vectors_.end() ||
:     last_filepath_ != filepath_sv_view) {
> nit: this could go to a utility method.
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@149
PS5, Line 149: // the entry in the map and refresh the cache.
: if (last_bitmap_it_ == deletion_vectors_.end() ||
:     last_filepath_ != filepath_sv_view) {
> nit: should we swap high/low here and in IcebergUtil? It's just odd to see
Done, the already existing getFilePathHash in Hash128 used a reversed order, I adjusted it to be in line with the naming.


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/util/thash128-util.h
File be/src/util/thash128-util.h:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/util/thash128-util.h@20
PS5, Line 20: #include <string>
> Is it needed?
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/util/thash128-util.h@41
PS5, Line 41: / Equa
> nit: 'inline' keyword is redundant here, functions defined in struct/class
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/common/fbs/IcebergObjects.fbs
File common/fbs/IcebergObjects.fbs:

http://gerrit.cloudera.org:8080/#/c/24071/5/common/fbs/IcebergObjects.fbs@87
PS5, Line 87: referenced_data_file_hash_high: long;
: referenced_data_file_hash_low: long;
> When are these being used? If only in a later patch, could these be added l
Yes, this is used in part 2 to locate the referenced data file during deletion vector adding/removal. I think it's easier to add every proto change in this patch because it makes part 2 slightly smaller.


http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/CatalogObjects.thrift@679
PS5, Line 679: 10: optional map<THash128, Types.TIcebergDeletionVector> data_path_hash_to_dv
> Changing type of field 'data_path_hash_to_dv' from map<THash128,TIcebergDel
Ack


http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/DataSinks.thrift
File common/thrift/DataSinks.thrift:

http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/DataSinks.thrift@124
PS5, Line 124: 2: optional map<CatalogObjects.THash128, Types.TIcebergDeletionVector> deletion_vectors;
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/DataSinks.thrift@124
PS5, Line 124: d
> nit: redundant space
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/Types.thrift
File common/thrift/Types.thrift:

http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/Types.thrift@307
PS5, Line 307: // Number of rows deleted by
> Could you please add a comment why and when it is needed?
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java:

http://gerrit.cloudera.org:8080/#/c/24071/5/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@92
PS5, Line 92: "Puffin", "Puffi
> Could be "Puffin", "Puffin", "Puffin" instead, in case if it appears somewh
Done



--
To view, visit http://gerrit.cloudera.org:8080/24071
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I068a071f9db907064ccec8568db5234863eb4587
Gerrit-Change-Number: 24071
Gerrit-PatchSet: 6
Gerrit-Owner: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 19 Mar 2026 15:18:11 +0000
Gerrit-HasComments: Yes
