Peter Rozsa has posted comments on this change. ( http://gerrit.cloudera.org:8080/24071 )

Change subject: IMPALA-14755:(part 1) Implement Puffin Blob reader and File writer
......................................................................


Patch Set 6:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/output-partition.h
File be/src/exec/output-partition.h:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/output-partition.h@132
PS5, Line 132:
             :
> nit: Since you always create an object, can this be a normal field? I.e.:
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/blob.h
File be/src/exec/puffin/blob.h:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/blob.h@25
PS5, Line 25:
> nit: add empty line before namespace impala.
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.h
File be/src/exec/puffin/puffin-writer.h:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.h@107
PS5, Line 107:
> nit: we probably don't want to mention this implementation detail in the fu
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.h@119
PS5, Line 119:
> nit: this seems unnecessary
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc
File be/src/exec/puffin/puffin-writer.cc:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@107
PS5, Line 107: Status PuffinWriter::AppendRows(
             :     RowBatch* batch, const std::vector<int32_t>& row_group_i
> The comment fits L111-113 better.
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@122
PS5, Line 122:   // flush via AddElements when the file changes or the batch ends.
             :   std::vector<uint64_t> pending_positions;
             :   pending_positions.reserve(POSITIONS_BUFFER_INITIAL_CAPACITY);
             :
             :   auto flush_pending = [&]() {
             :     if (pending_positions.empty()) return;
             :     last_bitmap_it_->second.AddElements(pending_positions);
             :     output_->current_file_rows += pending_positions.size();
             :     pending_positions.clear();
             :   };
> Do we know how efficient it is for large-scale deletes? I'd assume we can s
Done
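
A minimal sketch of the buffering pattern in the snippet quoted above, assuming a simplified model: positions for the current file are collected in a vector and handed to the per-file structure in bulk when the file path changes or the batch ends. PositionSet, DeleteBuffer, and the accessor names are hypothetical stand-ins, not the patch's types; std::set replaces the bitmap, and only the AddElements name is taken from the quoted code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-file deletion vector. The quoted code
// calls AddElements() on a bitmap; a std::set models it here.
struct PositionSet {
  std::set<uint64_t> positions;
  void AddElements(const std::vector<uint64_t>& batch) {
    positions.insert(batch.begin(), batch.end());
  }
};

// Hypothetical helper: groups (file path, row position) pairs per file and
// inserts runs of positions for the same file in one bulk call.
class DeleteBuffer {
 public:
  void Append(const std::vector<std::pair<std::string, uint64_t>>& rows) {
    std::vector<uint64_t> pending;
    auto flush = [&]() {
      if (pending.empty()) return;
      current_->second.AddElements(pending);
      ++bulk_inserts_;
      pending.clear();
    };
    for (const auto& row : rows) {
      if (current_ == vectors_.end() || current_->first != row.first) {
        // The file changed: flush the outgoing file's positions, then look
        // up (or insert) the entry for the new file and cache the iterator.
        flush();
        current_ = vectors_.insert({row.first, PositionSet{}}).first;
      }
      pending.push_back(row.second);
    }
    flush();  // The batch ended.
  }

  std::size_t FileCount() const { return vectors_.size(); }

  std::size_t DeletedCount(const std::string& path) const {
    auto it = vectors_.find(path);
    return it == vectors_.end() ? 0 : it->second.positions.size();
  }

  int BulkInserts() const { return bulk_inserts_; }

 private:
  std::map<std::string, PositionSet> vectors_;
  // std::map iterators stay valid across insertions, so the cached iterator
  // can be reused between calls to Append().
  std::map<std::string, PositionSet>::iterator current_ = vectors_.end();
  int bulk_inserts_ = 0;
};
```

In this sketch the number of bulk inserts equals the number of runs of consecutive rows that share a file path, so input grouped by file produces one map lookup and one bulk insert per file per batch.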


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@148
PS5, Line 148:     // flush the accumulated positions for the outgoing file, then look up (or insert)
> line too long (92 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@148
PS5, Line 148:     // flush the accumulated positions for the outgoing file, then look up (or insert)
             :     // the entry in the map and refresh the cache.
             :     if (last_bitmap_it_ == deletion_vectors_.end() ||
             :         last_filepath_ != filepath_sv_view) {
> nit: this could go to a utility method.
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/exec/puffin/puffin-writer.cc@149
PS5, Line 149:     // the entry in the map and refresh the cache.
             :     if (last_bitmap_it_ == deletion_vectors_.end() ||
             :         last_filepath_ != filepath_sv_view) {
> nit: should we swap high/low here and in IcebergUtil? It's just odd to see
Done. The existing getFilePathHash in Hash128 used a reversed order, so I 
adjusted it to be in line with the naming.


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/util/thash128-util.h
File be/src/util/thash128-util.h:

http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/util/thash128-util.h@20
PS5, Line 20: #include <string>
> Is it needed?
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/be/src/util/thash128-util.h@41
PS5, Line 41: / Equa
> nit: 'inline' keyword is redundant here, functions defined in struct/class
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/common/fbs/IcebergObjects.fbs
File common/fbs/IcebergObjects.fbs:

http://gerrit.cloudera.org:8080/#/c/24071/5/common/fbs/IcebergObjects.fbs@87
PS5, Line 87:   referenced_data_file_hash_high: long;
            :   referenced_data_file_hash_low: long;
> When are these being used? If only in a later patch, could these be added l
Yes, these fields are used in part 2 to locate the referenced data file when 
adding or removing deletion vectors. I think it's easier to add every proto 
change in this patch because it makes part 2 slightly smaller.


http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/CatalogObjects.thrift@679
PS5, Line 679:   10: optional map<THash128, Types.TIcebergDeletionVector> data_path_hash_to_dv
> Changing type of field 'data_path_hash_to_dv' from map<THash128,TIcebergDel
Ack


http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/DataSinks.thrift
File common/thrift/DataSinks.thrift:

http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/DataSinks.thrift@124
PS5, Line 124:   2: optional map<CatalogObjects.THash128, Types.TIcebergDeletionVector> deletion_vectors;
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/DataSinks.thrift@124
PS5, Line 124: d
> nit: redundant space
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/Types.thrift
File common/thrift/Types.thrift:

http://gerrit.cloudera.org:8080/#/c/24071/5/common/thrift/Types.thrift@307
PS5, Line 307: // Number of rows deleted by
> Could you please add a comment why and when it is needed?
Done


http://gerrit.cloudera.org:8080/#/c/24071/5/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java:

http://gerrit.cloudera.org:8080/#/c/24071/5/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@92
PS5, Line 92: "Puffin", "Puffi
> Could be "Puffin", "Puffin", "Puffin" instead, in case if it appears somewh
Done



--
To view, visit http://gerrit.cloudera.org:8080/24071
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I068a071f9db907064ccec8568db5234863eb4587
Gerrit-Change-Number: 24071
Gerrit-PatchSet: 6
Gerrit-Owner: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 19 Mar 2026 15:18:11 +0000
Gerrit-HasComments: Yes
