Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/24071 )
Change subject: IMPALA-14755:(part 1) Implement Puffin Blob reader and File writer ...................................................................... Patch Set 4: (5 comments) Thanks for applying the changes! Looks good overall, though I think I'll need at least another iteration to carefully read it through. http://gerrit.cloudera.org:8080/#/c/24071/4/be/src/exec/output-partition.h File be/src/exec/output-partition.h: http://gerrit.cloudera.org:8080/#/c/24071/4/be/src/exec/output-partition.h@130 PS4, Line 130: nit: requires +4 indent http://gerrit.cloudera.org:8080/#/c/24071/4/be/src/exec/puffin/puffin-writer.cc File be/src/exec/puffin/puffin-writer.cc: http://gerrit.cloudera.org:8080/#/c/24071/4/be/src/exec/puffin/puffin-writer.cc@60 PS4, Line 60: {} Should we call Close() just in case? http://gerrit.cloudera.org:8080/#/c/24071/4/be/src/exec/puffin/puffin-writer.cc@137 PS4, Line 137: SCOPED_TIMER(parent_->hdfs_write_timer()); I think we should start this timer after we finished loading the old DVs. http://gerrit.cloudera.org:8080/#/c/24071/3/be/src/util/hash-util.h File be/src/util/hash-util.h: http://gerrit.cloudera.org:8080/#/c/24071/3/be/src/util/hash-util.h@338 PS3, Line 338: } else { : temp >>= 1; : } : } : return temp; : } : : t > It's originated from a stackoverflow post (https://stackoverflow.com/questi I think the current version (PS4) is OK, the namespace is annotated. http://gerrit.cloudera.org:8080/#/c/24071/3/common/thrift/CatalogObjects.thrift File common/thrift/CatalogObjects.thrift: http://gerrit.cloudera.org:8080/#/c/24071/3/common/thrift/CatalogObjects.thrift@679 PS3, Line 679: string > We have to store the referenced data file path information in the blob meta I think there're two possiblities: 1. Construct the "map<string, Types.TIcebergDeletionVector> deletion_vectors" for TIcebergDeleteSink on the fly with the help of the IcebergContentFileStore. This map is only needed for the duration of the DML query planning. 2. We already have the xxHash library available for the backend (though maybe we could update it?). So we can use XXH128 on the file path from the RowBatch, then lookup the DV in the map. Probably 2. is the better as it significantly reduces memory at all places. -- To view, visit http://gerrit.cloudera.org:8080/24071 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I068a071f9db907064ccec8568db5234863eb4587 Gerrit-Change-Number: 24071 Gerrit-PatchSet: 4 Gerrit-Owner: Peter Rozsa <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Peter Rozsa <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 12 Mar 2026 18:19:44 +0000 Gerrit-HasComments: Yes
