Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/24071 )

Change subject: IMPALA-14755:(part 1) Implement Puffin Blob reader and File 
writer
......................................................................


Patch Set 4:

(5 comments)

Thanks for applying the changes! Looks good overall, though I think I'll need 
at least one more iteration to read it through carefully.

http://gerrit.cloudera.org:8080/#/c/24071/4/be/src/exec/output-partition.h
File be/src/exec/output-partition.h:

http://gerrit.cloudera.org:8080/#/c/24071/4/be/src/exec/output-partition.h@130
PS4, Line 130:
nit: requires +4 indent


http://gerrit.cloudera.org:8080/#/c/24071/4/be/src/exec/puffin/puffin-writer.cc
File be/src/exec/puffin/puffin-writer.cc:

http://gerrit.cloudera.org:8080/#/c/24071/4/be/src/exec/puffin/puffin-writer.cc@60
PS4, Line 60: {}
Should we call Close() just in case?


http://gerrit.cloudera.org:8080/#/c/24071/4/be/src/exec/puffin/puffin-writer.cc@137
PS4, Line 137:   SCOPED_TIMER(parent_->hdfs_write_timer());
I think we should start this timer after we have finished loading the old DVs.


http://gerrit.cloudera.org:8080/#/c/24071/3/be/src/util/hash-util.h
File be/src/util/hash-util.h:

http://gerrit.cloudera.org:8080/#/c/24071/3/be/src/util/hash-util.h@338
PS3, Line 338:       } else {
             :         temp >>= 1;
             :       }
             :     }
             :     return temp;
             :   }
             :
             :   t
> It's originated from a stackoverflow post (https://stackoverflow.com/questi
I think the current version (PS4) is OK, the namespace is annotated.


http://gerrit.cloudera.org:8080/#/c/24071/3/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

http://gerrit.cloudera.org:8080/#/c/24071/3/common/thrift/CatalogObjects.thrift@679
PS3, Line 679: string
> We have to store the referenced data file path information in the blob meta
I think there are two possibilities:

1. Construct the "map<string, Types.TIcebergDeletionVector> deletion_vectors" 
for TIcebergDeleteSink on the fly with the help of the IcebergContentFileStore. 
This map is only needed for the duration of DML query planning.

2. We already have the xxHash library available in the backend (though maybe 
we could update it?). So we can use XXH128 on the file path from the RowBatch, 
then look up the DV in the map.

Option 2 is probably better, as it significantly reduces memory usage 
everywhere.



--
To view, visit http://gerrit.cloudera.org:8080/24071
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I068a071f9db907064ccec8568db5234863eb4587
Gerrit-Change-Number: 24071
Gerrit-PatchSet: 4
Gerrit-Owner: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 12 Mar 2026 18:19:44 +0000
Gerrit-HasComments: Yes
