aviralgarg05 opened a new pull request, #15927:
URL: https://github.com/apache/iceberg/pull/15927

   Fixes #15924
   
   ## Summary
   
   This change fixes `RewriteTablePathUtil.rewriteDVFile` so DV Puffin files 
are rewritten in a streaming fashion instead of buffering every rewritten blob 
in memory first.
   
   The previous implementation collected all rewritten `Blob` instances into a 
list and wrote them only after the read loop finished, so peak memory grew with 
the combined size of every blob in the file. That is unnecessary for large 
deletion vector files. The new implementation rewrites each blob and writes it 
directly to the destination `PuffinWriter` as it is read.
   
   ## What changed
   
   - Reworked `rewriteDVFile` to open the `PuffinWriter` alongside the 
`PuffinReader`.
   - Removed the intermediate `List<Blob>` accumulation.
   - Preserved the existing `referenced-data-file` path rewrite behavior for DV 
blobs.
   - Added a regression test that:
     - creates a real Puffin DV file with multiple blobs,
     - rewrites it through `RewriteTablePathUtil`,
     - verifies the rewritten blob metadata,
     - verifies the blob payloads are preserved.
   
   ## Why this fixes the issue
   
   The DV rewrite path only needs to update blob metadata; it has no reason to 
materialize the entire file in memory. Writing each blob as soon as it is read 
keeps peak memory bounded by roughly a single blob instead of the full DV file 
contents.
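
The streaming pattern described above can be sketched in isolation. This is a minimal illustration, not the code in this PR: `SketchBlob`, `rewritePath`, and `rewriteStreaming` are hypothetical stand-ins, the `Iterator` plays the role of the reader, and the `Consumer` plays the role of the writer. It assumes the path rewrite is a simple prefix swap on the `referenced-data-file` property.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class StreamingRewriteSketch {

  // Hypothetical stand-in for a Puffin blob: metadata properties plus an opaque payload.
  static final class SketchBlob {
    final Map<String, String> properties;
    final ByteBuffer payload;

    SketchBlob(Map<String, String> properties, ByteBuffer payload) {
      this.properties = properties;
      this.payload = payload;
    }
  }

  static final String REFERENCED_DATA_FILE = "referenced-data-file";

  // Returns a new blob with the path prefix swapped (assumed rewrite rule).
  // The payload buffer is reused as-is; only metadata changes.
  static SketchBlob rewritePath(SketchBlob blob, String sourcePrefix, String targetPrefix) {
    Map<String, String> props = new HashMap<>(blob.properties);
    String path = props.get(REFERENCED_DATA_FILE);
    if (path != null && path.startsWith(sourcePrefix)) {
      props.put(REFERENCED_DATA_FILE, targetPrefix + path.substring(sourcePrefix.length()));
    }
    return new SketchBlob(props, blob.payload);
  }

  // Streaming rewrite: each blob is handed to the sink before the next one is
  // pulled from the source, so no intermediate list of blobs is built here.
  static int rewriteStreaming(
      Iterator<SketchBlob> source,
      Consumer<SketchBlob> sink,
      String sourcePrefix,
      String targetPrefix) {
    int written = 0;
    while (source.hasNext()) {
      sink.accept(rewritePath(source.next(), sourcePrefix, targetPrefix));
      written++;
    }
    return written;
  }

  public static void main(String[] args) {
    // Demo input only; a real reader would produce blobs lazily from the file.
    List<SketchBlob> input = new ArrayList<>();
    for (int i = 0; i < 3; i++) {
      Map<String, String> props = new HashMap<>();
      props.put(REFERENCED_DATA_FILE, "/source/data/file-" + i + ".parquet");
      ByteBuffer payload = ByteBuffer.wrap(("dv-" + i).getBytes(StandardCharsets.UTF_8));
      input.add(new SketchBlob(props, payload));
    }

    int written =
        rewriteStreaming(
            input.iterator(),
            blob -> System.out.println(blob.properties.get(REFERENCED_DATA_FILE)),
            "/source",
            "/target");
    System.out.println("blobs written: " + written);
  }
}
```

The memory bound only holds if the source really is lazy. If the reader side loads all payloads before iteration begins, moving the write into the loop does not reduce peak usage on its own.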
   
   ## Verification
   
   Ran the following checks successfully:
   
   - `./gradlew :iceberg-core:test --tests org.apache.iceberg.TestRewriteTablePathUtil`
   - `./gradlew :iceberg-core:spotlessCheck :iceberg-core:test --tests org.apache.iceberg.TestRewriteTablePathUtil`
   - `git diff --check`
   
   The targeted core test suite was executed three times during validation.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

