jordepic opened a new pull request, #15970:
URL: https://github.com/apache/iceberg/pull/15970

   Equality delete files require an expensive hash join at read time, so scan 
performance degrades in proportion to mutation volume. This PR adds a Spark 
stored procedure and action that convert equality deletes to deletion vectors 
(Roaring bitmaps stored in Puffin files), which eliminates the join at read time.
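
   To make the read-path difference concrete, here is a minimal Python sketch. It is not Iceberg code: the function names and the in-memory row layout are assumptions made for illustration, a plain Python set of row positions stands in for a Roaring bitmap, and sequence-number rules for which deletes apply to which data files are ignored.

```python
# Illustrative only: a data file is a list of row dicts, and a set of
# row positions stands in for a deletion vector (Roaring bitmap).

def scan_with_equality_deletes(rows, delete_rows, key_fields):
    """Read path with equality deletes: build a hash set from the delete
    rows, then probe it for every data row (a hash join on key_fields)."""
    deleted = {tuple(d[f] for f in key_fields) for d in delete_rows}
    return [r for r in rows if tuple(r[f] for f in key_fields) not in deleted]


def scan_with_deletion_vector(rows, deleted_positions):
    """Read path with a deletion vector: skip rows by position, no join."""
    return [r for pos, r in enumerate(rows) if pos not in deleted_positions]


rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
via_join = scan_with_equality_deletes(rows, [{"id": 2}], ["id"])
via_dv = scan_with_deletion_vector(rows, {1})
```

   Both scans return the same surviving rows; the second one only needs the positions, which is what the conversion precomputes.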
   
   Key design decisions:
   - Per-partition processing with configurable parallelism to bound memory and 
enable progress isolation
   - Files read via FormatModelRegistry with field-ID matching, supporting 
Parquet (vectorized), ORC (vectorized), and Avro
   - Equality-schema groups joined independently within each partition, then 
positions unioned, deduplicated, and written as DVs
   - Join strategy configurable (auto/broadcast/hash/sort-merge)
   - Filter expression support via inclusive partition projection
   - Existing DVs merged during write
   - DVs written on the executors themselves to avoid pressure on the driver
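
   The per-partition flow described above (join each equality-schema group independently, union and deduplicate the matching positions, merge existing DVs) can be sketched roughly as follows. This is a simplified Python model, not the PR's implementation: `convert_partition`, its arguments, and the file names are hypothetical, sets of integers stand in for Roaring bitmaps, and sequence-number checks are omitted.

```python
from collections import defaultdict


def convert_partition(data_files, eq_delete_groups, existing_dvs=None):
    """Model of converting one partition's equality deletes to positions.

    data_files:       {path: [row dict, ...]}, list index = row position
    eq_delete_groups: {tuple of key field names: [delete row dict, ...]}
    existing_dvs:     {path: set of already-deleted positions}
    Returns {path: sorted list of deleted positions}.
    """
    positions = defaultdict(set)

    # Seed with existing DVs so they are merged into the output.
    for path, dv in (existing_dvs or {}).items():
        positions[path] |= dv

    # Join each equality-schema group independently against the data files.
    for key_fields, delete_rows in eq_delete_groups.items():
        build_side = {tuple(d[f] for f in key_fields) for d in delete_rows}
        for path, rows in data_files.items():
            for pos, row in enumerate(rows):
                if tuple(row[f] for f in key_fields) in build_side:
                    # Set insertion unions and deduplicates across groups.
                    positions[path].add(pos)

    # Emit one position list per data file that has any deleted row.
    return {path: sorted(p) for path, p in positions.items() if p}


data_files = {
    "f1.parquet": [
        {"id": 1, "region": "eu"},
        {"id": 2, "region": "us"},
        {"id": 3, "region": "eu"},
    ],
    "f2.parquet": [{"id": 4, "region": "us"}],
}
eq_delete_groups = {
    ("id",): [{"id": 2}, {"id": 3}],
    ("id", "region"): [{"id": 3, "region": "eu"}],
}
dvs = convert_partition(data_files, eq_delete_groups, {"f1.parquet": {0}})
```

   In this toy input, position 2 of `f1.parquet` matches both groups but appears once in the result, and position 0 is carried over from the existing DV.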


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

