[PR] Spark: Add convert_equality_deletes procedure and action [iceberg]

via GitHub Tue, 14 Apr 2026 08:23:32 -0700


jordepic opened a new pull request, #15970:
URL: https://github.com/apache/iceberg/pull/15970


   Equality delete files require an expensive hash join at read time, so scan 
performance degrades in proportion to mutation volume. This PR adds a Spark 
stored procedure and action that convert equality deletes to deletion vectors 
(Roaring bitmaps in Puffin files), eliminating the join at read time.
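
   To illustrate the read-time difference, here is a minimal plain-Python sketch. It does not use Iceberg or Spark APIs: the function names are hypothetical, rows are dicts, and a Python set of positions stands in for the Roaring bitmap. It also ignores sequence numbers, which in a real table decide which data files a delete applies to.

```python
# Illustrative only: plain-Python stand-ins, not Iceberg or Spark APIs.

def scan_with_equality_deletes(rows, delete_rows, key_fields):
    """Read path with equality deletes: build a hash set of deleted keys,
    then probe it with the key of every row that is read."""
    deleted_keys = {tuple(d[f] for f in key_fields) for d in delete_rows}
    return [r for r in rows if tuple(r[f] for f in key_fields) not in deleted_keys]


def scan_with_deletion_vector(rows, deleted_positions):
    """Read path with a deletion vector: a row is dropped purely by its
    position in the data file, so no key comparison is needed."""
    return [r for pos, r in enumerate(rows) if pos not in deleted_positions]


rows = [
    {"id": 1, "v": "a"},
    {"id": 2, "v": "b"},
    {"id": 3, "v": "c"},
    {"id": 2, "v": "d"},
]
eq_deletes = [{"id": 2}]   # "delete every row where id = 2"
dv_positions = {1, 3}      # the same deletes, already resolved to row positions

via_join = scan_with_equality_deletes(rows, eq_deletes, ["id"])
via_dv = scan_with_deletion_vector(rows, dv_positions)
assert via_join == via_dv  # both read paths return the same live rows
```

   Both paths return the same rows; the second one simply skips the key lookup, because the matching was done once, ahead of time.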
   
   Key design decisions:
   - Per-partition processing with configurable parallelism to bound memory and 
enable progress isolation
   - Files read via FormatModelRegistry with field-ID matching, supporting 
Parquet (vectorized), ORC (vectorized), and Avro
   - Equality-schema groups joined independently within each partition, then 
positions unioned, deduplicated, and written as DVs
   - Join strategy configurable (auto/broadcast/hash/sort-merge)
   - Filter expression support via inclusive partition projection
   - Existing DVs merged during write
   - DVs written on the executors themselves to avoid pressure on the driver
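
   The grouping, union, and merge steps listed above can be sketched for a single partition as follows. This is a conceptual sketch under stated assumptions, not the code in this PR: `convert_partition` and its argument shapes are hypothetical, files are in-memory lists of dicts, a Python set stands in for the Roaring bitmap, and sequence-number checks, Puffin output, and Spark join strategies are left out.

```python
from collections import defaultdict


def convert_partition(data_files, eq_delete_files, existing_dvs=None):
    """Hypothetical single-partition conversion.

    data_files:      {path: [row dict, ...]}
    eq_delete_files: [(key_fields, [delete row dict, ...]), ...]
    existing_dvs:    {path: set of already-deleted positions}
    Returns {path: sorted list of deleted positions}.
    """
    # 1. Group delete rows by their equality schema (the set of key fields).
    groups = defaultdict(set)
    for key_fields, delete_rows in eq_delete_files:
        fields = tuple(key_fields)
        for d in delete_rows:
            groups[fields].add(tuple(d[f] for f in fields))

    # 2. Seed the result with positions from existing DVs so they are merged.
    positions = defaultdict(set)
    for path, dv in (existing_dvs or {}).items():
        positions[path] |= set(dv)

    # 3. Join each schema group against the data files independently and
    #    union the matching positions; the set deduplicates them.
    for fields, keys in groups.items():
        for path, rows in data_files.items():
            for pos, row in enumerate(rows):
                if tuple(row[f] for f in fields) in keys:
                    positions[path].add(pos)

    # 4. One deletion vector per data file that has any deleted rows.
    return {path: sorted(p) for path, p in positions.items() if p}


data_files = {
    "a.parquet": [{"id": 1, "region": "eu"}, {"id": 2, "region": "us"}, {"id": 3, "region": "eu"}],
    "b.parquet": [{"id": 2, "region": "eu"}, {"id": 4, "region": "us"}],
}
eq_delete_files = [
    (("id",), [{"id": 2}]),
    (("id", "region"), [{"id": 3, "region": "eu"}]),
    (("id",), [{"id": 2}, {"id": 4}]),  # overlaps with the first file
]
existing_dvs = {"a.parquet": {0}}

dvs = convert_partition(data_files, eq_delete_files, existing_dvs)
```

   In this toy input the two `("id",)` delete files collapse into one group, the `("id", "region")` group is matched separately, and position 0 of `a.parquet` is carried over from the existing DV.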


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

