jordepic opened a new pull request, #15970: URL: https://github.com/apache/iceberg/pull/15970
Equality delete files require an expensive hash join at read time, so scan performance degrades in proportion to mutation volume. This PR adds a Spark stored procedure and action that convert equality deletes to deletion vectors (Roaring bitmaps in Puffin files), eliminating the join at read time.

Key design decisions:
- Per-partition processing with configurable parallelism, to bound memory and isolate progress
- Files read via FormatModelRegistry with field-ID matching, supporting Parquet (vectorized), ORC (vectorized), and Avro
- Equality-schema groups joined independently within each partition, then positions unioned, deduplicated, and written as DVs
- Join strategy configurable (auto/broadcast/hash/sort-merge)
- Filter expression support via inclusive partition projection
- Existing DVs merged during write
- DVs written on the executors themselves to avoid pressure on the driver
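To make the conversion concrete, here is a minimal, engine-free sketch of the per-partition logic described above: each equality-schema group is joined against the data rows on its own, the matching row positions are unioned and deduplicated per data file, and any existing DV positions are merged in. This is an illustration, not the code in this PR. The function names (`positions_for_group`, `convert_partition`) and the sample data are hypothetical, plain Python sets stand in for Roaring bitmaps, and sequence-number checks, Puffin output, and Spark join strategies are left out.

```python
from collections import defaultdict


def positions_for_group(rows, key_fields, delete_keys):
    """Join one equality-schema group against the data rows.

    rows: iterable of (file_path, position, record_dict)
    key_fields: tuple of column names forming the equality schema
    delete_keys: set of key tuples taken from the equality delete files
    Returns {file_path: set of deleted row positions}.
    """
    hits = defaultdict(set)
    for path, pos, record in rows:
        key = tuple(record[f] for f in key_fields)
        if key in delete_keys:  # hash lookup, i.e. the build side of a hash join
            hits[path].add(pos)
    return hits


def convert_partition(rows, groups, existing_dvs=None):
    """Union positions from every group and merge them with existing DVs."""
    merged = defaultdict(set)
    for path, positions in (existing_dvs or {}).items():
        merged[path] |= set(positions)
    for key_fields, delete_keys in groups:
        group_hits = positions_for_group(rows, key_fields, delete_keys)
        for path, positions in group_hits.items():
            merged[path] |= positions  # set union also deduplicates
    return {path: sorted(p) for path, p in merged.items() if p}


# Hypothetical partition with two data files.
rows = [
    ("a.parquet", 0, {"id": 1, "region": "us"}),
    ("a.parquet", 1, {"id": 2, "region": "eu"}),
    ("a.parquet", 2, {"id": 3, "region": "us"}),
    ("b.parquet", 0, {"id": 2, "region": "us"}),
    ("b.parquet", 1, {"id": 4, "region": "eu"}),
]

# Two equality-schema groups: one keyed on (id), one on (id, region).
groups = [
    (("id",), {(2,)}),
    (("id", "region"), {(3, "us"), (2, "eu")}),
]

# Position 1 of b.parquet is already marked deleted by an existing DV.
existing = {"b.parquet": [1]}

dvs = convert_partition(rows, groups, existing)
print(dvs)  # {'b.parquet': [0, 1], 'a.parquet': [1, 2]}
```

Row 1 of `a.parquet` matches both groups but appears once in the result, which is the deduplication step. Once positions are materialized like this, a reader only needs a per-file bitmap lookup instead of a join against the delete keys.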
