geruh opened a new pull request, #2255: URL: https://github.com/apache/iceberg-python/pull/2255
Closes #1210

# Summary

This work was primarily done by @rutb327 while I provided guidance!

This PR adds equality delete read support to PyIceberg by implementing the delete file indexing system that matches delete files to data files, mimicking the behavior found in [Iceberg Core](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/DeleteFileIndex.java). With this index in place, PyIceberg can match delete files to data files and apply equality deletes during table scans.

## Design details

### Delete File Index

The new `DeleteFileIndex` class centralizes handling of all delete file types: positional deletes, equality deletes, and deletion vectors. It organizes deletes by:

- type (equality vs. positional),
- partition (using `PartitionMap` for spec-aware grouping), and
- path (for path-specific positional deletes).

This enables efficient lookup during table scans and reduces unnecessary delete file processing.

### Equality Delete Support

Equality delete files are loaded as PyArrow tables together with their equality IDs (the schema field IDs that identify the columns used for matching). Tables that share the same set of equality IDs are grouped together to reduce the number of anti-join operations.

# Testing

Added tests ported from the Iceberg Core [DeleteFileIndex](https://github.com/apache/iceberg/blob/main/core/src/test/java/org/apache/iceberg/DeleteFileIndexTestBase.java#L45) test suite, plus some tests with dummy files. Also tested manually with a Flink setup:

```
table_eq with only equality deletes on id=2, id=5
+---+-------+
| id|   data|
+---+-------+
|  1|  Alice|
|  3|Charlie|
|  4|  David|
|  6|  Frank|
+---+-------+

table_eq_pos with equality deletes and positional delete at position 3
+---+-----+
| id| data|
+---+-----+
|  1|Alice|
|  4|David|
|  6|Frank|
+---+-----+
```

# Are there any user-facing changes?

Yes. PyIceberg can now read tables with equality deletes.
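To illustrate the grouping idea behind the equality delete support, here is a minimal sketch in plain Python. It is not the code from this PR: plain dicts stand in for PyArrow tables, column names stand in for equality field IDs, and the helper names `group_by_equality_ids` and `apply_equality_deletes` are hypothetical. The rows for `id=2` and `id=5` use placeholder values, since the PR output only shows the surviving rows.

```python
from collections import defaultdict


def group_by_equality_ids(delete_files):
    """Merge delete rows from files that share the same set of equality columns.

    `delete_files` is a list of (equality_columns, rows) pairs. Column names are
    used here for readability; this is an assumption of the sketch, since real
    equality deletes identify columns by field ID.
    """
    groups = defaultdict(set)
    for columns, rows in delete_files:
        key = tuple(sorted(columns))
        for row in rows:
            groups[key].add(tuple(row[c] for c in key))
    return groups


def apply_equality_deletes(data_rows, delete_files):
    """Anti-join: keep only data rows that match no delete key in any group."""
    groups = group_by_equality_ids(delete_files)
    kept = []
    for row in data_rows:
        deleted = any(
            tuple(row[c] for c in key) in delete_keys
            for key, delete_keys in groups.items()
        )
        if not deleted:
            kept.append(row)
    return kept


data = [
    {"id": 1, "data": "Alice"},
    {"id": 2, "data": "placeholder-2"},  # value not shown in the PR output
    {"id": 3, "data": "Charlie"},
    {"id": 4, "data": "David"},
    {"id": 5, "data": "placeholder-5"},  # value not shown in the PR output
    {"id": 6, "data": "Frank"},
]

# Two separate delete files that both use the equality column set {"id"}.
deletes = [
    (("id",), [{"id": 2}]),
    (("id",), [{"id": 5}]),
]

result = apply_equality_deletes(data, deletes)
print([row["id"] for row in result])
```

Because both delete files share the same equality column set, they collapse into a single group, so each data row is checked against one combined key set rather than once per delete file. That is the saving the grouping is meant to provide.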
