tusharchou opened a new pull request, #3149:
URL: https://github.com/apache/iceberg-python/pull/3149
## Rationale
While reviewing PR #3011 (manifest pruning optimization), I identified a correctness gap that affects tables whose partition spec has evolved.

When `dynamic_partition_overwrite` is called on a table whose current snapshot contains manifests with mixed `partition_spec_id`s, the delete predicate was built using only the **current** partition spec. This caused `inclusive_projection` to fail silently when evaluating older manifests: the predicate contained field references (e.g. `region`) that have no corresponding partition field in the old spec, so the manifest evaluator skipped those manifests entirely. The result is silent data duplication: stale rows in old-spec manifests are never deleted.
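
The failure mode described above can be reproduced with a toy model. The sketch below is not PyIceberg code: `manifest_may_match`, the `day` column, and the dictionary-based predicate are hypothetical stand-ins, and the "missing field means skip" rule simply encodes the behavior reported in this PR.

```python
from typing import Dict, Tuple


def manifest_may_match(
    predicate: Dict[str, str],
    spec_fields: Tuple[str, ...],
    manifest_partition: Dict[str, str],
) -> bool:
    """Toy manifest evaluator (hypothetical, not the PyIceberg implementation).

    Returns True when a manifest written under `spec_fields` may contain rows
    matching the equality `predicate`.
    """
    for column, value in predicate.items():
        if column not in spec_fields:
            # Reported gap: a reference with no partition field in this spec
            # makes the evaluator skip the manifest instead of scanning it.
            return False
        if manifest_partition[column] != value:
            return False
    return True


# Hypothetical evolution: spec-0 partitions by `day`, spec-1 adds `region`.
SPEC_0 = ("day",)
SPEC_1 = ("day", "region")

# Predicate built from the current spec (spec-1) only.
delete_predicate = {"day": "2024-01-01", "region": "us"}

old_manifest = {"day": "2024-01-01"}                  # written under spec-0
new_manifest = {"day": "2024-01-01", "region": "us"}  # written under spec-1

skipped_old = not manifest_may_match(delete_predicate, SPEC_0, old_manifest)
matched_new = manifest_may_match(delete_predicate, SPEC_1, new_manifest)
print(skipped_old, matched_new)
```

In this model the spec-0 manifest is skipped even though it holds rows for the overwritten day, which is how stale rows would survive the overwrite.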
## Changes
- `pyiceberg/table/__init__.py`: `dynamic_partition_overwrite` now iterates over all `partition_spec_id`s present in the current snapshot and builds a per-spec delete predicate, projecting the new data files' partition values into each historical spec's coordinate space before evaluating.
- `tests/integration/test_manifest_pruning_spec_evolution.py`: two regression tests added:
  1. Mixed-spec snapshot: overwrite a partition present under both spec-0 and spec-1.
  2. Overwrite a partition that exists **only** in spec-0 manifests (the silent data duplication case: no exception is raised, but the wrong rows survive).
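
The per-spec predicate construction in the first bullet can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in this PR: `PartitionSpec`, `project_partition`, and `build_per_spec_predicates` are hypothetical names, only identity partitioning is modeled, and the `day`/`region` columns are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PartitionSpec:
    """Hypothetical stand-in for a partition spec (identity transforms only)."""
    spec_id: int
    fields: Tuple[str, ...]


def project_partition(values: Dict[str, str], spec: PartitionSpec) -> Dict[str, str]:
    """Keep only the partition values that `spec` actually partitions by."""
    return {name: values[name] for name in spec.fields if name in values}


def build_per_spec_predicates(
    new_partitions: List[Dict[str, str]],
    specs: List[PartitionSpec],
) -> Dict[int, List[Dict[str, str]]]:
    """For every spec, collect the distinct projected partitions to match.

    Each inner dict is a conjunction of equalities; the list is their OR.
    """
    predicates: Dict[int, List[Dict[str, str]]] = {}
    for spec in specs:
        projected: List[Dict[str, str]] = []
        for values in new_partitions:
            candidate = project_partition(values, spec)
            if candidate not in projected:  # de-duplicate after projection
                projected.append(candidate)
        predicates[spec.spec_id] = projected
    return predicates


# Hypothetical evolution: spec-0 partitions by `day`, spec-1 adds `region`.
specs = [PartitionSpec(0, ("day",)), PartitionSpec(1, ("day", "region"))]
new_partitions = [
    {"day": "2024-01-01", "region": "us"},
    {"day": "2024-01-01", "region": "eu"},
]
per_spec = build_per_spec_predicates(new_partitions, specs)
for spec_id, predicate in per_spec.items():
    print(spec_id, predicate)
```

Note that the projected spec-0 predicate is coarser than the original, so in this sketch it only identifies candidate manifests; deciding which rows to delete would still require the full row-level predicate.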
## Are these changes tested?
Yes. Two new integration tests use the SQLite in-memory catalog, so no external services are required.
## Are there any user-facing changes?
Yes. `dynamic_partition_overwrite` now correctly deletes all matching rows across all historical partition specs, fixing silent data duplication on evolved tables.
## Related
- Fixes #3148
- Related to #3011 (manifest pruning optimization that exposed this gap)
- Related to #1108 (prior spec evolution fix in manifest rewriting by @Fokko)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]