tusharchou opened a new issue, #3148:
URL: https://github.com/apache/iceberg-python/issues/3148

   ## Summary
   
   `dynamic_partition_overwrite` produces incorrect results when a table has
   undergone partition spec evolution. Manifests written under older specs are
   silently skipped by the manifest pruning logic introduced in #3011, leaving
   stale data files that should have been deleted.
   
   ## Root cause
   
   In `Table.dynamic_partition_overwrite` (`table/__init__.py`), the delete
   predicate is built using only the **current** partition spec:
   ```python
   delete_filter = self._build_partition_predicate(
       partition_records=partitions_to_overwrite,
       spec=self.table_metadata.spec(),       # always current spec
       schema=self.table_metadata.schema()
   )
   ```
   
   A snapshot with mixed `partition_spec_id`s (spec-0 and spec-1 manifests)
   passes this single predicate to `_DeleteFiles`. The manifest evaluator in
   `_build_partition_projection` uses `inclusive_projection(schema, spec)` —
   when projecting a spec-1 predicate (e.g. `category=A AND region=us`) through
   spec-0 (which only has `category`), the `region` reference has no
   corresponding partition field, causing the evaluator to incorrectly skip
   spec-0 manifests entirely.
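   
   The skip path can be illustrated with a minimal stand-in for the projection
   step (plain Python; the real logic lives in pyiceberg's projection visitor —
   the `project` helper and the field sets here are hypothetical, for
   illustration only):
   ```python
   # Minimal stand-in for the faulty projection logic (hypothetical helper,
   # not the actual pyiceberg implementation).

   def project(predicate_fields: set, spec_fields: set):
       """Return the predicate fields a spec can evaluate, or None when any
       reference has no matching partition field (the faulty 'skip' path)."""
       if predicate_fields - spec_fields:
           # An unmapped reference ('region' under spec-0) makes the evaluator
           # treat the manifest as 'cannot match', so it is silently skipped.
           return None
       return predicate_fields

   spec0 = {"category"}                 # historical spec
   spec1 = {"category", "region"}       # current spec
   predicate = {"category", "region"}   # built from the current spec only

   print(project(predicate, spec1) is not None)  # True  -- spec-1 manifests evaluated
   print(project(predicate, spec0))              # None  -- spec-0 manifests skipped
   ```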
   
   ## Reproduction
   ```python
   import tempfile
   import pyarrow as pa
   from pyiceberg.catalog import load_catalog
   from pyiceberg.schema import Schema
   from pyiceberg.types import NestedField, StringType, LongType
   from pyiceberg.partitioning import PartitionSpec, PartitionField
   from pyiceberg.transforms import IdentityTransform

   schema = Schema(
       NestedField(1, "category", StringType(), required=False),
       NestedField(2, "region",   StringType(), required=False),
       NestedField(3, "value",    LongType(),   required=False),
   )
   spec_v0 = PartitionSpec(
       PartitionField(source_id=1, field_id=1000, transform=IdentityTransform(), name="category")
   )
   # Explicit arrow schema so the all-None "region" column is typed as string,
   # not inferred as null
   arrow_schema = pa.schema([
       pa.field("category", pa.string()),
       pa.field("region", pa.string()),
       pa.field("value", pa.int64()),
   ])

   with tempfile.TemporaryDirectory() as warehouse:
       catalog = load_catalog(
           "test",
           **{"type": "sql", "uri": f"sqlite:///{warehouse}/catalog.db", "warehouse": f"file://{warehouse}"},
       )
       catalog.create_namespace("default")
       table = catalog.create_table("default.test", schema=schema, partition_spec=spec_v0)

       # Write under spec 0
       table.append(pa.table(
           {"category": ["A", "A", "B"], "region": [None, None, None], "value": [1, 2, 10]},
           schema=arrow_schema,
       ))

       # Evolve spec
       with table.update_spec() as u:
           u.add_field("region", IdentityTransform(), "region")
       table = catalog.load_table("default.test")

       # Write under spec 1
       table.append(pa.table(
           {"category": ["A", "B"], "region": ["us", "us"], "value": [100, 200]},
           schema=arrow_schema,
       ))

       # Overwrite category=A — should delete ALL A rows (both specs)
       table.dynamic_partition_overwrite(
           pa.table({"category": ["A"], "region": ["us"], "value": [999]}, schema=arrow_schema)
       )

       result = table.scan().to_arrow().to_pydict()
       a_values = [v for c, v in zip(result["category"], result["value"]) if c == "A"]
       print(a_values)  # BUG: prints [1, 2, 100, 999] — stale rows from spec-0 not deleted
                        # EXPECTED: [999]
   ```
   
   ## Fix
   
   Build the delete predicate **per historical spec** present in the snapshot,
   projecting the new data files' partition values into each spec's coordinate
   space before evaluating. PR with fix and regression tests to follow.
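   
   A rough sketch of that per-spec grouping (pure-Python illustration; the
   function name `build_per_spec_predicates` and its data shapes are
   assumptions, not the final PR or pyiceberg API):
   ```python
   # Sketch of the proposed fix: one delete predicate per historical spec,
   # restricted to the partition fields that spec actually defines.
   # Hypothetical helper -- names and structures are assumptions.

   def build_per_spec_predicates(partitions_to_overwrite, specs):
       """partitions_to_overwrite: list of partition-value dicts (current spec).
       specs: mapping of spec_id -> list of partition field names.
       Returns spec_id -> list of projected partition-value dicts."""
       predicates = {}
       for spec_id, fields in specs.items():
           # Project each new partition tuple into this spec's coordinate
           # space by keeping only the fields the spec knows about.
           predicates[spec_id] = [
               {f: rec[f] for f in fields if f in rec}
               for rec in partitions_to_overwrite
           ]
       return predicates

   specs = {0: ["category"], 1: ["category", "region"]}
   records = [{"category": "A", "region": "us"}]
   per_spec = build_per_spec_predicates(records, specs)
   print(per_spec[0])  # [{'category': 'A'}] -- spec-0 manifests now match category=A
   print(per_spec[1])  # [{'category': 'A', 'region': 'us'}]
   ```
   With this shape, spec-0 manifests are evaluated against `category=A` alone
   instead of an unsatisfiable two-field predicate.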
   
   ## Related
   - #3011 (introduced the manifest pruning optimization)
- #1108 (prior related fix by @Fokko for spec evolution in manifest rewriting)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
