drbothen opened a new issue, #2148:
URL: https://github.com/apache/iceberg-rust/issues/2148

   ## Description
   
   `FastAppendAction::existing_manifest()` in 
`crates/iceberg/src/transaction/append.rs` filters manifest list entries with:
   
   ```rust
   .filter(|entry| entry.has_added_files() || entry.has_existing_files())
   ```
   
   This drops manifests that contain **only** Deleted entries 
(`has_deleted_files()` but neither `has_added_files()` nor 
`has_existing_files()`).
   
   ## Impact
   
   After a `rewrite_files` operation (or any operation that creates a 
delete-only manifest to mark old files as removed), a subsequent `fast_append` 
drops the delete manifest from the new snapshot's manifest list. The old 
manifests still carry Added entries for the removed files, but there is no 
longer a Delete manifest to exclude them. The deleted files reappear as alive.
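
To make the mechanism concrete, here is a toy model of the behaviour described above. The `Status`, `Manifest`, `live_files`, `carry_forward`, and `after_rewrite` names are hypothetical and are not the iceberg-rust API. Liveness is computed the way this issue describes it (Added/Existing paths minus Deleted paths), which is a simplification rather than real scan planning.

```rust
use std::collections::BTreeSet;

// Hypothetical, simplified types for illustration only (not the iceberg-rust API).
#[derive(Clone, Copy, PartialEq)]
enum Status {
    Added,
    Existing,
    Deleted,
}

struct Manifest {
    entries: Vec<(Status, &'static str)>,
}

impl Manifest {
    // True if at least one entry in this manifest has the given status.
    fn has(&self, status: Status) -> bool {
        self.entries.iter().any(|(s, _)| *s == status)
    }
}

// Live paths under the model in this issue: Added/Existing minus Deleted.
fn live_files(manifests: &[Manifest]) -> BTreeSet<&'static str> {
    let mut live = BTreeSet::new();
    let mut deleted = BTreeSet::new();
    for manifest in manifests {
        for (status, path) in &manifest.entries {
            match status {
                Status::Deleted => {
                    deleted.insert(*path);
                }
                Status::Added | Status::Existing => {
                    live.insert(*path);
                }
            }
        }
    }
    live.difference(&deleted).copied().collect()
}

// Mimics the carry-forward filter; `keep_delete_only = false` is the current behaviour.
fn carry_forward(manifests: Vec<Manifest>, keep_delete_only: bool) -> Vec<Manifest> {
    manifests
        .into_iter()
        .filter(|m| {
            m.has(Status::Added)
                || m.has(Status::Existing)
                || (keep_delete_only && m.has(Status::Deleted))
        })
        .collect()
}

// Manifest list after a rewrite that compacted a.parquet and b.parquet.
fn after_rewrite() -> Vec<Manifest> {
    vec![
        Manifest { entries: vec![(Status::Added, "a.parquet"), (Status::Added, "b.parquet")] },
        Manifest { entries: vec![(Status::Deleted, "a.parquet"), (Status::Deleted, "b.parquet")] },
        Manifest { entries: vec![(Status::Added, "ab-compacted.parquet")] },
    ]
}
```

In this model, `live_files(&carry_forward(after_rewrite(), false))` contains the two rewritten files again alongside the compacted one, while passing `true` leaves only the compacted file.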
   
   This causes **compounding data duplication** — each subsequent append or 
rewrite cycle adds another copy of the "ghost" files, producing exponential row 
growth:
   
   ```
   Cycle 1: 72 rows
   Cycle 2: 145 rows
   Cycle 3: 297 rows
   ...
   Cycle 12: 235,026 rows
   ```
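
As a quick sanity check on "exponential", the average per-cycle multiplier implied by the counts above can be computed as a geometric mean. The helper below is only illustrative arithmetic (`growth_factor` is a hypothetical name); for the first and last counts listed, 11 cycles apart, it comes out slightly above 2, i.e. the row count roughly doubles each cycle.

```rust
// Average per-cycle multiplier between two row counts that are `cycles` apart.
// Computed as the geometric mean: (last / first)^(1 / cycles).
fn growth_factor(first_rows: u64, last_rows: u64, cycles: u32) -> f64 {
    (last_rows as f64 / first_rows as f64).powf(1.0 / f64::from(cycles))
}
```

For example, `growth_factor(72, 235_026, 11)` covers cycle 1 through cycle 12, and `growth_factor(72, 145, 1)` covers the first step alone.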
   
   ## Root Cause
   
   The filter in `existing_manifest()` was intended to skip empty manifests, 
but it inadvertently skips delete-only manifests. A delete-only manifest is not 
empty — it records which file paths were removed and must be preserved until 
`expire_snapshots` cleans it up.
   
   ## Fix
   
   Add `|| entry.has_deleted_files()` to the filter:
   
   ```rust
   .filter(|entry| {
       entry.has_added_files()
           || entry.has_existing_files()
           || entry.has_deleted_files()
   })
   ```
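
The difference between the two predicates can be sketched in isolation. `EntryCounts` below is a hypothetical stand-in for a manifest list entry with plain counters. The real iceberg-rust type carries more fields and may treat missing counts differently, so this is a sketch of the intent and not the crate's implementation.

```rust
// Hypothetical stand-in for a manifest list entry (not the iceberg-rust type).
#[derive(Debug, Clone, PartialEq)]
struct EntryCounts {
    added: u32,
    existing: u32,
    deleted: u32,
}

impl EntryCounts {
    fn has_added_files(&self) -> bool {
        self.added > 0
    }
    fn has_existing_files(&self) -> bool {
        self.existing > 0
    }
    fn has_deleted_files(&self) -> bool {
        self.deleted > 0
    }
}

// Current behaviour: a delete-only entry is treated like an empty one.
fn current_filter(entry: &EntryCounts) -> bool {
    entry.has_added_files() || entry.has_existing_files()
}

// Proposed behaviour: only entries with no files of any status are skipped.
fn proposed_filter(entry: &EntryCounts) -> bool {
    entry.has_added_files() || entry.has_existing_files() || entry.has_deleted_files()
}
```

With these definitions, an entry with `added: 0, existing: 0, deleted: 2` is rejected by `current_filter` and kept by `proposed_filter`, while an entry with all counts at zero is still skipped by both.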
   
   ## Reproduction
   
   1. Create a table and append data files
   2. Perform a `rewrite_files` operation (replaces old files with a compacted 
file)
   3. Perform a `fast_append` with new data files
   4. Scan the table — deleted files from step 2 reappear as live data
   5. Repeat steps 2-4 — duplication compounds exponentially
   
   ## Notes
   
   - `rewrite_files` is not yet on `main`, so this bug is currently latent. It 
becomes triggerable as soon as any operation that produces delete-only 
manifests lands.
   - The Iceberg spec requires delete manifests to persist across snapshots 
until they are cleaned up by `expire_snapshots`. Dropping them prematurely 
violates snapshot isolation guarantees.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

