drbothen opened a new issue, #2148:
URL: https://github.com/apache/iceberg-rust/issues/2148
## Description
`FastAppendAction::existing_manifest()` in
`crates/iceberg/src/transaction/append.rs` filters manifest list entries with:
```rust
.filter(|entry| entry.has_added_files() || entry.has_existing_files())
```
This drops manifests that contain **only** Deleted entries
(`has_deleted_files()` but neither `has_added_files()` nor
`has_existing_files()`).
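As a rough illustration of what this filter does to a manifest list, here is a minimal self-contained sketch. The `Entry` struct, its plain `u32` counts, and the `carried_over` helper are hypothetical stand-ins, not the actual iceberg-rust types; only the three `has_*_files()` method names come from the snippet above.

```rust
/// Simplified stand-in for a manifest list entry (hypothetical, not the iceberg-rust type).
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    path: &'static str,
    added: u32,
    existing: u32,
    deleted: u32,
}

impl Entry {
    fn has_added_files(&self) -> bool {
        self.added > 0
    }
    fn has_existing_files(&self) -> bool {
        self.existing > 0
    }
    fn has_deleted_files(&self) -> bool {
        self.deleted > 0
    }
}

/// Applies the filter quoted above and returns the paths of the entries that survive.
fn carried_over(entries: &[Entry]) -> Vec<&'static str> {
    entries
        .iter()
        .filter(|entry| entry.has_added_files() || entry.has_existing_files())
        .map(|entry| entry.path)
        .collect()
}

fn main() {
    let entries = [
        Entry { path: "m-added.avro", added: 3, existing: 0, deleted: 0 },
        Entry { path: "m-delete-only.avro", added: 0, existing: 0, deleted: 3 },
        Entry { path: "m-empty.avro", added: 0, existing: 0, deleted: 0 },
    ];

    let kept = carried_over(&entries);
    println!("kept: {kept:?}");

    // Report what was dropped, and whether the dropped entry still carried deletes.
    for e in entries.iter().filter(|e| !kept.contains(&e.path)) {
        println!("dropped {} (has_deleted_files = {})", e.path, e.has_deleted_files());
    }
}
```

In this model the delete-only entry is dropped together with the truly empty one, even though only the latter carries no information.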
## Impact
After a `rewrite_files` operation (or any operation that creates a
delete-only manifest to mark old files as removed), a subsequent `fast_append`
drops the delete manifest from the new snapshot's manifest list. The old
manifests still carry Added entries for the removed files, but there is no
longer a Delete manifest to exclude them. The deleted files reappear as alive.
This causes **compounding data duplication**: each subsequent append or
rewrite cycle adds another copy of the "ghost" files, producing exponential row
growth:
```
Cycle 1: 72 rows
Cycle 2: 145 rows
Cycle 3: 297 rows
...
Cycle 12: 235,026 rows
```
## Root Cause
The filter in `existing_manifest()` was intended to skip empty manifests,
but it inadvertently skips delete-only manifests. A delete-only manifest is not
empty — it records which file paths were removed and must be preserved until
`expire_snapshots` cleans it up.
## Fix
Add `|| entry.has_deleted_files()` to the filter:
```rust
.filter(|entry| {
    entry.has_added_files()
        || entry.has_existing_files()
        || entry.has_deleted_files()
})
```
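A small sketch of the intended behavior of the extended predicate, under the same simplifying assumptions: `Entry` and `keep_manifest` are hypothetical names with plain `u32` counts, not the real iceberg-rust API. The point it tries to show is that delete-only manifests are now kept while truly empty manifests are still skipped.

```rust
/// Simplified stand-in for a manifest list entry (hypothetical, not the iceberg-rust type).
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    path: &'static str,
    added: u32,
    existing: u32,
    deleted: u32,
}

impl Entry {
    fn has_added_files(&self) -> bool {
        self.added > 0
    }
    fn has_existing_files(&self) -> bool {
        self.existing > 0
    }
    fn has_deleted_files(&self) -> bool {
        self.deleted > 0
    }
}

/// Proposed predicate: keep a manifest if it has entries of any kind.
fn keep_manifest(entry: &Entry) -> bool {
    entry.has_added_files() || entry.has_existing_files() || entry.has_deleted_files()
}

fn main() {
    let entries = [
        Entry { path: "m-added.avro", added: 3, existing: 0, deleted: 0 },
        Entry { path: "m-delete-only.avro", added: 0, existing: 0, deleted: 3 },
        Entry { path: "m-empty.avro", added: 0, existing: 0, deleted: 0 },
    ];

    for entry in &entries {
        println!("{}: keep = {}", entry.path, keep_manifest(entry));
    }
}
```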
## Reproduction
1. Create a table and append data files
2. Perform a `rewrite_files` operation (replaces old files with a compacted
file)
3. Perform a `fast_append` with new data files
4. Scan the table — deleted files from step 2 reappear as live data
5. Repeat steps 2-4 — duplication compounds exponentially
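The steps above can be approximated with a toy in-memory model. Everything here is an assumption made for illustration: `Manifest`, `Status`, `rewrite`, `fast_append`, and `live_files` are hypothetical, and the model follows the behavior described in this issue (old manifests keep their Added entries, and a delete-only manifest excludes them) rather than the actual iceberg-rust scan logic. The `keep_delete_only` flag switches between the current filter and the proposed one.

```rust
use std::collections::BTreeSet;

#[derive(Clone, Copy, PartialEq)]
enum Status {
    Added,
    Deleted,
}

/// Toy manifest: a list of (status, data file path) entries.
#[derive(Clone)]
struct Manifest {
    entries: Vec<(Status, &'static str)>,
}

impl Manifest {
    fn new(status: Status, files: &[&'static str]) -> Self {
        Manifest { entries: files.iter().map(|f| (status, *f)).collect() }
    }
    fn has(&self, status: Status) -> bool {
        self.entries.iter().any(|(s, _)| *s == status)
    }
}

/// Step 2: model a rewrite as one delete-only manifest plus one manifest for the new file.
fn rewrite(current: &[Manifest], removed: &[&'static str], added: &[&'static str]) -> Vec<Manifest> {
    let mut next = current.to_vec();
    next.push(Manifest::new(Status::Deleted, removed));
    next.push(Manifest::new(Status::Added, added));
    next
}

/// Step 3: carry over existing manifests, then append a manifest for the new files.
/// With `keep_delete_only == false` this mirrors the filter reported in the issue.
fn fast_append(current: &[Manifest], new_files: &[&'static str], keep_delete_only: bool) -> Vec<Manifest> {
    let mut next: Vec<Manifest> = current
        .iter()
        .filter(|m| m.has(Status::Added) || (keep_delete_only && m.has(Status::Deleted)))
        .cloned()
        .collect();
    next.push(Manifest::new(Status::Added, new_files));
    next
}

/// Step 4: files with an Added entry that no Deleted entry in the list excludes.
fn live_files(manifests: &[Manifest]) -> BTreeSet<&'static str> {
    let deleted: BTreeSet<&'static str> = manifests
        .iter()
        .flat_map(|m| m.entries.iter())
        .filter(|(s, _)| *s == Status::Deleted)
        .map(|(_, f)| *f)
        .collect();
    manifests
        .iter()
        .flat_map(|m| m.entries.iter())
        .filter(|(s, f)| *s == Status::Added && !deleted.contains(f))
        .map(|(_, f)| *f)
        .collect()
}

fn main() {
    // Step 1: a table with two appended data files.
    let table = vec![Manifest::new(Status::Added, &["a.parquet", "b.parquet"])];
    // Step 2: compact them into a single file.
    let table = rewrite(&table, &["a.parquet", "b.parquet"], &["ab.parquet"]);

    // Steps 3 and 4, once with the current filter and once with the proposed one.
    for keep_delete_only in [false, true] {
        let after = fast_append(&table, &["c.parquet"], keep_delete_only);
        println!("keep_delete_only = {keep_delete_only}: live = {:?}", live_files(&after));
    }
}
```

In this model, the run with `keep_delete_only = false` reports the removed files as live again alongside their compacted replacement, which is the duplication described above.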
## Notes
- `rewrite_files` is not yet on `main`, so this bug is currently latent. It
becomes triggerable as soon as any operation that produces delete-only
manifests lands.
- The Iceberg spec requires delete manifests to persist across snapshots
until they are cleaned up by `expire_snapshots`. Dropping them prematurely
violates snapshot isolation guarantees.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]