shangxinli opened a new pull request, #592:
URL: https://github.com/apache/iceberg-cpp/pull/592
## Summary
- Implement the file cleanup logic missing from expire snapshots (#490 noted
"TODO: File recycling will be added in a followup PR")
- Port the "reachable file cleanup" strategy from Java's
`ReachableFileCleanup`
- Single-threaded implementation; multi-threaded and incremental cleanup are
left as TODOs
## Changes
- Add a `Finalize()` override, called after a successful commit, to clean up
expired files
- Add `CleanExpiredFiles()` implementing the reachable file cleanup strategy:
  1. Collect manifest paths from expired and retained snapshots
  2. Drop from the deletion set any manifests still referenced by retained
     snapshots
  3. Find data files referenced only by manifests being deleted, then subtract
     files still reachable from retained manifests
  4. Delete orphaned manifests, manifest lists, and statistics files
- Best-effort deletion: suppress errors on individual file deletions to
avoid blocking metadata updates (matching Java's `suppressFailureWhenFinished`)
- Branch/tag awareness: retained snapshot set includes all snapshots
reachable from any ref
- Respect `CleanupLevel`: `kNone` skips all, `kMetadataOnly` skips data
files, `kAll` cleans everything
- Uses `FileIO::DeleteFile` for filesystem compatibility (S3, HDFS, local)
- 5 new tests for file cleanup behavior
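
For reviewers, the set arithmetic behind the steps above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `Manifest`, `Snapshot`, `CollectFilesToDelete`, and `DeleteBestEffort` are hypothetical stand-ins, the `CleanupLevel` enum is redeclared locally, statistics files are omitted, and the delete callback only approximates what a `FileIO::DeleteFile` call or a custom delete function would do.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for illustration only.
enum class CleanupLevel { kNone, kMetadataOnly, kAll };

struct Manifest {
  std::string path;
  std::vector<std::string> data_files;
};

struct Snapshot {
  std::string manifest_list;
  std::vector<Manifest> manifests;
};

// Computes the paths that become unreachable once `expired` is removed.
// `retained` is assumed to already contain every snapshot reachable from
// any branch or tag.
std::set<std::string> CollectFilesToDelete(const std::vector<Snapshot>& expired,
                                           const std::vector<Snapshot>& retained,
                                           CleanupLevel level) {
  std::set<std::string> to_delete;
  if (level == CleanupLevel::kNone) return to_delete;

  // Step 1: gather what retained snapshots still reference.
  std::set<std::string> retained_manifests;
  std::set<std::string> retained_data;
  for (const auto& snapshot : retained) {
    for (const auto& manifest : snapshot.manifests) {
      retained_manifests.insert(manifest.path);
      retained_data.insert(manifest.data_files.begin(),
                           manifest.data_files.end());
    }
  }

  for (const auto& snapshot : expired) {
    // Step 4 (part): the manifest list of an expired snapshot is orphaned.
    to_delete.insert(snapshot.manifest_list);
    for (const auto& manifest : snapshot.manifests) {
      // Step 2: keep manifests that a retained snapshot still uses.
      if (retained_manifests.count(manifest.path) > 0) continue;
      to_delete.insert(manifest.path);
      // Step 3: data files are only considered at kAll, and only when no
      // retained manifest still reaches them.
      if (level != CleanupLevel::kAll) continue;
      for (const auto& file : manifest.data_files) {
        if (retained_data.count(file) == 0) to_delete.insert(file);
      }
    }
  }
  return to_delete;
}

// Best-effort deletion: every path is attempted, failures are counted
// rather than propagated. `delete_file` returns true on success.
std::size_t DeleteBestEffort(
    const std::set<std::string>& paths,
    const std::function<bool(const std::string&)>& delete_file) {
  std::size_t failures = 0;
  for (const auto& path : paths) {
    if (!delete_file(path)) ++failures;
  }
  return failures;
}
```

In this sketch a data file shared between a deleted manifest and a retained one is never scheduled for deletion, and a failing delete does not stop the remaining paths from being attempted, which is the behavior the bullets above describe.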
## Test plan
- [x] All 303 existing tests pass
- [x] 9 expire snapshots tests pass (4 existing + 5 new)
- [x] `CleanupLevelNoneSkipsFileDeletion` — verifies kNone skips all deletion
- [x] `FinalizeSkippedOnCommitError` — verifies no cleanup on commit failure
- [x] `FinalizeSkippedWhenNoSnapshotsExpired` — verifies no cleanup when
nothing expired
- [x] `DeleteWithCustomFunction` — verifies custom delete function is invoked
- [x] `CommitWithCleanupLevelNone` — end-to-end commit with metadata update
Closes #364