shangxinli commented on code in PR #592:
URL: https://github.com/apache/iceberg-cpp/pull/592#discussion_r3035738291
##########
src/iceberg/update/expire_snapshots.h:
##########
@@ -159,6 +173,47 @@ class ICEBERG_EXPORT ExpireSnapshots : public
PendingUpdate {
Result<std::unordered_set<int64_t>> UnreferencedSnapshotIdsToRetain(
const SnapshotToRef& refs) const;
+ /// \brief Clean up files no longer referenced after snapshot expiration.
+ ///
+ /// Implements the "reachable file cleanup" strategy from Java's
ReachableFileCleanup:
+ /// 1. Collect manifests from expired and retained snapshots
+ /// 2. Prune manifests still referenced by retained snapshots
+ /// 3. Find data files only in manifests being deleted (if kAll)
+ /// 4. Remove data files still reachable from retained manifests
+ /// 5. Delete orphaned manifests, manifest lists, and statistics files
+ ///
+ /// All deletions are best-effort: failures are suppressed to avoid blocking
+ /// metadata updates (matching Java's suppressFailureWhenFinished behavior).
+ ///
+ /// Branch/tag awareness: retained_snapshot_ids includes all snapshots
referenced
+ /// by any branch or tag, as computed by Apply(). This prevents deleting
files
+ /// that are still reachable from any ref.
+ ///
+ /// TODO(shangxinli): Add multi-threaded file deletion support.
+ /// TODO(shangxinli): Add IncrementalFileCleanup strategy for linear
ancestry.
+ Status CleanExpiredFiles(const std::vector<int64_t>& expired_snapshot_ids);
+
+ /// \brief Read manifest paths from a single snapshot.
+ /// Best-effort: returns OK even if the snapshot or its manifests can't be
read.
+ Status ReadManifestsForSnapshot(int64_t snapshot_id,
Review Comment:
Done — introduced FileCleanupStrategy abstract base class and
ReachableFileCleanup concrete subclass in the anonymous namespace of
expire_snapshots.cc, mirroring Java's class hierarchy. They live in the .cc for
now since they're unlikely to be used elsewhere; we can promote to a header
when IncrementalFileCleanup is added.
##########
src/iceberg/update/expire_snapshots.h:
##########
@@ -159,6 +173,47 @@ class ICEBERG_EXPORT ExpireSnapshots : public
PendingUpdate {
Result<std::unordered_set<int64_t>> UnreferencedSnapshotIdsToRetain(
const SnapshotToRef& refs) const;
+ /// \brief Clean up files no longer referenced after snapshot expiration.
+ ///
+ /// Implements the "reachable file cleanup" strategy from Java's
ReachableFileCleanup:
+ /// 1. Collect manifests from expired and retained snapshots
+ /// 2. Prune manifests still referenced by retained snapshots
+ /// 3. Find data files only in manifests being deleted (if kAll)
+ /// 4. Remove data files still reachable from retained manifests
+ /// 5. Delete orphaned manifests, manifest lists, and statistics files
+ ///
+ /// All deletions are best-effort: failures are suppressed to avoid blocking
+ /// metadata updates (matching Java's suppressFailureWhenFinished behavior).
+ ///
+ /// Branch/tag awareness: retained_snapshot_ids includes all snapshots
referenced
+ /// by any branch or tag, as computed by Apply(). This prevents deleting
files
+ /// that are still reachable from any ref.
+ ///
+ /// TODO(shangxinli): Add multi-threaded file deletion support.
+ /// TODO(shangxinli): Add IncrementalFileCleanup strategy for linear
ancestry.
+ Status CleanExpiredFiles(const std::vector<int64_t>& expired_snapshot_ids);
+
+ /// \brief Read manifest paths from a single snapshot.
+ /// Best-effort: returns OK even if the snapshot or its manifests can't be
read.
+ Status ReadManifestsForSnapshot(int64_t snapshot_id,
+ std::unordered_set<std::string>&
manifest_paths);
+
+ /// \brief Find data files to delete by reading live entries from manifests
being
+ /// deleted, then subtracting files still reachable from retained manifests.
+ /// If a retained manifest cannot be read, returns an empty set to prevent
+ /// accidental data loss.
+ Result<std::unordered_set<std::string>> FindDataFilesToDelete(
Review Comment:
Done — ReadManifestsForSnapshot is now a private method of
ReachableFileCleanup only.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]