voonhous commented on code in PR #18359:
URL: https://github.com/apache/hudi/pull/18359#discussion_r3028728932


##########
rfc/rfc-100/rfc-100-blob-cleaner-problem.md:
##########
@@ -0,0 +1,745 @@
+# Blob Cleaner: Problem Statement
+
+## 1. Goal
+
+When old file slices are cleaned, out-of-line blob files they reference may 
become orphaned -- still
+consuming storage but unreachable by any query. The blob cleaner must identify 
and delete these
+unreferenced blob files without premature deletion (deleting a blob that is 
still referenced by a live
+record). This document defines the problem scope, design constraints, 
requirements, and illustrative
+failure modes. It contains no solution content.
+
+---
+
+## 2. Scope
+
+### In scope
+
+- Cleanup of **out-of-line blob files** when references to them exist only in 
expired (cleaned) file
+  slices.
+- All table types: **COW** and **MOR**.
+- All cleaning policies: `KEEP_LATEST_COMMITS`, `KEEP_LATEST_FILE_VERSIONS`,
+  `KEEP_LATEST_BY_HOURS`.
+- Interaction with table services: **compaction**, **clustering**, **blob 
compaction**.
+- Interaction with timeline operations: **savepoints**, **rollback**, 
**archival**.
+- Single-writer and multi-writer (OCC) concurrency modes.
+- Both **Hudi-created blobs** (stored under `{table}/.hoodie/blobs/...`) and 
**user-provided
+  external blobs** (arbitrary paths).
+
+### Two entry flows
+
+Blob cleanup must support two distinct entry flows. These are not edge cases 
of each other --
+they are co-equal paths with different properties, different volumes, and 
different cleanup costs.
+
+**Flow 1: Path-dispatched (Hudi-created blobs).** Blobs created by Hudi's 
write path and stored
+under `{table}/.hoodie/blobs/{partition}/{col}/{instant}/{blob_id}`. The path 
structure guarantees
+uniqueness (C11), file-group scoping, and eliminates cross-FG sharing for 
normal writes. This is the
+expected majority flow for Phase 3 workloads.
+
+**Flow 2: Non-path-dispatched (user-provided external blobs).** Users have 
existing blob files in
+external storage (e.g., `s3://media-bucket/videos/`, a shared NFS mount, or 
any user-controlled
+path). Records reference these blobs directly by path. The user does **not** 
want to bootstrap --
+they do not want Hudi to copy, move, or reorganize the blob files into 
`.hoodie/blobs/`. Hudi
+manages the *references*, not the *storage layout*. This is the expected 
primary flow for Phase 1
+workloads and remains a supported flow in Phase 3.
+
+The non-path-dispatched flow has fundamentally different properties:
+
+| Property                  | Path-dispatched (Hudi-created)    | Non-path-dispatched (external)       |
+|---------------------------|-----------------------------------|--------------------------------------|
+| Path uniqueness           | Guaranteed (instant in path, C11) | Not guaranteed (user controls)       |
+| Cross-FG sharing          | Does not occur (FG-scoped)        | Common (multiple records, same blob) |
+| Writer/cleaner race       | Cannot occur (D2)                 | Can occur (D3)                       |
+| Delete-and-re-add (C2)    | Eliminated                        | Real concern                         |
+| Volume                    | Scales with writes                | Can be large from day one            |
+| Per-FG cleanup sufficient | Yes                               | No -- cross-FG verification needed   |
+
+Any solution that treats the non-path-dispatched flow as a rare edge case will 
fail at scale for
+Phase 1 workloads. The cleanup algorithm must be efficient for **both** flows 
independently, and
+must not impose the cost structure of one flow on the other.
+
+### Out of scope
+
+- **Inline blobs.** Inline blob data lives inside the base/log file and is 
deleted when the file
+  slice is cleaned. No additional cleanup needed.
+- **Blob compaction internals.** Blob compaction (repacking partially-live 
container files) is a
+  separate service. This document defines the interface point (when to hand 
off to blob compaction)
+  but not its internal design.
+- **Schema evolution.** Adding or removing blob columns does not change the 
cleanup problem.
+
+### Stance on the `managed` flag
+
+The BlobReference schema includes a `managed` boolean field
+(`HoodieSchema.Blob.EXTERNAL_REFERENCE_IS_MANAGED`). The RFC states that only 
managed blobs are
+cleaned. This document acknowledges the flag and treats it as a **filter** -- 
unmanaged blobs are
+excluded from cleanup consideration. However, the cleanup design must be 
**correct regardless of the
+flag's value**. The flag selects *which* blobs enter the cleanup pipeline; it 
must not be used as a
+correctness lever within the pipeline itself. The flag may later serve as an 
optimization (skip
+cleanup work for unmanaged blobs), but the problem statement and any solution 
must not depend on it
+for safety.
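To make the filter-not-correctness-lever stance concrete, here is a minimal sketch (plain Python; the record shape and field names are illustrative, not Hudi's actual types):

```python
# The `managed` flag acts only as an admission filter: it decides which
# blobs enter the cleanup pipeline. Liveness logic downstream must not
# read it. (Record shape invented for illustration.)

blobs = [
    {"path": "s3://media-bucket/videos/a.mp4", "managed": False},
    {"path": "/tbl/.hoodie/blobs/p/img/0001/b0", "managed": True},
]

# Filter step: unmanaged blobs never become cleanup candidates.
candidates = [b["path"] for b in blobs if b["managed"]]
```

Whatever liveness verification follows operates on `candidates` alone; correctness must hold even if every blob had been admitted.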
+
+---
+
+## 3. Background: Existing Cleaner
+
+The existing Hudi cleaner provides the execution framework that blob cleanup 
must integrate with.
+
+### Plan-execute model
+
+Cleaning is a two-phase operation:
+
+1. **Plan** (`CleanPlanner`): For each partition and file group, determine 
which file slices are
+   expired based on the cleaning policy. Produce a `HoodieCleanerPlan` listing 
file paths to delete.
+2. **Execute** (`CleanActionExecutor`): Delete the files listed in the plan. 
Record results in
+   `HoodieCleanMetadata` on the timeline.
+
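As a rough illustrative model of the two-phase split (plain Python, not Hudi's actual API; `plan_clean`, `execute_clean`, and the dict shapes are invented for this sketch):

```python
# Illustrative model of the plan-execute split. plan_clean inspects
# state and only *lists* deletions (analogous to HoodieCleanerPlan);
# execute_clean performs them and records outcomes (analogous to
# HoodieCleanMetadata).

def plan_clean(file_groups, retention):
    """Phase 1: decide which file slices are expired. No delete side effects."""
    plan = []
    for fg in file_groups:
        # Keep the newest `retention` slices; mark the rest for deletion.
        expired = sorted(fg["slices"], key=lambda s: s["instant"])[:-retention]
        plan.extend(s["path"] for s in expired)
    return plan

def execute_clean(plan, delete_fn):
    """Phase 2: delete the planned paths, record successes and failures."""
    metadata = {"deleted": [], "failed": []}
    for path in plan:
        (metadata["deleted"] if delete_fn(path) else metadata["failed"]).append(path)
    return metadata

fgs = [{"slices": [{"instant": "001", "path": "p/f1_001"},
                   {"instant": "002", "path": "p/f1_002"},
                   {"instant": "003", "path": "p/f1_003"}]}]
plan = plan_clean(fgs, retention=2)                    # ["p/f1_001"]
meta = execute_clean(plan, delete_fn=lambda p: True)
```

The separation matters for blob cleanup: any blob-liveness decision must be expressible at plan time, before any file is touched.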
+### Per-partition, per-file-group iteration
+
+`CleanPlanner.getDeletePaths(partitionPath, earliestCommitToRetain)` iterates 
file groups within a
+partition. For each file group, it compares file slices against the retention 
policy and produces a
+list of `CleanFileInfo` objects (file paths to delete). The cleaner has no 
concept of cross-file-group
+dependencies.
+
+### Savepoint awareness
+
+The cleaner collects all savepointed timestamps and their associated data 
files. File slices that
+overlap with savepointed files are excluded from cleaning
+(`isFileSliceExistInSavepointedFiles`). This preserves the savepoint 
invariant: a savepoint freezes a
+consistent snapshot including all data files it references.
+
+### OCC conflict resolution
+
+`SimpleConcurrentFileWritesConflictResolutionStrategy` resolves write-write 
conflicts at the
+`(partition, fileId)` granularity. There is no global serialization point. 
Concurrent writers to
+different file groups proceed without contention.
+
+### Critical gap
+
+The existing cleaner operates on file paths (base files + log files) within a 
single file group. It
+has **no concept of transitive references** -- it does not know that a file 
slice contains pointers
+to external blob files that may need separate cleanup. Blob cleanup requires 
extending the cleaner
+to follow these references and determine blob-level liveness.
+
+---
+
+## 4. Design Constraints
+
+Each constraint is a fact about the Hudi system that any blob cleanup solution 
must respect. Violating
+any constraint leads to data corruption, premature deletion, or permanent 
orphans.
+
+### C1: Blob immutability
+
+Once a blob file is written, its content never changes. Blob files are 
append-once, read-many. This
+means a blob file's identity is stable for its entire lifetime.
+
+*Source: RFC-100 blob cleaner design, general storage semantics.*
+
+### C2: Delete-and-re-add same path
+
+A blob file can be deleted from storage and a new file created at the same 
path with different
+content. This is a real concern for user-provided external blobs (the user 
controls the path). For
+Hudi-created blobs, it is structurally eliminated by C11 (instant in path 
guarantees uniqueness).
+
+*Source: RFC-100 blob cleaner design; alternatives analysis constraint C2.*
+
+### C3: Cross-file-group blob sharing
+
+An out-of-line blob can be referenced by records in multiple file groups and 
multiple partitions. This
+is explicitly supported for user-provided external blobs: two records in 
different file groups can
+point to the same external file. For Hudi-created blobs, cross-FG sharing does 
not occur because the
+blob is created within a specific file group's storage scope (see C11). 
However, after clustering
+(C8), references to the same Hudi-created blob could temporarily exist in both 
the source and target
+file groups until the source is cleaned.
+
+*Source: RFC-100 lines 196-198 (Option 1 scans all active file slices); 
alternatives analysis F6.*
+
+### C4: Container files
+
+Multiple blobs can be packed into a single container file, distinguished by 
`(offset, length)` within
+the BlobReference. A container file can only be deleted when **all** byte 
ranges within it are
+unreferenced. If some ranges are orphaned but others are still live, the 
container cannot be deleted --
+it must be handed off to blob compaction for repacking.
+
+*Source: BlobReference schema fields `offset` and `length`; RFC-100 lines 
164-165 (container config);
+alternatives analysis F1.*
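The all-or-nothing rule for containers can be sketched as a small decision function (invented helper, not Hudi code):

```python
# A container may be deleted only when *every* packed byte range is
# unreferenced; a partially-live container is handed to blob compaction
# for repacking instead of being deleted.

def container_action(packed_ranges, live_refs):
    """packed_ranges: all (offset, length) ranges stored in the container.
    live_refs: (offset, length) ranges still referenced by live records."""
    live = set(packed_ranges) & set(live_refs)
    if not live:
        return "DELETE"            # fully orphaned: safe to remove
    if live == set(packed_ranges):
        return "KEEP"              # fully live: nothing to do
    return "BLOB_COMPACTION"       # mixed: hand off for repacking

ranges = [(0, 100), (100, 50), (150, 200)]
container_action(ranges, live_refs=[])            # fully orphaned -> "DELETE"
container_action(ranges, live_refs=[(100, 50)])   # mixed -> "BLOB_COMPACTION"
```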
+
+### C5: MOR log updates shadow base file blob refs
+
+In MOR tables, a log file update to a record's blob reference supersedes the 
base file's blob
+reference for that record. The base file's blob ref appears live (it exists in 
an active file slice)
+but is actually dead (the log update replaced it). Reading only the base file 
produces a **superset**
+of live references. Over-retention (keeping the shadowed blob longer) is safe. 
Under-retention
+(treating the log-shadowed base ref as already cleaned) would cause premature 
deletion if the log
+update is later rolled back.
+
+*Source: RFC-100 line 122 (merge mode determines which blob reference is 
returned); MOR semantics.*
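A toy model of the shadowing effect (invented structures, not Hudi's merge path) shows why a base-only scan errs toward safe over-retention:

```python
# Log updates shadow base-file blob refs per record key. Of the refs
# present in the base file, a base-only scan treats all of them as live,
# even though some are shadowed (dead) -- a superset, never a subset.

base_refs = {"k1": "blob_a", "k2": "blob_b"}   # refs in the base file
log_updates = {"k2": "blob_c"}                 # log replaces k2's ref

# True live refs after merge: the log wins where present.
live = {k: log_updates.get(k, v) for k, v in base_refs.items()}

base_only = set(base_refs.values())            # {"blob_a", "blob_b"}
truly_live = set(live.values())                # {"blob_a", "blob_c"}

# blob_b is shadowed: it looks live from the base file but is dead.
# Retaining it is the safe direction -- if the log update is later
# rolled back, k2 points at blob_b again and the blob must still exist.
shadowed = base_only - truly_live              # {"blob_b"}
```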
+
+### C6: Existing cleaner is per-file-group scoped
+
+`CleanPlanner` iterates per `HoodieFileGroup` within each partition. It 
determines expired file slices
+within a single file group. There is no existing mechanism to evaluate 
cross-file-group dependencies
+during cleaning.
+
+*Source: `CleanPlanner.getDeletePaths()`, 
`CleanPlanner.getFilesToCleanKeepingLatestCommits()`;
+alternatives analysis F11.*
+
+### C7: OCC is per-file-group (no global contention allowed)
+
+Concurrent writer conflict resolution operates at `(partition, fileId)` 
granularity. Any solution that
+introduces a global contention point (global counter, global lock, global 
bitmap) violates this
+constraint and degrades write throughput under concurrency.
+
+*Source: `SimpleConcurrentFileWritesConflictResolutionStrategy`; alternatives 
analysis F12.*
+
+### C8: Clustering moves blob refs between file groups
+
+Clustering reads records from source file groups and rewrites them to target 
file groups. For
+Hudi-managed blobs, clustering creates **new** blob files in the target file 
group. For external
+blobs, clustering copies the pointer (same path, same offset/length) to the 
target file group. After
+clustering, the source file group's slices still reference the original blobs 
until those slices are
+cleaned. The target file group's slices reference either new blobs 
(Hudi-managed) or the same
+external blobs.
+
+*Source: RFC-100 lines 212-214.*
+
+### C9: Savepoints freeze file slices and their blob refs
+
+A savepoint preserves a consistent snapshot. File slices covered by a 
savepoint are excluded from
+cleaning. This means any blob referenced by a savepointed file slice must also 
be preserved, even if
+the blob would otherwise be considered orphaned. The cleaner already handles 
savepoint exclusion for
+file slices; blob cleanup must extend this guarantee to the blobs they 
reference.
+
+*Source: `CleanPlanner.savepointedTimestamps`, 
`isFileSliceExistInSavepointedFiles()`.*
+
+### C10: Rollback can invalidate or resurrect references
+
+Rolling back a commit can remove file slices that were the sole reference to a 
blob (the blob becomes
+orphaned). Conversely, rolling back a commit that updated a record's blob 
reference can resurrect the
+previous reference (an older blob that appeared orphaned is now live again). 
Any blob cleanup solution
+must account for both directions.
+
+*Source: Hudi rollback semantics; timeline management.*
+
+### C11: Hudi-created blob paths include instant (structurally unique)
+
+Hudi-created blob files are stored at
+`{table_path}/.hoodie/blobs/{partition}/{column_name}/{instant}/{blob_id}`. 
Because the commit
+instant is embedded in the path, two different writes always produce different 
blob paths. This
+eliminates the delete-and-re-add problem (C2) for Hudi-created blobs and means 
they are inherently
+scoped to a single file group's write context.
+
+*Source: RFC-100 line 170; alternatives analysis F3.*
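The structural-uniqueness argument can be stated in a few lines (format string follows the path layout quoted above; `blob_path` is an invented helper):

```python
# Because the commit instant is a path component, two writes can never
# produce the same blob path, so a deleted path cannot be silently
# re-created with different content (the C2 delete-and-re-add problem).

def blob_path(table, partition, column, instant, blob_id):
    return f"{table}/.hoodie/blobs/{partition}/{column}/{instant}/{blob_id}"

p1 = blob_path("/tbl", "2024", "img", "0001", "b0")
p2 = blob_path("/tbl", "2024", "img", "0002", "b0")  # same blob_id, later instant
assert p1 != p2
```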
+
+### C12: Archival removes commit metadata from active timeline
+
+Hudi's archival process moves completed commits from the active timeline to 
the archived timeline.

Review Comment:
   I am thinking of storing blob paths to delete in a metadata partition instead, where it gets occasionally cleaned. It's a more optimized format, IIUC.



##########
rfc/rfc-100/rfc-100-blob-cleaner-problem.md:
##########

Review Comment:
   I am thinking of storing blob paths to delete in a metadata partition instead, where it gets occasionally cleaned. It's a more optimized format, IIUC.
   
   Will need to investigate this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to