voonhous commented on code in PR #18359:
URL: https://github.com/apache/hudi/pull/18359#discussion_r3026527966


##########
rfc/rfc-100/rfc-100-blob-cleaner-problem.md:
##########
@@ -0,0 +1,745 @@
+# Blob Cleaner: Problem Statement
+
+## 1. Goal
+
+When old file slices are cleaned, out-of-line blob files they reference may become orphaned -- still
+consuming storage but unreachable by any query. The blob cleaner must identify and delete these
+unreferenced blob files without premature deletion (deleting a blob that is still referenced by a live
+record). This document defines the problem scope, design constraints, requirements, and illustrative
+failure modes. It contains no solution content.
+
+---
+
+## 2. Scope
+
+### In scope
+
+- Cleanup of **out-of-line blob files** when references to them exist only in expired (cleaned) file
+  slices.
+- All table types: **COW** and **MOR**.
+- All cleaning policies: `KEEP_LATEST_COMMITS`, `KEEP_LATEST_FILE_VERSIONS`,
+  `KEEP_LATEST_BY_HOURS`.
+- Interaction with table services: **compaction**, **clustering**, **blob compaction**.
+- Interaction with timeline operations: **savepoints**, **rollback**, **archival**.
+- Single-writer and multi-writer (OCC) concurrency modes.
+- Both **Hudi-created blobs** (stored under `{table}/.hoodie/blobs/...`) and **user-provided
+  external blobs** (arbitrary paths).
+
+### Two entry flows
+
+Blob cleanup must support two distinct entry flows. These are not edge cases of each other --
+they are co-equal paths with different properties, different volumes, and different cleanup costs.
+
+**Flow 1: Path-dispatched (Hudi-created blobs).** Blobs created by Hudi's write path and stored
+under `{table}/.hoodie/blobs/{partition}/{col}/{instant}/{blob_id}`. The path structure guarantees
+uniqueness (C11) and file-group scoping, and eliminates cross-FG sharing for normal writes. This is
+the expected majority flow for Phase 3 workloads.
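As an illustration only (the helper names below are hypothetical, not Hudi APIs), the path layout above can be decomposed like this:

```python
from typing import NamedTuple, Optional

class BlobPath(NamedTuple):
    # Components of {table}/.hoodie/blobs/{partition}/{col}/{instant}/{blob_id}
    partition: str
    col: str
    instant: str
    blob_id: str

def parse_blob_path(table_base: str, path: str) -> Optional[BlobPath]:
    """Return the path components for a Hudi-created blob, or None when the
    path is outside the managed blob root (i.e., a Flow 2 external blob)."""
    root = table_base.rstrip("/") + "/.hoodie/blobs/"
    if not path.startswith(root):
        return None
    parts = path[len(root):].split("/")
    return BlobPath(*parts) if len(parts) == 4 else None
```

Because the instant is embedded in the path, two different writes can never produce the same blob path, which is what makes uniqueness (C11) a structural guarantee rather than a runtime check.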
+
+**Flow 2: Non-path-dispatched (user-provided external blobs).** Users have existing blob files in
+external storage (e.g., `s3://media-bucket/videos/`, a shared NFS mount, or any user-controlled
+path). Records reference these blobs directly by path. The user does **not** want to bootstrap --
+they do not want Hudi to copy, move, or reorganize the blob files into `.hoodie/blobs/`. Hudi
+manages the *references*, not the *storage layout*. This is the expected primary flow for Phase 1
+workloads and remains a supported flow in Phase 3.
+
+The non-path-dispatched flow has fundamentally different properties:
+
+| Property                  | Path-dispatched (Hudi-created)    | Non-path-dispatched (external)       |
+|---------------------------|-----------------------------------|--------------------------------------|
+| Path uniqueness           | Guaranteed (instant in path, C11) | Not guaranteed (user controls)       |
+| Cross-FG sharing          | Does not occur (FG-scoped)        | Common (multiple records, same blob) |
+| Writer/cleaner race       | Cannot occur (D2)                 | Can occur (D3)                       |
+| Delete-and-re-add (C2)    | Eliminated                        | Real concern                         |
+| Volume                    | Scales with writes                | Can be large from day one            |
+| Per-FG cleanup sufficient | Yes                               | No -- cross-FG verification needed   |
+
+Any solution that treats the non-path-dispatched flow as a rare edge case will fail at scale for
+Phase 1 workloads. The cleanup algorithm must be efficient for **both** flows independently, and
+must not impose the cost structure of one flow on the other.
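A minimal sketch of that separation (function and variable names are hypothetical, not from the Hudi codebase): candidates are split by flow up front, so path-dispatched blobs take the cheap per-FG path and only external blobs pay for cross-FG reference verification.

```python
def is_path_dispatched(table_base: str, blob_path: str) -> bool:
    # Flow 1 blobs live under the Hudi-managed blob root.
    return blob_path.startswith(table_base.rstrip("/") + "/.hoodie/blobs/")

def split_by_flow(table_base, candidate_blobs):
    """Partition cleanup candidates so each flow keeps its own cost structure:
    per-FG checks for Flow 1, cross-FG verification for Flow 2."""
    per_fg, cross_fg = [], []
    for blob in candidate_blobs:
        (per_fg if is_path_dispatched(table_base, blob) else cross_fg).append(blob)
    return per_fg, cross_fg
```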
+
+### Out of scope
+
+- **Inline blobs.** Inline blob data lives inside the base/log file and is 
deleted when the file
+  slice is cleaned. No additional cleanup needed.
+- **Blob compaction internals.** Blob compaction (repacking partially-live 
container files) is a
+  separate service. This document defines the interface point (when to hand 
off to blob compaction)
+  but not its internal design.
+- **Schema evolution.** Adding or removing blob columns does not change the 
cleanup problem.
+
+### Stance on the `managed` flag
+
+The BlobReference schema includes a `managed` boolean field
+(`HoodieSchema.Blob.EXTERNAL_REFERENCE_IS_MANAGED`). The RFC states that only managed blobs are
+cleaned. This document acknowledges the flag and treats it as a **filter** -- unmanaged blobs are
+excluded from cleanup consideration. However, the cleanup design must be **correct regardless of the
+flag's value**. The flag selects *which* blobs enter the cleanup pipeline; it must not be used as a
+correctness lever within the pipeline itself. The flag may later serve as an optimization (skip
+cleanup work for unmanaged blobs), but the problem statement and any solution must not depend on it
+for safety.
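The filter-vs-safety distinction can be sketched as follows (names hypothetical; `is_still_referenced` stands in for whatever liveness check the eventual design uses):

```python
def cleanup_candidates(blob_refs):
    # The `managed` flag only gates entry into the pipeline.
    return [ref for ref in blob_refs if ref.get("managed", False)]

def safe_to_delete(ref, is_still_referenced):
    """Deletion safety comes from reference verification alone; the
    `managed` flag plays no role here, by design."""
    return not is_still_referenced(ref["path"])
```

Keeping `safe_to_delete` ignorant of the flag is what allows `managed` to evolve into a pure optimization later without ever becoming a correctness dependency.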
+
+---
+
+## 3. Background: Existing Cleaner
+
+The existing Hudi cleaner provides the execution framework that blob cleanup must integrate with.
+
+### Plan-execute model
+
+Cleaning is a two-phase operation:
+
+1. **Plan** (`CleanPlanner`): For each partition and file group, determine which file slices are

Review Comment:
   Uhm, this is just a high-level recap of how cleaning works -- no new design specifics. It's just a recap of what we want to focus on.
   
   This isn't part of the main doc anyway; it's an appendix.
   
   Will strip accordingly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

Reply via email to