rahil-c commented on code in PR #18359:
URL: https://github.com/apache/hudi/pull/18359#discussion_r2965886303

##########
rfc/rfc-100/rfc-100-blob-cleaner-design.md:
##########
@@ -0,0 +1,749 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements. See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License. You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+# RFC-100 Part 2: Blob Cleanup for Unstructured Data
+
+## Proposers
+
+- @voon
+
+## Approvers
+
+- (TBD)
+
+## Status
+
+Issue: <Link to GH feature issue>
+
+> Please keep the status updated in `rfc/README.md`.
+
+---
+
+## Abstract
+
+When Hudi cleans expired file slices, out-of-line blob files they reference may become orphaned --
+still consuming storage but unreachable by any query. This RFC extends the existing file slice
+cleaner to identify and delete these orphaned blob files safely and efficiently. The design uses a
+three-stage pipeline: (1) per-file-group set-difference to find locally-orphaned blobs, (2) an MDT
+secondary index lookup for cross-file-group verification of externally-referenced blobs, and (3)
+container file lifecycle resolution. For Hudi-created blobs, cleanup is essentially free -- structural
+path uniqueness eliminates cross-file-group concerns entirely. For user-provided external blobs,
+targeted index lookups scale with the number of candidates, not the table size. Tables without blob
+columns pay zero cost.
+
+---
+
+## Background
+
+### Why Blob Cleanup Is Needed
+
+RFC-100 introduces out-of-line blob storage for unstructured data (images, video, documents). A
+record's `BlobReference` field points to an external blob file by `(path, offset, length)`. When
+the cleaner expires old file slices, the blob files they reference may no longer be needed -- but the
+existing cleaner has no concept of transitive references. It deletes file slices without considering
+the blob files they point to. Without blob cleanup, orphaned blobs accumulate indefinitely.
+
+### Two Blob Flows
+
+Blob cleanup must support two distinct entry flows with fundamentally different properties:
+
+**Flow 1 -- Hudi-created blobs.** Blobs created by Hudi's write path, stored at
+`{table}/.hoodie/blobs/{partition}/{col}/{instant}/{blob_id}`. The commit instant in the path
+guarantees uniqueness (C11), and blobs are scoped to a single file group (P3). Cross-file-group
+sharing does not occur. This is the expected majority flow for Phase 3 workloads.
+
+**Flow 2 -- User-provided external blobs.** Users have existing blob files in external storage
+(e.g., `s3://media-bucket/videos/`). Records reference these blobs directly by path. Hudi manages
+the *references*, not the *storage layout*. Cross-file-group sharing is common -- multiple records
+across different file groups can point to the same blob. This is the expected primary flow for
+Phase 1 workloads.
+
+| Property                  | Flow 1 (Hudi-created)             | Flow 2 (External)                    |
+|---------------------------|-----------------------------------|--------------------------------------|
+| Path uniqueness           | Guaranteed (instant in path, C11) | Not guaranteed (user controls)       |
+| Cross-FG sharing          | Does not occur (FG-scoped)        | Common (multiple records, same blob) |
+| Writer/cleaner race       | Cannot occur (D2)                 | Can occur (D3)                       |
+| Per-FG cleanup sufficient | Yes                               | No -- cross-FG verification needed   |
+
+### Constraints and Requirements Reference
+
+Full descriptions and failure modes in [Appendix B](rfc-100-blob-cleaner-problem.md).
+
+| ID  | Constraint                                      | Flow 1 | Flow 2 | Remarks                      |
+|-----|-------------------------------------------------|--------|--------|------------------------------|
+| C1  | Blob immutability (append-once, read-many)      | Y      | Y      |                              |
+| C2  | Delete-and-re-add same path                     | --     | Y      | Eliminated for Flow 1 by C11 |
+| C3  | Cross-file-group blob sharing                   | --     | Y      | Common for external blobs    |
+| C4  | Container files (`(offset, length)` ranges)     | Y      | Y      |                              |
+| C5  | MOR log updates shadow base file blob refs      | Y      | Y      |                              |
+| C6  | Existing cleaner is per-file-group scoped       | Y      | Y      |                              |
+| C7  | OCC is per-file-group                           | Y      | Y      | No global contention allowed |
+| C8  | Clustering moves blob refs between file groups  | Y      | Y      |                              |
+| C9  | Savepoints freeze file slices and blob refs     | Y      | Y      |                              |
+| C10 | Rollback can invalidate or resurrect references | Y      | Y      |                              |
+| C11 | Blob paths include commit instant               | Y      | --     | Eliminates C2, C3, C13       |
+| C12 | Archival removes commit metadata                | Y      | Y      |                              |
+| C13 | Cross-FG verification needed at scale           | --     | Y      |                              |
+
+| ID  | Requirement                                                      |
+|-----|------------------------------------------------------------------|
+| R1  | No premature deletion (hard invariant)                           |
+| R2  | No permanent orphans (bounded cleanup)                           |
+| R3  | Container awareness (range-level liveness)                       |
+| R4  | MOR correctness (over-retention acceptable, under-retention not) |
+| R5  | Concurrency safety (no global serialization)                     |
+| R6  | Scale proportional to work, not table size                       |
+| R7  | No cost for non-blob tables                                      |
+| R8  | All cleaning policies supported                                  |
+| R9  | Crash safety and idempotency                                     |
+| R10 | Observability (metrics for deleted, retained, reclaimed)         |
+
+---
+
+## Design Overview
+
+### Design Philosophy
+
+Blob cleanup extends the existing `CleanPlanner` / `CleanActionExecutor` pipeline -- same timeline
+instant, same plan-execute-complete lifecycle, same crash recovery and OCC integration. A
+`hasBlobColumns()` check gates all blob logic so non-blob tables pay zero cost.
+
+The two flows have different cost structures, and the design keeps them separate. Flow 1
+(Hudi-created blobs) gets per-FG cleanup with no cross-FG overhead. Flow 2 (external blobs) gets
+targeted cross-FG verification via MDT secondary index. Dispatch is a string prefix check on the
+blob path.
+
+### Three-Stage Pipeline
+
+| Stage       | Scope                | Purpose                                                                          | When it runs                           |
+|-------------|----------------------|----------------------------------------------------------------------------------|----------------------------------------|
+| **Stage 1** | Per-file-group       | Collect expired/retained blob refs, compute set difference, dispatch by category | Always (for blob tables)               |
+| **Stage 2** | Cross-file-group     | Verify external blob candidates against MDT secondary index or fallback scan     | Only when external candidates exist    |
+| **Stage 3** | Container resolution | Determine delete vs. flag-for-compaction at the container level                  | Only when container blobs are involved |


Review Comment:
   @voonhous forgive me if this is a beginner question but what does "container" blob mean?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
