This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch release-1.2.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 657405a944801634dccb23d42c32e3ddad347d74
Author: Rahil C <[email protected]>
AuthorDate: Thu May 14 15:05:31 2026 -0700

    feat(blob): RFC-100: Clarify inline vs out-of-line blob read behavior (#18728)

    Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
    Co-authored-by: Y Ethan Guo <[email protected]>
---
 rfc/rfc-100/rfc-100.md | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/rfc/rfc-100/rfc-100.md b/rfc/rfc-100/rfc-100.md
index e66ab97c57a4..2c637ccb2db2 100644
--- a/rfc/rfc-100/rfc-100.md
+++ b/rfc/rfc-100/rfc-100.md
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can leverage to materiali
 SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
 ```
 
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing raw blob bytes in a query. It always returns the underlying `bytes` regardless of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT *`) returns the underlying `Blob` struct as-is. The contents of that struct depend on the storage strategy, the file format, and the read mode, as summarized below.
+
+**Reader Configuration**
+
+- `hoodie.read.blob.inline.mode` — values `CONTENT` (default) | `DESCRIPTOR`.
+  - `CONTENT`: the engine eagerly materializes inline bytes into the struct's `data` field.
+  - `DESCRIPTOR`: the engine returns an `OUT_OF_LINE`-shaped descriptor in the `reference` field where the underlying file format supports it (Lance today), enabling lazy byte materialization via `read_blob`. For file formats without a native descriptor for inline payloads (Parquet), both `data` and `reference` are returned `NULL`, and the caller must use `read_blob` to retrieve bytes.
+  - This config governs `INLINE` reads only. For `OUT_OF_LINE` storage, the engine always returns a populated `reference` regardless of this setting.
+
+**Behavior matrix**
+
+| Access pattern | Storage | File format | `hoodie.read.blob.inline.mode` | `data` field | `reference` field | Raw bytes available? |
+|------------------|--------------|-------------|--------------------------------|--------------|------------------------------|---------------------------------------------------|
+| `SELECT read_blob(col) FROM table` | INLINE | Parquet | (any) | n/a | n/a | Yes — returns bytes |
+| `SELECT read_blob(col) FROM table` | INLINE | Lance | (any) | n/a | n/a | Yes — returns bytes |
+| `SELECT read_blob(col) FROM table` | OUT_OF_LINE | (any) | (any) | n/a | n/a | Yes — returns bytes |
+| `SELECT col FROM table` | INLINE | Parquet | `CONTENT` (default) | bytes | NULL | Yes — via `data` |
+| `SELECT col FROM table` | INLINE | Parquet | `DESCRIPTOR` | **NULL** | **NULL** | No — must call `read_blob` |
+| `SELECT col FROM table` | INLINE | Lance | `CONTENT` (default) | bytes | NULL | Yes — via `data` |
+| `SELECT col FROM table` | INLINE | Lance | `DESCRIPTOR` | NULL | populated (Lance blob enc.) | No — descriptor visible; use `read_blob` for bytes |
+| `SELECT col FROM table` | OUT_OF_LINE | (any) | (irrelevant) | NULL | populated | No — must call `read_blob` |
+
+**Why Parquet and Lance differ in `DESCRIPTOR` mode**
+
+Lance's native blob encoding stores blobs in a way that already exposes a `(file, offset, length)` descriptor cheaply, so `DESCRIPTOR` mode surfaces it directly in the `reference` field — effectively letting INLINE blobs be read with the same deferred-materialization path used for OUT_OF_LINE references. Parquet has no equivalent native descriptor for an inline byte array, so both fields are `NULL` in `DESCRIPTOR` mode and the caller must use `read_blob` to materialize bytes.
+
+**Visual**
+
+```
+  ┌──────────────────────────────────────────────────────────────────┐
+  │  read_blob(col)        ── universal, always materializes bytes ──│
+  │         │                                                        │
+  │         ▼                                                        │
+  │  ┌─────────────┐     INLINE ───► read inline payload             │
+  │  │ Hudi reader │ ──┤                                             │
+  │  └─────────────┘     OUT_OF_LINE ► follow reference → read bytes │
+  └──────────────────────────────────────────────────────────────────┘
+
+  ┌──────────────────────────────────────────────────────────────────┐
+  │  SELECT col  (returns Blob struct as-is)                         │
+  │       │                                                          │
+  │       ▼                                                          │
+  │  storage = OUT_OF_LINE ───────────────► data=NULL, reference=set │
+  │                                                                  │
+  │  storage = INLINE,                                               │
+  │    inline.mode = CONTENT (default) ───► data=<bytes>, ref=NULL   │
+  │                                                                  │
+  │  storage = INLINE,                                               │
+  │    inline.mode = DESCRIPTOR                                      │
+  │      ├─ Parquet ──────────────────────► data=NULL, ref=NULL      │
+  │      └─ Lance   ──────────────────────► data=NULL, ref=set       │
+  └──────────────────────────────────────────────────────────────────┘
+```
+
 ### 3. Writer
 #### Phase 1: External Blob Support
 The writer will be updated to support writing blob data as out-of-line references.
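
The `SELECT col` rows of the behavior matrix added by this commit can be modeled as a small lookup. The following is a minimal Python sketch for illustration only, not Hudi code: `BlobStruct`, `select_blob_struct`, `raw_bytes_available`, `DESCRIPTOR_FORMATS`, and the placeholder reference strings are all hypothetical names. It assumes that any file format other than Lance behaves like Parquet in `DESCRIPTOR` mode, which the RFC text suggests but does not enumerate.

```python
from typing import NamedTuple, Optional


class BlobStruct(NamedTuple):
    """Hypothetical stand-in for the Blob struct returned by `SELECT col`."""
    data: Optional[bytes]      # inline bytes, when eagerly materialized
    reference: Optional[str]   # placeholder for a populated descriptor


# Assumption: formats with a native descriptor for inline payloads (Lance today).
DESCRIPTOR_FORMATS = {"Lance"}


def select_blob_struct(storage: str, file_format: str,
                       inline_mode: str = "CONTENT",
                       payload: bytes = b"") -> BlobStruct:
    """Model which struct fields `SELECT col` populates, per the matrix."""
    if storage == "OUT_OF_LINE":
        # The inline mode is irrelevant here; the reference is always populated.
        return BlobStruct(data=None, reference="<out-of-line reference>")
    if storage != "INLINE":
        raise ValueError(f"unknown storage strategy: {storage}")
    if inline_mode == "CONTENT":
        # Default mode: inline bytes land in the `data` field.
        return BlobStruct(data=payload, reference=None)
    if inline_mode == "DESCRIPTOR":
        if file_format in DESCRIPTOR_FORMATS:
            return BlobStruct(data=None, reference="<native descriptor>")
        # No native descriptor (e.g. Parquet): both fields are NULL.
        return BlobStruct(data=None, reference=None)
    raise ValueError(f"unknown inline mode: {inline_mode}")


def raw_bytes_available(blob: BlobStruct) -> bool:
    """Bytes are directly usable only when `data` is populated."""
    return blob.data is not None
```

In every case where `raw_bytes_available` returns `False`, the matrix says the caller must use `read_blob(col)` to obtain the bytes.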
