This is an automated email from the ASF dual-hosted git repository.
yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new 50efb2edc51b feat(blob): RFC-100: Clarify inline vs out-of-line blob read behavior (#18728)
50efb2edc51b is described below
commit 50efb2edc51b9cb94511cb18e63c80b006100f99
Author: Rahil C <[email protected]>
AuthorDate: Thu May 14 15:05:31 2026 -0700
feat(blob): RFC-100: Clarify inline vs out-of-line blob read behavior (#18728)
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Co-authored-by: Y Ethan Guo <[email protected]>
---
rfc/rfc-100/rfc-100.md | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)
diff --git a/rfc/rfc-100/rfc-100.md b/rfc/rfc-100/rfc-100.md
index e66ab97c57a4..2c637ccb2db2 100644
--- a/rfc/rfc-100/rfc-100.md
+++ b/rfc/rfc-100/rfc-100.md
@@ -128,6 +128,67 @@ For Spark SQL we will provide a function that the user can leverage to materiali
SELECT id, url, read_blob(image_blob) as image_bytes FROM my_table;
```
+#### Read Modes: `read_blob` vs. `SELECT *`
+
+`read_blob(<blob_column>)` is the canonical, universal API for materializing raw blob bytes in a query. It always returns the underlying `bytes` regardless of:
+- Storage strategy (`INLINE` vs `OUT_OF_LINE`)
+- Base file format (Parquet, Lance, …)
+- Any reader-side config such as `hoodie.read.blob.inline.mode`
+
+Selecting the blob column directly (e.g. `SELECT image_blob FROM t` or `SELECT *`) returns the underlying `Blob` struct as-is. The contents of that struct depend on the storage strategy, the file format, and the read mode, as summarized below.
+
+**Reader Configuration**
+
+- `hoodie.read.blob.inline.mode`: values `CONTENT` (default) | `DESCRIPTOR`.
+  - `CONTENT`: the engine eagerly materializes inline bytes into the struct's `data` field.
+  - `DESCRIPTOR`: the engine returns an `OUT_OF_LINE`-shaped descriptor in the `reference` field where the underlying file format supports it (Lance today), enabling lazy byte materialization via `read_blob`. For file formats without a native descriptor for inline payloads (Parquet), both `data` and `reference` are returned as `NULL`, and the caller must use `read_blob` to retrieve bytes.
+  - This config governs `INLINE` reads only. For `OUT_OF_LINE` storage, the engine always returns a populated `reference` regardless of this setting.
+
+**Behavior matrix**
+
+| Access pattern | Storage | File format | `hoodie.read.blob.inline.mode` | `data` field | `reference` field | Raw bytes available? |
+|------------------|--------------|-------------|--------------------------------|--------------|------------------------------|---------------------------------------------------|
+| `SELECT read_blob(col) FROM table` | INLINE | Parquet | (any) | n/a | n/a | Yes, returns bytes |
+| `SELECT read_blob(col) FROM table` | INLINE | Lance | (any) | n/a | n/a | Yes, returns bytes |
+| `SELECT read_blob(col) FROM table` | OUT_OF_LINE | (any) | (any) | n/a | n/a | Yes, returns bytes |
+| `SELECT col FROM table` | INLINE | Parquet | `CONTENT` (default) | bytes | NULL | Yes, via `data` |
+| `SELECT col FROM table` | INLINE | Parquet | `DESCRIPTOR` | **NULL** | **NULL** | No, must call `read_blob` |
+| `SELECT col FROM table` | INLINE | Lance | `CONTENT` (default) | bytes | NULL | Yes, via `data` |
+| `SELECT col FROM table` | INLINE | Lance | `DESCRIPTOR` | NULL | populated (Lance blob enc.) | No, descriptor visible; use `read_blob` for bytes |
+| `SELECT col FROM table` | OUT_OF_LINE | (any) | (irrelevant) | NULL | populated | No, must call `read_blob` |
+
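The `SELECT col` rows of the matrix can be read as a small decision function. The sketch below is only a model of the table in plain Python; `select_col_shape`, `BlobStructShape`, and `needs_read_blob` are hypothetical names for illustration and are not part of Hudi's API.

```python
from typing import NamedTuple


class BlobStructShape(NamedTuple):
    """Which Blob struct fields are populated for a direct column select."""
    data_populated: bool
    reference_populated: bool


def select_col_shape(storage: str, file_format: str,
                     inline_mode: str = "CONTENT") -> BlobStructShape:
    """Hypothetical helper mirroring the `SELECT col` rows of the matrix."""
    if storage == "OUT_OF_LINE":
        # The inline-mode setting is not consulted for out-of-line storage.
        return BlobStructShape(False, True)
    if storage != "INLINE":
        raise ValueError(f"unknown storage strategy: {storage}")
    if inline_mode == "CONTENT":
        # Inline bytes are eagerly materialized into `data`.
        return BlobStructShape(True, False)
    if inline_mode == "DESCRIPTOR":
        # Per the matrix, only Lance exposes a descriptor for inline payloads.
        return BlobStructShape(False, file_format == "Lance")
    raise ValueError(f"unknown inline mode: {inline_mode}")


def needs_read_blob(shape: BlobStructShape) -> bool:
    """Raw bytes are directly available only when `data` is populated."""
    return not shape.data_populated
```

Under this model, the only combination that leaves both fields empty is `INLINE` storage on Parquet with `DESCRIPTOR` mode, which is the row the matrix highlights in bold.
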
+**Why Parquet and Lance differ in `DESCRIPTOR` mode**
+
+Lance's native blob encoding already exposes a cheap `(file, offset, length)` descriptor for each blob, so `DESCRIPTOR` mode surfaces it directly in the `reference` field. In effect, INLINE blobs can be read through the same deferred-materialization path used for OUT_OF_LINE references. Parquet has no equivalent native descriptor for an inline byte array, so both fields are `NULL` in `DESCRIPTOR` mode and the caller must use `read_blob` to materialize bytes.
+
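To make the deferred-materialization idea concrete, here is a minimal sketch of how a `(file, offset, length)` descriptor could be resolved to bytes on demand. It assumes a local file and a made-up `Descriptor` type; it does not reflect Hudi's actual `reference` schema or Lance's reader internals.

```python
import os
import tempfile
from typing import NamedTuple


class Descriptor(NamedTuple):
    """Hypothetical (file, offset, length) reference; not Hudi's real schema."""
    file: str
    offset: int
    length: int


def materialize(desc: Descriptor) -> bytes:
    """Read exactly the byte range the descriptor points at."""
    with open(desc.file, "rb") as f:
        f.seek(desc.offset)
        payload = f.read(desc.length)
    if len(payload) != desc.length:
        raise IOError("descriptor points past the end of the file")
    return payload


def demo() -> bytes:
    # Write a container file with a blob embedded between other bytes.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"HEADER" + b"blob-bytes" + b"FOOTER")
        path = tmp.name
    try:
        # The blob starts after the 6-byte header and is 10 bytes long.
        return materialize(Descriptor(path, offset=6, length=10))
    finally:
        os.remove(path)
```

The point of the sketch is that a query can carry the small descriptor around and pay the I/O cost only for the rows whose bytes are actually needed.
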
+**Visual**
+
+```
+ ┌──────────────────────────────────────────────────────────────────┐
+ │ read_blob(col) ── universal, always materializes bytes ──│
+ │ │ │
+ │ ▼ │
+ │ ┌─────────────┐ INLINE ───► read inline payload │
+ │ │ Hudi reader │ ──┤ │
+ │ └─────────────┘ OUT_OF_LINE ► follow reference → read bytes │
+ └──────────────────────────────────────────────────────────────────┘
+
+ ┌──────────────────────────────────────────────────────────────────┐
+ │ SELECT col (returns Blob struct as-is) │
+ │ │ │
+ │ ▼ │
+ │ storage = OUT_OF_LINE ─────────────► data=NULL, reference=set │
+ │ │
+ │ storage = INLINE, │
+ │ inline.mode = CONTENT (default) ───► data=<bytes>, ref=NULL │
+ │ │
+ │ storage = INLINE, │
+ │ inline.mode = DESCRIPTOR │
+ │ ├─ Parquet ─────────────────────► data=NULL, ref=NULL │
+ │ └─ Lance ─────────────────────► data=NULL, ref=set │
+ └──────────────────────────────────────────────────────────────────┘
+```
+
### 3. Writer
#### Phase 1: External Blob Support
The writer will be updated to support writing blob data as out-of-line references.