rahil-c opened a new pull request, #18728:
URL: https://github.com/apache/hudi/pull/18728
## Summary
Adds a new **Read Modes** subsection to RFC-100 that clarifies how blob
columns are returned to the caller, so engineers and users can quickly tell
which path materializes bytes vs. which returns a descriptor.
Specifically, this PR documents:
- `read_blob(<blob_column>)` is the **universal, canonical API** for
materializing raw blob bytes. It always returns bytes regardless of:
  - Storage strategy (`INLINE` vs. `OUT_OF_LINE`)
  - Base file format (Parquet, Lance, …)
  - Any reader-side config such as `hoodie.read.blob.inline.mode`
- The existing reader config `hoodie.read.blob.inline.mode` (defined in
`HoodieReaderConfig.java`, values `CONTENT` (default) | `DESCRIPTOR`) and how
it interacts with `SELECT *`:
  - `CONTENT`: inline bytes are eagerly materialized into the struct's
    `data` field.
  - `DESCRIPTOR`: engine returns an `OUT_OF_LINE`-shaped descriptor in
    `reference` where the underlying file format supports it (Lance today). For
    formats with no native inline descriptor (Parquet), both `data` and `reference`
    are `NULL` and the caller must use `read_blob`.
- For `OUT_OF_LINE` storage, the `hoodie.read.blob.inline.mode` config does not
apply: `SELECT *` always returns a populated `reference` and a `NULL` `data`.
The new subsection includes a behavior matrix table and an ASCII diagram of
the two read paths.
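
As a rough illustration of the rules summarized above, the `SELECT *` behavior can be modeled as a small lookup. This is a hypothetical Python sketch, not Hudi code: `select_star_shape`, `SelectStarShape`, `must_use_read_blob`, and `DESCRIPTOR_CAPABLE_FORMATS` are names invented for this example. It also assumes that `reference` is `NULL` in `CONTENT` mode and that `data` is `NULL` whenever a descriptor is returned; the summary does not state either explicitly.

```python
from typing import NamedTuple

# Formats assumed to expose a native inline descriptor; the summary names only Lance.
DESCRIPTOR_CAPABLE_FORMATS = {"LANCE"}


class SelectStarShape(NamedTuple):
    """Which fields of the blob struct are non-NULL after `SELECT *`."""
    data_populated: bool
    reference_populated: bool


def select_star_shape(storage: str, file_format: str,
                      inline_mode: str = "CONTENT") -> SelectStarShape:
    """Model which blob struct fields `SELECT *` populates (illustrative only)."""
    if storage == "OUT_OF_LINE":
        # The read mode does not apply: `reference` is populated, `data` is NULL.
        return SelectStarShape(data_populated=False, reference_populated=True)
    if storage != "INLINE":
        raise ValueError(f"unknown storage strategy: {storage}")
    if inline_mode == "CONTENT":
        # Inline bytes are eagerly materialized into `data`.
        # Assumption: `reference` stays NULL in this case.
        return SelectStarShape(data_populated=True, reference_populated=False)
    if inline_mode == "DESCRIPTOR":
        # Assumption: `data` is NULL whenever descriptor mode is requested.
        has_descriptor = file_format.upper() in DESCRIPTOR_CAPABLE_FORMATS
        return SelectStarShape(data_populated=False,
                               reference_populated=has_descriptor)
    raise ValueError(f"unknown inline mode: {inline_mode}")


def must_use_read_blob(shape: SelectStarShape) -> bool:
    """True when `SELECT *` yields neither bytes nor a descriptor."""
    return not shape.data_populated and not shape.reference_populated
```

Under these assumptions, `must_use_read_blob(select_star_shape("INLINE", "PARQUET", "DESCRIPTOR"))` is the only combination that evaluates to `True`, which is the Parquet case called out above.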
## Why
RFC-100's current Reader section describes the high-level intent (lazy
loading, `read_blob`) but does not specify what users see for each combination
of storage strategy, file format, and read mode. The Parquet vs. Lance
difference under `DESCRIPTOR` mode is especially non-obvious. This is a
doc-only clarification — no code changes.
## Changes
- `rfc/rfc-100/rfc-100.md`: new **Read Modes: `read_blob` vs. `SELECT *`**
subsection inserted after the existing `read_blob` SQL example, before the "3.
Writer" heading.
## Test plan
- [ ] Render `rfc/rfc-100/rfc-100.md` on GitHub and confirm the new
subsection appears with an aligned table and the ASCII diagram inside a fenced
code block.
- [ ] Cross-check that the documented config name, default value, and enum
values match
`hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java`
(no drift between RFC and code).
- [ ] Re-read the new section as a first-time reader: `read_blob` is
unambiguously the "always works" path; Parquet-vs-Lance difference in
`DESCRIPTOR` mode is clear; the `OUT_OF_LINE` row makes clear that
`hoodie.read.blob.inline.mode` does not apply.