rahil-c opened a new pull request, #18728:
URL: https://github.com/apache/hudi/pull/18728

   ## Summary
   
   Adds a new **Read Modes** subsection to RFC-100 that clarifies how blob 
columns are returned to the caller, so engineers and users can quickly tell 
which path materializes bytes vs. which returns a descriptor.
   
   Specifically, this PR documents:
   
   - `read_blob(<blob_column>)` is the **universal, canonical API** for 
materializing raw blob bytes. It always returns bytes regardless of:
     - Storage strategy (`INLINE` vs `OUT_OF_LINE`)
     - Base file format (Parquet, Lance, …)
     - Any reader-side config such as `hoodie.read.blob.inline.mode`
   - The existing reader config `hoodie.read.blob.inline.mode` (defined in 
`HoodieReaderConfig.java`, values `CONTENT` (default) | `DESCRIPTOR`) and how 
it interacts with `SELECT *`:
     - `CONTENT`: inline bytes are eagerly materialized into the struct's 
`data` field.
     - `DESCRIPTOR`: the engine returns an `OUT_OF_LINE`-shaped descriptor in 
`reference` where the underlying file format supports it (Lance today). For 
formats with no native inline descriptor (Parquet), both `data` and `reference` 
are `NULL` and the caller must use `read_blob`.
   - For `OUT_OF_LINE` storage, `hoodie.read.blob.inline.mode` is irrelevant: 
`SELECT *` always returns a populated `reference` and `NULL` `data`.
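
As a rough illustration of the rules listed above, the `SELECT *` behavior can be modeled as a small lookup. This is only a sketch, not Hudi code: `BlobStruct`, `select_star_shape`, and `DESCRIPTOR_CAPABLE_FORMATS` are hypothetical names, and two details are assumptions not stated in this description (that `reference` is `NULL` in `CONTENT` mode, and that `data` is `NULL` when Lance returns a descriptor).

```python
from typing import NamedTuple


class BlobStruct(NamedTuple):
    """Which fields of the blob struct are populated (True) or NULL (False)."""
    data: bool
    reference: bool


# Formats with a native inline descriptor. The PR description names only Lance.
DESCRIPTOR_CAPABLE_FORMATS = {"LANCE"}


def select_star_shape(storage: str, file_format: str, mode: str = "CONTENT") -> BlobStruct:
    """Model what `SELECT *` returns for a blob column (hypothetical helper)."""
    if storage == "OUT_OF_LINE":
        # The read mode does not apply: reference populated, data NULL.
        return BlobStruct(data=False, reference=True)
    if storage != "INLINE":
        raise ValueError(f"unknown storage strategy: {storage}")
    if mode == "CONTENT":
        # Inline bytes are eagerly materialized into `data`.
        # Assumption: `reference` stays NULL in this case.
        return BlobStruct(data=True, reference=False)
    if mode == "DESCRIPTOR":
        if file_format.upper() in DESCRIPTOR_CAPABLE_FORMATS:
            # OUT_OF_LINE-shaped descriptor; assumption: `data` is NULL.
            return BlobStruct(data=False, reference=True)
        # No native inline descriptor (Parquet): both fields are NULL,
        # so the caller must use read_blob to get bytes.
        return BlobStruct(data=False, reference=False)
    raise ValueError(f"unknown read mode: {mode}")


def must_use_read_blob(shape: BlobStruct) -> bool:
    """True when `SELECT *` yields neither bytes nor a descriptor."""
    return not shape.data and not shape.reference
```

Under these assumptions, the only combination where `SELECT *` yields neither bytes nor a descriptor is `INLINE` storage on Parquet in `DESCRIPTOR` mode, which is the case where `read_blob` is required rather than optional.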
   
   The new subsection includes a behavior matrix table and an ASCII diagram of 
the two read paths.
   
   ## Why
   
   RFC-100's current Reader section describes the high-level intent (lazy 
loading, `read_blob`) but does not specify what users see for each combination 
of storage strategy, file format, and read mode. The Parquet vs. Lance 
difference under `DESCRIPTOR` mode is especially non-obvious. This is a 
doc-only clarification — no code changes.
   
   ## Changes
   
   - `rfc/rfc-100/rfc-100.md`: new **Read Modes: `read_blob` vs. `SELECT *`** 
subsection inserted after the existing `read_blob` SQL example, before the "3. 
Writer" heading.
   
   ## Test plan
   
   - [ ] Render `rfc/rfc-100/rfc-100.md` on GitHub and confirm the new 
subsection appears with an aligned table and the ASCII diagram inside a fenced 
code block.
   - [ ] Cross-check that the documented config name, default value, and enum 
values match 
`hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java`
 (no drift between RFC and code).
   - [ ] Re-read the new section as a first-time reader: `read_blob` is 
unambiguously the "always works" path; the Parquet-vs-Lance difference in 
`DESCRIPTOR` mode is clear; the `OUT_OF_LINE` row makes clear that 
`hoodie.read.blob.inline.mode` does not apply.

