rahil-c commented on code in PR #18867: URL: https://github.com/apache/hudi/pull/18867#discussion_r3318999621
##########
website/docs/blob_unstructured_data.md:
##########
@@ -290,15 +292,44 @@ Out-of-line BLOBs keep the Hudi table footprint extremely small:
 | Property | Default | Description |
 |:---------|:--------|:------------|
-| `hoodie.read.blob.inline.mode` | `CONTENT` | Controls how INLINE BLOBs are read. `CONTENT` materializes raw bytes in the `data` column. `DESCRIPTOR` surfaces `(position, size)` coordinates rewritten as OUT_OF_LINE references. |
+| `hoodie.read.blob.inline.mode` | `DESCRIPTOR` | Controls how INLINE BLOBs are read. `DESCRIPTOR` (default) returns an out-of-line-shaped reference pointing at the in-file coordinates of the bytes — no bytes are materialized. `CONTENT` materializes the raw inline bytes directly in the `data` field on every read. |
 | `hoodie.blob.batching.max.gap.bytes` | `4096` | Maximum gap (in bytes) between consecutive byte ranges before they are merged into a single read. Larger values reduce I/O calls at the cost of reading some unused bytes. |
 | `hoodie.blob.batching.lookahead.size` | `50` | Number of rows to buffer for batch read detection. Larger values improve batching for sorted data but increase memory usage. |
 
 :::note
-DESCRIPTOR mode is only supported on Lance-backed tables. CONTENT mode is always used for internal
-operations (compaction, merge, log replay) regardless of this setting.
+`DESCRIPTOR` mode is the default for all storage formats including Lance. `CONTENT` mode is always
+used for internal operations (compaction, merge, log replay) regardless of this setting.
 :::
 
+:::caution Calling read_blob() on INLINE columns under DESCRIPTOR mode
+Under the default `DESCRIPTOR` mode, calling `read_blob()` on an INLINE BLOB column returns a
+descriptor reference rather than the raw bytes — it does **not** materialize the content. To read
+inline bytes with `read_blob()`, set `hoodie.read.blob.inline.mode=CONTENT`:

Review Comment:
Yes, this is an important callout to make: for now, `read_blob()` only works in `CONTENT` mode.
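
For readers of the table in this hunk, the `hoodie.blob.batching.max.gap.bytes` description can be pictured with a small sketch. This is not Hudi's implementation. It assumes ranges are `(offset, length)` pairs, that two ranges are coalesced when the gap between them is at most the threshold (the inclusive boundary is a guess), and the name `merge_ranges` is hypothetical. The row lookahead window from `hoodie.blob.batching.lookahead.size` is not modeled.

```python
from typing import List, Tuple

# Hypothetical representation of a blob read: (offset, length) in bytes.
ByteRange = Tuple[int, int]


def merge_ranges(ranges: List[ByteRange], max_gap_bytes: int = 4096) -> List[ByteRange]:
    """Coalesce byte ranges whose gap is at most max_gap_bytes.

    Illustrative only: assumes an inclusive threshold and sorts the input
    by offset before merging.
    """
    merged: List[ByteRange] = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            # A negative gap means the ranges overlap; those are merged too.
            if offset - prev_end <= max_gap_bytes:
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


if __name__ == "__main__":
    reads = [(0, 100), (150, 50), (10_000, 10)]
    # The first two ranges are 50 bytes apart and collapse into one read;
    # the third is 9800 bytes further on and stays separate.
    print(merge_ranges(reads))
```

Under these assumptions a larger threshold yields fewer reads, each of which may cover bytes nobody asked for, which is the trade-off the property description states.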
