yihua commented on code in PR #18867:
URL: https://github.com/apache/hudi/pull/18867#discussion_r3314697674


##########
website/docs/blob_unstructured_data.md:
##########
@@ -290,15 +292,44 @@ Out-of-line BLOBs keep the Hudi table footprint extremely small:
 | Property | Default | Description |
 |:---------|:--------|:------------|
-| `hoodie.read.blob.inline.mode` | `CONTENT` | Controls how INLINE BLOBs are read. `CONTENT` materializes raw bytes in the `data` column. `DESCRIPTOR` surfaces `(position, size)` coordinates rewritten as OUT_OF_LINE references. |
+| `hoodie.read.blob.inline.mode` | `DESCRIPTOR` | Controls how INLINE BLOBs are read. `DESCRIPTOR` (default) returns an out-of-line-shaped reference pointing at the in-file coordinates of the bytes — no bytes are materialized. `CONTENT` materializes the raw inline bytes directly in the `data` field on every read. |
 | `hoodie.blob.batching.max.gap.bytes` | `4096` | Maximum gap (in bytes) between consecutive byte ranges before they are merged into a single read. Larger values reduce I/O calls at the cost of reading some unused bytes. |
 | `hoodie.blob.batching.lookahead.size` | `50` | Number of rows to buffer for batch read detection. Larger values improve batching for sorted data but increase memory usage. |
 
 :::note
-DESCRIPTOR mode is only supported on Lance-backed tables. CONTENT mode is always used for internal
-operations (compaction, merge, log replay) regardless of this setting.
+`DESCRIPTOR` mode is the default for all storage formats including Lance. `CONTENT` mode is always
+used for internal operations (compaction, merge, log replay) regardless of this setting.
 :::
 
+:::caution Calling read_blob() on INLINE columns under DESCRIPTOR mode
+Under the default `DESCRIPTOR` mode, calling `read_blob()` on an INLINE BLOB column returns a
+descriptor reference rather than the raw bytes — it does **not** materialize the content. To read
+inline bytes with `read_blob()`, set `hoodie.read.blob.inline.mode=CONTENT`:
+
+```sql
+SET hoodie.read.blob.inline.mode=CONTENT;
+SELECT asset_id, read_blob(content) AS raw_bytes
+FROM media_assets
+WHERE asset_id = 'asset_001';
+```
+
+This setting affects only INLINE columns — OUT_OF_LINE columns always fetch from the external path
+regardless of mode.
+:::

Review Comment:
   Revisit this part for clarity.



##########
website/docs/blob_unstructured_data.md:
##########
@@ -290,15 +292,44 @@ Out-of-line BLOBs keep the Hudi table footprint extremely small:
 | Property | Default | Description |
 |:---------|:--------|:------------|
-| `hoodie.read.blob.inline.mode` | `CONTENT` | Controls how INLINE BLOBs are read. `CONTENT` materializes raw bytes in the `data` column. `DESCRIPTOR` surfaces `(position, size)` coordinates rewritten as OUT_OF_LINE references. |
+| `hoodie.read.blob.inline.mode` | `DESCRIPTOR` | Controls how INLINE BLOBs are read. `DESCRIPTOR` (default) returns an out-of-line-shaped reference pointing at the in-file coordinates of the bytes — no bytes are materialized. `CONTENT` materializes the raw inline bytes directly in the `data` field on every read. |

Review Comment:
   Reminder to add nuances around file format support (Parquet vs. Lance, plus any usability notes).



##########
website/docs/blob_unstructured_data.md:
##########
@@ -290,15 +292,44 @@ Out-of-line BLOBs keep the Hudi table footprint extremely small:
 | Property | Default | Description |
 |:---------|:--------|:------------|
-| `hoodie.read.blob.inline.mode` | `CONTENT` | Controls how INLINE BLOBs are read. `CONTENT` materializes raw bytes in the `data` column. `DESCRIPTOR` surfaces `(position, size)` coordinates rewritten as OUT_OF_LINE references. |
+| `hoodie.read.blob.inline.mode` | `DESCRIPTOR` | Controls how INLINE BLOBs are read. `DESCRIPTOR` (default) returns an out-of-line-shaped reference pointing at the in-file coordinates of the bytes — no bytes are materialized. `CONTENT` materializes the raw inline bytes directly in the `data` field on every read. |

Review Comment:
   This is a new config added in release 1.2.0, so there is no need to call out an upgrade.
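
For context on the `hoodie.blob.batching.max.gap.bytes` row quoted in the first hunk: the description says nearby byte ranges are merged into a single read when the gap between them is small enough. The following is a minimal Python sketch of that idea, not Hudi's actual implementation. The function name `coalesce_ranges`, the `(offset, length)` tuple shape, and the inclusive `<=` comparison against the gap are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation of a blob read request: (offset, length) in bytes.
Range = Tuple[int, int]


def coalesce_ranges(ranges: List[Range], max_gap_bytes: int = 4096) -> List[Range]:
    """Merge byte ranges whose gap is at most max_gap_bytes (assumed inclusive)."""
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            # A negative gap means the ranges overlap, which also merges.
            if offset - prev_end <= max_gap_bytes:
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


if __name__ == "__main__":
    requests = [(0, 100), (150, 50), (10_000, 10)]
    # The first two ranges are 50 bytes apart and merge; the third is too far away.
    print(coalesce_ranges(requests, max_gap_bytes=4096))
```

Under these assumptions, the merged read covers the 50 unused bytes between the first two ranges, which is the trade-off the table describes: fewer I/O calls in exchange for reading some bytes that are then discarded.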
