yihua commented on code in PR #18867:
URL: https://github.com/apache/hudi/pull/18867#discussion_r3314697674


##########
website/docs/blob_unstructured_data.md:
##########
@@ -290,15 +292,44 @@ Out-of-line BLOBs keep the Hudi table footprint extremely 
small:
 
 | Property | Default | Description |
 |:---------|:--------|:------------|
-| `hoodie.read.blob.inline.mode` | `CONTENT` | Controls how INLINE BLOBs are 
read. `CONTENT` materializes raw bytes in the `data` column. `DESCRIPTOR` 
surfaces `(position, size)` coordinates rewritten as OUT_OF_LINE references. |
+| `hoodie.read.blob.inline.mode` | `DESCRIPTOR` | Controls how INLINE BLOBs 
are read. `DESCRIPTOR` (default) returns an out-of-line-shaped reference 
pointing at the in-file coordinates of the bytes — no bytes are materialized. 
`CONTENT` materializes the raw inline bytes directly in the `data` field on 
every read. |
 | `hoodie.blob.batching.max.gap.bytes` | `4096` | Maximum gap (in bytes) 
between consecutive byte ranges before they are merged into a single read. 
Larger values reduce I/O calls at the cost of reading some unused bytes. |
 | `hoodie.blob.batching.lookahead.size` | `50` | Number of rows to buffer for 
batch read detection. Larger values improve batching for sorted data but 
increase memory usage. |
 
 :::note
-DESCRIPTOR mode is only supported on Lance-backed tables. CONTENT mode is 
always used for internal
-operations (compaction, merge, log replay) regardless of this setting.
+`DESCRIPTOR` mode is the default for all storage formats including Lance. 
`CONTENT` mode is always
+used for internal operations (compaction, merge, log replay) regardless of 
this setting.
 :::
 
+:::caution Calling read_blob() on INLINE columns under DESCRIPTOR mode
+Under the default `DESCRIPTOR` mode, calling `read_blob()` on an INLINE BLOB 
column returns a
+descriptor reference rather than the raw bytes — it does **not** materialize 
the content. To read
+inline bytes with `read_blob()`, set `hoodie.read.blob.inline.mode=CONTENT`:
+
+```sql
+SET hoodie.read.blob.inline.mode=CONTENT;
+SELECT asset_id, read_blob(content) AS raw_bytes
+FROM media_assets
+WHERE asset_id = 'asset_001';
+```
+
+This setting affects only INLINE columns — OUT_OF_LINE columns always fetch 
from the external path
+regardless of mode.
+:::

Review Comment:
   Revisit this part for clarity
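
On the batching rows in the same hunk, a small worked example might help readers see what `hoodie.blob.batching.max.gap.bytes` trades off. The sketch below is only an illustration of gap-based range coalescing, not Hudi's actual reader code. The `merge_ranges` helper and the `(position, size)` tuple shape are hypothetical, and treating a gap exactly equal to the threshold as mergeable is an assumption.

```python
from typing import List, Tuple

# Hypothetical representation of one blob read: (position, size) in bytes.
ByteRange = Tuple[int, int]


def merge_ranges(ranges: List[ByteRange], max_gap_bytes: int = 4096) -> List[ByteRange]:
    """Coalesce byte ranges whose gap is at most max_gap_bytes.

    Assumption: a gap equal to the threshold is still merged.
    """
    merged: List[ByteRange] = []
    for position, size in sorted(ranges):
        if merged:
            last_pos, last_size = merged[-1]
            last_end = last_pos + last_size
            # Gap between the end of the previous read and the start of this one.
            if position - last_end <= max_gap_bytes:
                new_end = max(last_end, position + size)
                merged[-1] = (last_pos, new_end - last_pos)
                continue
        merged.append((position, size))
    return merged


if __name__ == "__main__":
    reads = [(0, 100), (150, 50), (10_000, 10)]
    # The first two reads are 50 bytes apart and collapse into one;
    # the third is too far away and stays separate.
    print(merge_ranges(reads))
```

With the default of `4096`, the first two ranges become a single 200-byte read that includes 50 unused bytes, which is the I/O-versus-wasted-bytes trade-off the table describes.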



##########
website/docs/blob_unstructured_data.md:
##########
@@ -290,15 +292,44 @@ Out-of-line BLOBs keep the Hudi table footprint extremely 
small:
 
 | Property | Default | Description |
 |:---------|:--------|:------------|
-| `hoodie.read.blob.inline.mode` | `CONTENT` | Controls how INLINE BLOBs are 
read. `CONTENT` materializes raw bytes in the `data` column. `DESCRIPTOR` 
surfaces `(position, size)` coordinates rewritten as OUT_OF_LINE references. |
+| `hoodie.read.blob.inline.mode` | `DESCRIPTOR` | Controls how INLINE BLOBs 
are read. `DESCRIPTOR` (default) returns an out-of-line-shaped reference 
pointing at the in-file coordinates of the bytes — no bytes are materialized. 
`CONTENT` materializes the raw inline bytes directly in the `data` field on 
every read. |

Review Comment:
   Reminder to add nuances around file format support (Parquet vs Lance, any 
usability notes)



##########
website/docs/blob_unstructured_data.md:
##########
@@ -290,15 +292,44 @@ Out-of-line BLOBs keep the Hudi table footprint extremely 
small:
 
 | Property | Default | Description |
 |:---------|:--------|:------------|
-| `hoodie.read.blob.inline.mode` | `CONTENT` | Controls how INLINE BLOBs are 
read. `CONTENT` materializes raw bytes in the `data` column. `DESCRIPTOR` 
surfaces `(position, size)` coordinates rewritten as OUT_OF_LINE references. |
+| `hoodie.read.blob.inline.mode` | `DESCRIPTOR` | Controls how INLINE BLOBs 
are read. `DESCRIPTOR` (default) returns an out-of-line-shaped reference 
pointing at the in-file coordinates of the bytes — no bytes are materialized. 
`CONTENT` materializes the raw inline bytes directly in the `data` field on 
every read. |

Review Comment:
   This config is new in release 1.2.0, so there is no need to call out an 
upgrade.


