Re: [PR] feat: Add Parquet DESCRIPTOR mode for blob inline reading [hudi]

via GitHub Thu, 14 May 2026 19:55:36 -0700


yihua commented on code in PR #18683:
URL: https://github.com/apache/hudi/pull/18683#discussion_r3245554944



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java:
##########
@@ -111,9 +111,21 @@ public class HoodieReaderConfig extends HoodieConfig {
       .markAdvanced()
       .sinceVersion("1.2.0")
       .withValidValues(BLOB_INLINE_READ_MODE_CONTENT, 
BLOB_INLINE_READ_MODE_DESCRIPTOR)
-      .withDocumentation("How Hudi interprets INLINE BLOB values on read. "
-          + "CONTENT (default) returns the raw inline bytes. "
-          + "DESCRIPTOR returns an OUT_OF_LINE-shaped reference pointing at 
the backing "
-          + "Lance file with the INLINE payload's position and size, so 
callers can defer "
-          + "the byte read via read_blob().");
+      .withDocumentation("How Hudi interprets INLINE BLOB values on read for 
plain column access "
+          + "(e.g. SELECT *). "
+          + "CONTENT (default) returns the raw inline bytes in the data field. 
"
+          + "DESCRIPTOR suppresses the inline bytes (data field is null) so 
direct column reads "
+          + "avoid the I/O cost of materializing large binary payloads. "
+          + "For Lance files, the reference struct is populated with blob 
stream coordinates. "
+          + "For Parquet files, the data column is skipped via nested column 
projection and the "
+          + "reference struct is null. "
+          + "read_blob() is the canonical bytes-materializing API and always 
returns bytes "
+          + "regardless of this setting; under DESCRIPTOR mode the engine 
reads the data column "
+          + "only for the blob columns referenced by read_blob() in the 
query.");
+
+  // Internal-only key set by ReadBlobRule on a per-query 
HadoopFsRelation.options to instruct
+  // the reader to skip the DESCRIPTOR data-column strip for the listed blob 
columns, so that
+  // read_blob() sees the materialized bytes. Comma-separated top-level column 
names. Not user-facing.
+  public static final String BLOB_INLINE_READ_FORCE_CONTENT_COLUMNS =
+      "hoodie.internal.read.blob.inline.force.content.columns";

Review Comment:
   Thus, all blob columns are read in `DESCRIPTOR` mode if no `read_blob(col)` 
is used.  All blob columns are read in `CONTENT` mode if any `read_blob(col)` 
exists in the query.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat: Add Parquet DESCRIPTOR mode for blob inline reading [hudi]

Reply via email to