yihua commented on code in PR #18683:
URL: https://github.com/apache/hudi/pull/18683#discussion_r3245554944
##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java:
##########
@@ -111,9 +111,21 @@ public class HoodieReaderConfig extends HoodieConfig {
.markAdvanced()
.sinceVersion("1.2.0")
.withValidValues(BLOB_INLINE_READ_MODE_CONTENT,
BLOB_INLINE_READ_MODE_DESCRIPTOR)
- .withDocumentation("How Hudi interprets INLINE BLOB values on read. "
- + "CONTENT (default) returns the raw inline bytes. "
- + "DESCRIPTOR returns an OUT_OF_LINE-shaped reference pointing at the backing "
- + "Lance file with the INLINE payload's position and size, so callers can defer "
- + "the byte read via read_blob().");
+ .withDocumentation("How Hudi interprets INLINE BLOB values on read for plain column access "
+ + "(e.g. SELECT *). "
+ + "CONTENT (default) returns the raw inline bytes in the data field. "
+ + "DESCRIPTOR suppresses the inline bytes (data field is null) so direct column reads "
+ + "avoid the I/O cost of materializing large binary payloads. "
+ + "For Lance files, the reference struct is populated with blob stream coordinates. "
+ + "For Parquet files, the data column is skipped via nested column projection and the "
+ + "reference struct is null. "
+ + "read_blob() is the canonical bytes-materializing API and always returns bytes "
+ + "regardless of this setting; under DESCRIPTOR mode the engine reads the data column "
+ + "only for the blob columns referenced by read_blob() in the query.");
+
+ // Internal-only key set by ReadBlobRule on a per-query HadoopFsRelation.options to instruct
+ // the reader to skip the DESCRIPTOR data-column strip for the listed blob columns, so that
+ // read_blob() sees the materialized bytes. Comma-separated top-level column names. Not user-facing.
+ public static final String BLOB_INLINE_READ_FORCE_CONTENT_COLUMNS =
+ "hoodie.internal.read.blob.inline.force.content.columns";
Review Comment:
Thus, all blob columns are read in `DESCRIPTOR` mode if no `read_blob(col)`
is used. All blob columns are read in `CONTENT` mode if any `read_blob(col)`
exists in the query.
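
For reference, here is a minimal sketch of the per-column resolution that the code comment on `BLOB_INLINE_READ_FORCE_CONTENT_COLUMNS` describes. It is not the reader code from this PR: the class `BlobReadModeSketch`, the `Mode` enum, and both method names are hypothetical, and it assumes the option value is a plain comma-separated list of top-level column names, as the comment states.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names exist in Hudi.
final class BlobReadModeSketch {

  enum Mode { CONTENT, DESCRIPTOR }

  // Parses the comma-separated column list; a null or blank value yields an empty set.
  static Set<String> parseForceContentColumns(String raw) {
    if (raw == null || raw.trim().isEmpty()) {
      return Collections.emptySet();
    }
    return Arrays.stream(raw.split(","))
        .map(String::trim)
        .filter(name -> !name.isEmpty())
        .collect(Collectors.toCollection(LinkedHashSet::new));
  }

  // A column stays in DESCRIPTOR mode only when the table-level setting is DESCRIPTOR
  // and the column is not in the force-content list (assumed per-column semantics).
  static Mode effectiveMode(String column, Mode configured, Set<String> forceContent) {
    if (configured == Mode.DESCRIPTOR && !forceContent.contains(column)) {
      return Mode.DESCRIPTOR;
    }
    return Mode.CONTENT;
  }
}
```

Under these assumptions, a query that calls `read_blob(image)` on a table with blob columns `image` and `thumb` would resolve `image` to `CONTENT` and leave `thumb` in `DESCRIPTOR`. Whether the reader in this PR actually resolves the mode per column or per query is worth confirming against the summary above.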