yihua commented on code in PR #18683:
URL: https://github.com/apache/hudi/pull/18683#discussion_r3244172210


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java:
##########
@@ -111,9 +111,21 @@ public class HoodieReaderConfig extends HoodieConfig {
       .markAdvanced()
       .sinceVersion("1.2.0")
       .withValidValues(BLOB_INLINE_READ_MODE_CONTENT, 
BLOB_INLINE_READ_MODE_DESCRIPTOR)
-      .withDocumentation("How Hudi interprets INLINE BLOB values on read. "
-          + "CONTENT (default) returns the raw inline bytes. "
-          + "DESCRIPTOR returns an OUT_OF_LINE-shaped reference pointing at 
the backing "
-          + "Lance file with the INLINE payload's position and size, so 
callers can defer "
-          + "the byte read via read_blob().");
+      .withDocumentation("How Hudi interprets INLINE BLOB values on read for 
plain column access "
+          + "(e.g. SELECT *). "

Review Comment:
   ```suggestion
             + "(e.g. SELECT blob_column FROM table). "
   ```



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java:
##########
@@ -111,9 +111,21 @@ public class HoodieReaderConfig extends HoodieConfig {
       .markAdvanced()
       .sinceVersion("1.2.0")
       .withValidValues(BLOB_INLINE_READ_MODE_CONTENT, 
BLOB_INLINE_READ_MODE_DESCRIPTOR)
-      .withDocumentation("How Hudi interprets INLINE BLOB values on read. "
-          + "CONTENT (default) returns the raw inline bytes. "
-          + "DESCRIPTOR returns an OUT_OF_LINE-shaped reference pointing at 
the backing "
-          + "Lance file with the INLINE payload's position and size, so 
callers can defer "
-          + "the byte read via read_blob().");
+      .withDocumentation("How Hudi interprets INLINE BLOB values on read for 
plain column access "
+          + "(e.g. SELECT *). "
+          + "CONTENT (default) returns the raw inline bytes in the data field. 
"
+          + "DESCRIPTOR suppresses the inline bytes (data field is null) so 
direct column reads "
+          + "avoid the I/O cost of materializing large binary payloads. "
+          + "For Lance files, the reference struct is populated with blob 
stream coordinates. "
+          + "For Parquet files, the data column is skipped via nested column 
projection and the "
+          + "reference struct is null. "
+          + "read_blob() is the canonical bytes-materializing API and always 
returns bytes "
+          + "regardless of this setting; under DESCRIPTOR mode the engine 
reads the data column "
+          + "only for the blob columns referenced by read_blob() in the 
query.");
+
+  // Internal-only key set by ReadBlobRule on a per-query 
HadoopFsRelation.options to instruct
+  // the reader to skip the DESCRIPTOR data-column strip for the listed blob 
columns, so that
+  // read_blob() sees the materialized bytes. Comma-separated top-level column 
names. Not user-facing.
+  public static final String BLOB_INLINE_READ_FORCE_CONTENT_COLUMNS =
+      "hoodie.internal.read.blob.inline.force.content.columns";

Review Comment:
   My understanding is that this internal config intends to differentiate the 
blob column reading of `read_blob(col1)` vs `col2` in `SELECT read_blob(col1), 
col2 FROM table` so that col1 is read out with CONTENT and col2 is read out 
with DESCRIPTOR.  However, Lance reader can only be configured with `CONTENT` 
or `DESCRIPTOR` mode for all blob columns.  Should we revisit the expected 
behavior of `hoodie.read.blob.inline.mode`?
   
   To simplify the experience, it makes sense for user to only have 
`read_blob(col)` or `col` in the same query where `col` is a blob column.  If a 
mix of both exist, all blob columns are read out with content.



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieReaderConfig.java:
##########
@@ -111,9 +111,21 @@ public class HoodieReaderConfig extends HoodieConfig {
       .markAdvanced()
       .sinceVersion("1.2.0")
       .withValidValues(BLOB_INLINE_READ_MODE_CONTENT, 
BLOB_INLINE_READ_MODE_DESCRIPTOR)
-      .withDocumentation("How Hudi interprets INLINE BLOB values on read. "
-          + "CONTENT (default) returns the raw inline bytes. "
-          + "DESCRIPTOR returns an OUT_OF_LINE-shaped reference pointing at 
the backing "
-          + "Lance file with the INLINE payload's position and size, so 
callers can defer "
-          + "the byte read via read_blob().");
+      .withDocumentation("How Hudi interprets INLINE BLOB values on read for 
plain column access "
+          + "(e.g. SELECT *). "
+          + "CONTENT (default) returns the raw inline bytes in the data field. 
"

Review Comment:
   We should flip the default to `DESCRIPTOR` and let all table services to 
still use `CONTENT` mode.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to