Re: [PR] fix: Enable schema merging for Parquet and ORC in S3/GCS incremental sources [hudi]

via GitHub Tue, 05 May 2026 10:37:43 -0700


hudi-agent commented on code in PR #18385:
URL: https://github.com/apache/hudi/pull/18385#discussion_r3190391520



##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##########
@@ -41,9 +41,13 @@ public class ParquetDFSSourceConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = 
ConfigProperty
       .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.merge_schema.enable")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   🤖 Just to confirm — flipping this default from `false` to `true` also 
affects the non-cloud `ParquetDFSSource` path (this config is used there too), 
not only S3/GCS incremental sources discussed in the PR description. Is the 
broader behavior change for plain `ParquetDFSSource` users intentional? They'll 
now incur footer reads across all files in a batch and may see different 
resolved schemas than before. If yes, worth calling out in the release notes; 
if not, you may want to keep the DFS default at `false` and only flip the 
cloud-incremental key.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ORCDFSSource.java:
##########
@@ -53,6 +56,7 @@ public Pair<Option<Dataset<Row>>, Checkpoint> 
fetchNextBatch(Option<Checkpoint>
   }
 
   private Dataset<Row> fromFiles(String pathStr) {
-    return sparkSession.read().orc(pathStr.split(","));
+    boolean mergeSchemaOption = getBooleanWithAltKeys(this.props, 
ORCDFSSourceConfig.ORC_DFS_MERGE_SCHEMA);

Review Comment:
   🤖 nit: `mergeSchemaOption` is a `boolean`, but `Option` is a Hudi type 
elsewhere in this codebase — could you rename to `mergeSchemaEnabled` (or just 
inline it) to avoid the visual clash?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/CloudSourceConfig.java:
##########
@@ -181,4 +181,22 @@ public class CloudSourceConfig extends HoodieConfig {
       .markAdvanced()
       .sinceVersion("1.0.0")
       .withDocumentation("Boolean value to allow coalesce alias columns with 
actual columns while reading from source");
+
+  public static final ConfigProperty<Boolean> CLOUD_INCREMENTAL_MERGE_SCHEMA = 
ConfigProperty
+      .key(STREAMER_CONFIG_PREFIX + "source.cloud.data.merge.schema")
+      .defaultValue(true)

Review Comment:
   🤖 nit: this key is `source.cloud.data.merge.schema`, but the sibling DFS 
configs use `source.{parquet,orc}.dfs.merge_schema.enable`. Could we align them 
(e.g. `source.cloud.data.merge_schema.enable`) so future readers don't have to 
remember two patterns for the same concept?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix: Enable schema merging for Parquet and ORC in S3/GCS incremental sources [hudi]

Reply via email to