hudi-agent commented on code in PR #18385:
URL: https://github.com/apache/hudi/pull/18385#discussion_r3190391520
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##########
@@ -41,9 +41,13 @@ public class ParquetDFSSourceConfig extends HoodieConfig {
public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA =
ConfigProperty
.key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.merge_schema.enable")
- .defaultValue(false)
+ .defaultValue(true)
Review Comment:
🤖 Just to confirm — flipping this default from `false` to `true` also
affects the non-cloud `ParquetDFSSource` path (this config is used there too),
not only S3/GCS incremental sources discussed in the PR description. Is the
broader behavior change for plain `ParquetDFSSource` users intentional? They'll
now incur footer reads across all files in a batch and may see different
resolved schemas than before. If yes, worth calling out in the release notes;
if not, you may want to keep the DFS default at `false` and only flip the
cloud-incremental key.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ORCDFSSource.java:
##########
@@ -53,6 +56,7 @@ public Pair<Option<Dataset<Row>>, Checkpoint>
fetchNextBatch(Option<Checkpoint>
}
private Dataset<Row> fromFiles(String pathStr) {
- return sparkSession.read().orc(pathStr.split(","));
+ boolean mergeSchemaOption = getBooleanWithAltKeys(this.props,
ORCDFSSourceConfig.ORC_DFS_MERGE_SCHEMA);
Review Comment:
🤖 nit: `mergeSchemaOption` is a `boolean`, but `Option` is a Hudi type
elsewhere in this codebase — could you rename to `mergeSchemaEnabled` (or just
inline it) to avoid the visual clash?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/CloudSourceConfig.java:
##########
@@ -181,4 +181,22 @@ public class CloudSourceConfig extends HoodieConfig {
.markAdvanced()
.sinceVersion("1.0.0")
.withDocumentation("Boolean value to allow coalesce alias columns with
actual columns while reading from source");
+
+ public static final ConfigProperty<Boolean> CLOUD_INCREMENTAL_MERGE_SCHEMA =
ConfigProperty
+ .key(STREAMER_CONFIG_PREFIX + "source.cloud.data.merge.schema")
+ .defaultValue(true)
Review Comment:
🤖 nit: this key is `source.cloud.data.merge.schema`, but the sibling DFS
configs use `source.{parquet,orc}.dfs.merge_schema.enable`. Could we align them
(e.g. `source.cloud.data.merge_schema.enable`) so future readers don't have to
remember two patterns for the same concept?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]