Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub


rohitmittapalli commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454158271


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+    groupName = ConfigGroups.Names.HUDI_STREAMER,
+    subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+    description = "Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+  public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+      .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+      .defaultValue(true)

Review Comment:
   Fine by me! I'll set it to false by default then.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub


xushiyan commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454154841


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+    groupName = ConfigGroups.Names.HUDI_STREAMER,
+    subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+    description = "Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+  public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+      .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+      .defaultValue(true)

Review Comment:
   ![Screenshot 2024-01-16 at 4 38 21 PM](https://github.com/apache/hudi/assets/2701446/9c6730f8-e9f1-41ab-988c-f6242ec8e523)
   
   I did a quick check of the doc, and the default there is false. Setting this to true will introduce behavior changes. We should keep it backward compatible (BWC) in pre-1.0 releases.






Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub


yihua commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454147802


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+    groupName = ConfigGroups.Names.HUDI_STREAMER,
+    subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+    description = "Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+  public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+      .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")

Review Comment:
   Avoid camelCase in the config naming. Use `.enable_merge_schema` instead.



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+    groupName = ConfigGroups.Names.HUDI_STREAMER,
+    subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+    description = "Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+  public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+      .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+      .defaultValue(true)
+      .withAlternatives(DELTA_STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+      .markAdvanced()

Review Comment:
   Add `sinceVersion("1.0.0")`.
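
The `withAlternatives` call in the hunk above registers the older `DELTA_STREAMER_CONFIG_PREFIX` key as a fallback for the new one. As a rough sketch of the lookup order this implies (primary key first, then alternatives, then the default), here is a plain-Java model with no Hudi dependency. The class `AlternativeKeySketch`, the `resolve` helper, and the literal key strings are hypothetical and exist only for illustration; the prefix values in particular are assumptions, and this is not Hudi's actual implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Illustrative only: models "key, then alternatives, then default" resolution.
final class AlternativeKeySketch {

  // Hypothetical key strings; the real values come from the config class.
  static final String KEY = "hoodie.streamer.source.parquet.dfs.mergeSchema";
  static final String LEGACY_KEY = "hoodie.deltastreamer.source.parquet.dfs.mergeSchema";

  static boolean resolve(Properties props, String key, List<String> alternatives, boolean defaultValue) {
    // The primary key wins if it is set.
    String raw = props.getProperty(key);
    if (raw == null) {
      // Otherwise fall back to the first alternative key that is set.
      for (String alternative : alternatives) {
        raw = props.getProperty(alternative);
        if (raw != null) {
          break;
        }
      }
    }
    // Nothing set anywhere: use the declared default.
    return raw == null ? defaultValue : Boolean.parseBoolean(raw);
  }

  public static void main(String[] args) {
    List<String> alternatives = Arrays.asList(LEGACY_KEY);
    Properties props = new Properties();
    System.out.println(resolve(props, KEY, alternatives, false)); // false (default)
    props.setProperty(LEGACY_KEY, "true");
    System.out.println(resolve(props, KEY, alternatives, false)); // true (legacy key)
    props.setProperty(KEY, "false");
    System.out.println(resolve(props, KEY, alternatives, false)); // false (primary key wins)
  }
}
```

With a lookup like this, a pipeline that still uses the older prefix keeps working after the rename, which is presumably why the alternative key is declared at all.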






Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub


rohitmittapalli commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1894619834

   > @rohitmittapalli can you also file a jira and update the title with the jira id pls?
   
   I have requested a JIRA account; I am unable to file until that gets approved.





Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub


rohitmittapalli commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454133265


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+    groupName = ConfigGroups.Names.HUDI_STREAMER,
+    subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+    description = "Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+  public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+      .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+      .defaultValue(true)

Review Comment:
   I've set the default to true as per @nsivabalan's request here: https://github.com/apache/hudi/pull/10199#discussion_r1408722685
   
   Essentially, the key difference is that the schema will now be merged across all the parquet files in the commit. In the past, the schema would be inherited from the first file in the commit. In my opinion, merging should be the default case.
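
To make the difference described above concrete, here is a toy sketch in plain Java (no Spark involved). It models each file's schema as an ordered map from column name to type name and contrasts "take the first file's schema" with "union the columns of all files". The class `MergeSchemaSketch` and both helper methods are hypothetical names for illustration. Spark's real schema merging has more rules, for example around compatible types; this sketch simply rejects any type conflict.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: a "schema" is an ordered map of column name -> type name.
final class MergeSchemaSketch {

  // Behaviour without merging, as described above: the first file's schema wins.
  static Map<String, String> firstFileSchema(List<Map<String, String>> fileSchemas) {
    return new LinkedHashMap<>(fileSchemas.get(0));
  }

  // Behaviour with merging: union of columns across all files, in first-seen order.
  static Map<String, String> mergedSchema(List<Map<String, String>> fileSchemas) {
    Map<String, String> merged = new LinkedHashMap<>();
    for (Map<String, String> schema : fileSchemas) {
      for (Map.Entry<String, String> column : schema.entrySet()) {
        String previous = merged.putIfAbsent(column.getKey(), column.getValue());
        // Simplification: any disagreement on a column's type is treated as an error.
        if (previous != null && !previous.equals(column.getValue())) {
          throw new IllegalStateException("Conflicting types for column " + column.getKey());
        }
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> olderFile = new LinkedHashMap<>();
    olderFile.put("id", "long");
    olderFile.put("name", "string");

    // A newer file in the same commit that carries one extra column.
    Map<String, String> newerFile = new LinkedHashMap<>(olderFile);
    newerFile.put("email", "string");

    List<Map<String, String>> commit = new ArrayList<>();
    commit.add(olderFile);
    commit.add(newerFile);

    System.out.println(firstFileSchema(commit).keySet()); // [id, name]
    System.out.println(mergedSchema(commit).keySet());    // [id, name, email]
  }
}
```

In this model the `email` column is silently dropped without merging, which is the kind of behavior difference the default value decides for existing pipelines.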






Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub


xushiyan commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1894614464

   @rohitmittapalli can you also file a Jira and update the title with the Jira ID, please?





Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-16 Thread via GitHub


xushiyan commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1454129825


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/ParquetDFSSourceConfig.java:
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.config;
+
+import org.apache.hudi.common.config.ConfigClassProperty;
+import org.apache.hudi.common.config.ConfigGroups;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.HoodieConfig;
+
+import javax.annotation.concurrent.Immutable;
+
+import static org.apache.hudi.common.util.ConfigUtils.DELTA_STREAMER_CONFIG_PREFIX;
+import static org.apache.hudi.common.util.ConfigUtils.STREAMER_CONFIG_PREFIX;
+
+/**
+ * Parquet DFS Source Configs
+ */
+@Immutable
+@ConfigClassProperty(name = "Parquet DFS Source Configs",
+    groupName = ConfigGroups.Names.HUDI_STREAMER,
+    subGroupName = ConfigGroups.SubGroupNames.DELTA_STREAMER_SOURCE,
+    description = "Configurations controlling the behavior of Parquet DFS source in Hudi Streamer.")
+public class ParquetDFSSourceConfig extends HoodieConfig {
+
+  public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
+      .key(STREAMER_CONFIG_PREFIX + "source.parquet.dfs.mergeSchema")
+      .defaultValue(true)

Review Comment:
   Can you clarify what impact setting this default to true has on existing pipelines that use this DFS source? Should it be false by default to stay compatible?






Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-08 Thread via GitHub


hudi-bot commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1882481681

   
   ## CI report:
   
   * 9c61cc3b1ff124314bb7cacb82bb141762678d54 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21878)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-08 Thread via GitHub


hudi-bot commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1882430073

   
   ## CI report:
   
   *  Unknown: [CANCELED](TBD) 
   * 9c61cc3b1ff124314bb7cacb82bb141762678d54 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21878)
   
   



Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-08 Thread via GitHub


hudi-bot commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1882355671

   
   ## CI report:
   
   * 378a6a619dc288301c70275483bbb0ecfa73a7f1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21875)
   *  Unknown: [CANCELED](TBD) 
   * 9c61cc3b1ff124314bb7cacb82bb141762678d54 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21878)
   
   



Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-08 Thread via GitHub


hudi-bot commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1882339654

   
   ## CI report:
   
   * 378a6a619dc288301c70275483bbb0ecfa73a7f1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21875)
   *  Unknown: [CANCELED](TBD) 
   * 9c61cc3b1ff124314bb7cacb82bb141762678d54 UNKNOWN
   
   



Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-08 Thread via GitHub


rohitmittapalli commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1882325117

   @hudi-bot run azure





Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-08 Thread via GitHub


hudi-bot commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1882239471

   
   ## CI report:
   
   * f7566099db43c39a06db5e4ae905a65dfd69a7ca Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21190)
   * 378a6a619dc288301c70275483bbb0ecfa73a7f1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21875)
   
   



Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-08 Thread via GitHub


hudi-bot commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1882224728

   
   ## CI report:
   
   * f7566099db43c39a06db5e4ae905a65dfd69a7ca Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21190)
   * 378a6a619dc288301c70275483bbb0ecfa73a7f1 UNKNOWN
   
   



Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-08 Thread via GitHub


hudi-bot commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1882210046

   
   ## CI report:
   
   * f7566099db43c39a06db5e4ae905a65dfd69a7ca Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21190)
   
   



Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2024-01-08 Thread via GitHub


rohitmittapalli commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1882191866

   @hudi-bot run azure





Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2023-11-28 Thread via GitHub


nsivabalan commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1408722685


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ParquetDFSSource.java:
##
@@ -52,6 +52,6 @@ public Pair<Option<Dataset<Row>>, String> fetchNextBatch(Option<String> lastCkpt
   }
 
   private Dataset<Row> fromFiles(String pathStr) {
-    return sparkSession.read().parquet(pathStr.split(","));
+    return sparkSession.read().option("mergeSchema", "true").parquet(pathStr.split(","));

Review Comment:
   Can we add a config property for this and enable the option based on it?
   You can introduce a new config class named `ParquetDFSSourceConfig` and add a config property for `mergeSchema`.
   Set the default to true.
   
   You can take a look at https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/config/DFSPathSelectorConfig.java for reference.
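
As a minimal sketch of what "enable based on a config property" could look like, the plain-Java example below reads a boolean property and only adds the `mergeSchema` reader option when it resolves to true. It does not use Spark or Hudi; the class `MergeSchemaToggleSketch`, the `readerOptions` helper, and the literal key string are hypothetical and chosen for illustration. The default is passed in as an argument so the sketch does not take a position on which default is right.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Illustrative only: gates a reader option on a boolean config property.
final class MergeSchemaToggleSketch {

  // Assumed key for illustration; the real key is whatever the config class defines.
  static final String MERGE_SCHEMA_KEY = "hoodie.streamer.source.parquet.dfs.mergeSchema";

  // Builds the options that would be handed to the Parquet reader.
  static Map<String, String> readerOptions(Properties props, boolean defaultValue) {
    boolean mergeSchema = Boolean.parseBoolean(
        props.getProperty(MERGE_SCHEMA_KEY, String.valueOf(defaultValue)));
    Map<String, String> options = new LinkedHashMap<>();
    if (mergeSchema) {
      // Only opt in when the property (or its default) says so.
      options.put("mergeSchema", "true");
    }
    return options;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(readerOptions(props, false)); // {}
    props.setProperty(MERGE_SCHEMA_KEY, "true");
    System.out.println(readerOptions(props, false)); // {mergeSchema=true}
  }
}
```

In the actual source, the resulting flag would presumably be passed to `sparkSession.read().option("mergeSchema", ...)` in place of the hard-coded `"true"` shown in the hunk above.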






Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2023-11-28 Thread via GitHub


hudi-bot commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1831064240

   
   ## CI report:
   
   * f7566099db43c39a06db5e4ae905a65dfd69a7ca Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21190)
   
   



Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2023-11-28 Thread via GitHub


hudi-bot commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1830876966

   
   ## CI report:
   
   * f7566099db43c39a06db5e4ae905a65dfd69a7ca Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21190)
   
   



Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2023-11-28 Thread via GitHub


hudi-bot commented on PR #10199:
URL: https://github.com/apache/hudi/pull/10199#issuecomment-1830868076

   
   ## CI report:
   
   * f7566099db43c39a06db5e4ae905a65dfd69a7ca UNKNOWN
   
   



Re: [PR] Merge schema in ParuqetDFSSource [hudi]

2023-11-28 Thread via GitHub


rohitmittapalli commented on code in PR #10199:
URL: https://github.com/apache/hudi/pull/10199#discussion_r1408463667


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ParquetDFSSource.java:
##
@@ -32,7 +32,7 @@
 /**
  * DFS Source that reads parquet data.
  */
-public class ParquetDFSSource extends RowSource {
+πpublic class ParquetDFSSource extends RowSource {

Review Comment:
   ```suggestion
   public class ParquetDFSSource extends RowSource {
   ```






[PR] Merge schema in ParuqetDFSSource [hudi]

2023-11-28 Thread via GitHub


rohitmittapalli opened a new pull request, #10199:
URL: https://github.com/apache/hudi/pull/10199

   ### Change Logs
   
   ParquetDFSSource will merge the schema across files in a particular read.
   
   ### Impact
   
   ParquetDFSSource will merge the schema across files in a particular read.
   
   ### Risk level (write none, low medium or high below)
   
   Low
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._
   
   ### Contributor's checklist
   
   - [X] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [X] Change Logs and Impact were stated clearly
   - [X] Adequate tests were added if applicable
   - [ ] CI passed
   

