Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780459956

## CI report:

* 6972591365be4bde76c7b41dc5122c63ffd18c79 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20495)
* d5132f11b2cd4fff06c286ef9741dbaa80fa0463 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20500)

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]
hudi-bot commented on PR #9922:
URL: https://github.com/apache/hudi/pull/9922#issuecomment-1780453606

## CI report:

* a9b3bd0f7aa15a9fbef3caaae798aa34790e027a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20499)
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883:
URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780453442

## CI report:

* 6972591365be4bde76c7b41dc5122c63ffd18c79 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20495)
* d5132f11b2cd4fff06c286ef9741dbaa80fa0463 UNKNOWN
Re: [PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]
hudi-bot commented on PR #9922:
URL: https://github.com/apache/hudi/pull/9922#issuecomment-1780446929

## CI report:

* a9b3bd0f7aa15a9fbef3caaae798aa34790e027a UNKNOWN
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1780446730

## CI report:

* de2b5c95028029ff06d1f360763ca3f83c661ff3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20497)
Re: [PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]
danny0405 commented on PR #9922:
URL: https://github.com/apache/hudi/pull/9922#issuecomment-1780425889

Thanks for the fix. At a high level, I think we should avoid relying on Spark-specific mechanisms for any rollback/cleaning improvement here; it is hacky to maintain and it is not tenable for all engines.
Re: [PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]
danny0405 commented on code in PR #9922:
URL: https://github.com/apache/hudi/pull/9922#discussion_r1372590546

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java:

@@ -272,11 +272,12 @@ private static Path makeNewPath(FileSystem fs, String partitionPath, String file
    *
    * @param partitionPath Partition path
    */
-  private static void createMarkerFile(String partitionPath,
+  private void createMarkerFile(String partitionPath,
                                        String dataFileName,
                                        String instantTime,
                                        HoodieTable table,
                                        HoodieWriteConfig writeConfig) {
+    stopIfAborted();

Review Comment:
   Is creating the marker file the right time to abort the task?
Re: [PR] [HUDI-6946] Data Duplicates with range pruning while using hoodie.bloom.index.use.metadata [hudi]
danny0405 commented on PR #9886:
URL: https://github.com/apache/hudi/pull/9886#issuecomment-1780419088

Nice catch @xicm. We may need to check whether `config.getColumnsEnabledForColumnStatsIndex()` contains the `hoodie.table.recordkey.fields` field:

- if `config.getColumnsEnabledForColumnStatsIndex()` is empty, that means all the fields (including the metadata fields) are indexed in col_stats, so we can still use `hoodie.table.recordkey.fields` (note that if `hoodie.table.recordkey.fields` is not configured, we can fall back to `_hoodie_record_key`);
- if it is not empty, we need to check whether `hoodie.table.recordkey.fields` is included in col_stats; use it if it is included, and throw an exception otherwise. A sketch of this selection logic follows below.

It would be great if we can add some test cases for the scenarios mentioned in https://github.com/apache/hudi/issues/9870.
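For illustration, a minimal sketch of the selection logic described in the comment above. The class name, helper name, and exception type are hypothetical, not Hudi API; only the config semantics come from the comment itself:

```java
import java.util.List;

// Illustrative only: picks the field usable for bloom-index range pruning
// based on which columns are indexed in col_stats, per the rules above.
public final class RecordKeyFieldResolver {

  static String resolveIndexedKeyField(List<String> colStatsColumns, String recordKeyFields) {
    // An empty col_stats column list means every field, metadata fields included, is indexed.
    if (colStatsColumns.isEmpty()) {
      // Fall back to the metadata key column when no record key field is configured.
      return (recordKeyFields == null || recordKeyFields.isEmpty())
          ? "_hoodie_record_key" : recordKeyFields;
    }
    // Only a subset is indexed: the record key field must be part of it.
    if (colStatsColumns.contains(recordKeyFields)) {
      return recordKeyFields;
    }
    throw new IllegalStateException(
        "Record key field " + recordKeyFields + " is not indexed in col_stats");
  }
}
```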
[jira] [Assigned] (HUDI-6989) Stop handling more data if task is aborted & clean partial files if possible in task side
[ https://issues.apache.org/jira/browse/HUDI-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hui An reassigned HUDI-6989:
----------------------------

    Assignee: Hui An

> Stop handling more data if task is aborted & clean partial files if possible in task side
> ------------------------------------------------------------------------------------------
>
>                 Key: HUDI-6989
>                 URL: https://issues.apache.org/jira/browse/HUDI-6989
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Hui An
>            Assignee: Hui An
>            Priority: Major
>              Labels: pull-request-available
>
> Spark would set interrupt status in TaskContext if the task is aborted, HUDI
> needs to respect that to stop immediately. Also, we can clean partial files
> at task side to ensure these files won't be left.
[jira] [Updated] (HUDI-6989) Stop handling more data if task is aborted & clean partial files if possible in task side
[ https://issues.apache.org/jira/browse/HUDI-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6989:
---------------------------------

    Labels: pull-request-available  (was: )

> Stop handling more data if task is aborted & clean partial files if possible in task side
> ------------------------------------------------------------------------------------------
>
>                 Key: HUDI-6989
>                 URL: https://issues.apache.org/jira/browse/HUDI-6989
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Hui An
>            Priority: Major
>              Labels: pull-request-available
>
> Spark would set interrupt status in TaskContext if the task is aborted, HUDI
> needs to respect that to stop immediately. Also, we can clean partial files
> at task side to ensure these files won't be left.
[PR] [HUDI-6989] Stop handling more data if task is aborted & clean partial files if possible in task side [hudi]
boneanxs opened a new pull request, #9922:
URL: https://github.com/apache/hudi/pull/9922

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

1. `HoodieWriteHandle` needs to stop immediately if the task has failed.
2. `TaskContextSupplier` adds a status to identify whether the task has failed.
3. Clean up files on the task side if the task fails.

When `Executor` tries to kill a task, Spark sets the `TaskContext` as interrupted and interrupts the task thread. Spark internally wraps the input in an `InterruptibleIterator` to monitor the task status and kill the task if it is interrupted, but we can still see e.g. `HoodieMergeHandle` keep writing data even after the executor has tried to kill it (see below). The reason is that during `init`, `HoodieMergeHandle` first iterates the `InterruptibleIterator` and builds `keyToNewRecords`; if the kill signal arrives after the `init` method, `HoodieMergeHandle` may still write new records.

```
23/03/23 02:28:45 INFO HoodieMergeHandle: MaxMemoryPerPartitionMerge => 1073741824
23/03/23 02:28:46 INFO Executor: Executor is trying to kill task 2.1 in stage 11.0 (TID 1471), reason: another attempt succeeded
23/03/23 02:28:46 INFO Executor: Executor is trying to kill task 2.1 in stage 11.0 (TID 1471), reason: Stage finished
23/03/23 02:28:47 INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 0, Total size in bytes of MemoryBasedMap => 0, Number of entries in BitCaskDiskMap => 0, Size of file spilled to disk => 0
23/03/23 02:28:47 INFO HoodieMergeHandle: partitionPath:grass_region=test, fileId to be merged:d3ee8406-4011-44a4-8913-8be0349a6686-0
```

Btw, although Hudi could exit immediately in `BaseHoodieQueueBasedExecutor.execute` if the task thread is interrupted, `BaseHoodieQueueBasedExecutor.awaitTermination` clears the `interrupt` exception and waits an extra 60 sec to let the executor proceed. So the task will wait at least 60s after it has been killed. We need to avoid that.

### Impact

_Describe any public API or user-facing feature change or any performance impact._

None

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

None

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
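For readers following along, a minimal sketch of the kind of interrupt check the PR describes, using Spark's public `TaskContext` API. The class name and exception type are illustrative (the helper name echoes the `stopIfAborted()` call quoted in the review comment earlier in this digest), not the PR's actual implementation:

```java
import org.apache.spark.TaskContext;

// Sketch only: check Spark's task interrupt status before doing more work,
// so a write handle stops promptly once the executor tries to kill the task.
public final class TaskAbortCheck {

  static void stopIfAborted() {
    TaskContext ctx = TaskContext.get();  // null when called outside a Spark task
    if (ctx != null && ctx.isInterrupted()) {
      // Throwing here lets the write handle abort early and clean up partial files.
      throw new RuntimeException("Task " + ctx.taskAttemptId() + " was aborted");
    }
  }
}
```

Calling such a check at write-path hot spots (e.g. before each record write or marker creation) is what lets the task honor the interrupt instead of relying solely on `InterruptibleIterator`.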
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1780406926

## CI report:

* d28ebc812328746cb530a35db70df43e67c6ffc2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20250)
* de2b5c95028029ff06d1f360763ca3f83c661ff3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20497)
Re: [PR] [HUDI-6949] Spark support non-blocking concurrency control [hudi]
hudi-bot commented on PR #9921:
URL: https://github.com/apache/hudi/pull/9921#issuecomment-1780401311

## CI report:

* 00152b4450f2453c6b37f26dde9cfc19fe865425 UNKNOWN
* 8401cb3b04bea4ac0388d33ed40ad5853a5b7090 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20496)
Re: [I] [SUPPORT] Trino can't read tables created by Flink Hudi conector [hudi]
danny0405 commented on issue #9435:
URL: https://github.com/apache/hudi/issues/9435#issuecomment-1780393004

Were you able to debug the local fs test failures?
Re: [I] [bug] [fatal error] Severe bug seems to be a deadlock in the position of the BucketWrite Operator. [hudi]
danny0405 commented on issue #9917:
URL: https://github.com/apache/hudi/issues/9917#issuecomment-1780391330

Did you enable checkpointing?
Re: [PR] [HUDI-6969] Add speed limit for stream read [hudi]
danny0405 commented on code in PR #9904:
URL: https://github.com/apache/hudi/pull/9904#discussion_r1372556576

## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestData.java:

@@ -503,6 +503,20 @@ public static String rowDataToString(List rows) {
   public static void writeData(
       List dataBuffer, Configuration conf) throws Exception {
+    writeData(dataBuffer, conf, 1);
+  }
+
+  /**
+   * Write a list of row data with Hoodie format base on the given configuration.
+   *
+   * @param dataBuffer The data buffer to write
+   * @param conf       The flink configuration
+   * @param ckpId      The checkpoint id
+   * @throws Exception if error occurs
+   */
+  public static void writeData(
+      List dataBuffer,

Review Comment:
   The checkpoint id does not affect the data write, there is no need to specify it explicitly.
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
hudi-bot commented on PR #9834:
URL: https://github.com/apache/hudi/pull/9834#issuecomment-1780377139

## CI report:

* d28ebc812328746cb530a35db70df43e67c6ffc2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20250)
* de2b5c95028029ff06d1f360763ca3f83c661ff3 UNKNOWN
[jira] [Closed] (HUDI-6821) Make multiple base file formats within each file group.
[ https://issues.apache.org/jira/browse/HUDI-6821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit closed HUDI-6821.
-----------------------------

    Resolution: Done

> Make multiple base file formats within each file group.
> --------------------------------------------------------
>
>                 Key: HUDI-6821
>                 URL: https://issues.apache.org/jira/browse/HUDI-6821
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Assignee: Sagar Sumit
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
> Ability to mix different types of base files within a single table or even a
> single file group (e.g images, json, vectors ...)
[hudi] branch master updated: [HUDI-6821] Support multiple base file formats in Hudi table (#9761)
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 8bf44c01b56  [HUDI-6821] Support multiple base file formats in Hudi table (#9761)
8bf44c01b56 is described below

commit 8bf44c01b56dd3afe5323dc7566971cee2e46d50
Author: Sagar Sumit
AuthorDate: Thu Oct 26 09:27:02 2023 +0530

    [HUDI-6821] Support multiple base file formats in Hudi table (#9761)
---
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  11 +-
 .../java/org/apache/hudi/io/HoodieWriteHandle.java |   3 +-
 .../java/org/apache/hudi/table/HoodieTable.java    |  10 +-
 .../table/action/bootstrap/BootstrapUtils.java     |   9 +-
 ...sistentHashingBucketClusteringPlanStrategy.java |   4 +-
 .../rollback/ListingBasedRollbackStrategy.java     |   6 +-
 .../table/upgrade/ZeroToOneUpgradeHandler.java     |   7 +-
 .../io/storage/row/HoodieRowDataCreateHandle.java  |   4 +-
 .../client/TestHoodieJavaWriteClientInsert.java    |   4 +-
 .../hudi/client/TestJavaHoodieBackedMetadata.java  |   5 -
 .../TestHoodieJavaClientOnCopyOnWriteStorage.java  |   3 +-
 .../commit/TestJavaCopyOnWriteActionExecutor.java  |   4 +-
 .../testutils/HoodieJavaClientTestHarness.java     |   4 +
 .../SparkBootstrapCommitActionExecutor.java        |   2 +-
 .../TestHoodieClientOnCopyOnWriteStorage.java      |  14 +-
 .../table/action/bootstrap/TestBootstrapUtils.java |  12 +-
 .../commit/TestCopyOnWriteActionExecutor.java      |   5 +-
 .../TestHoodieSparkMergeOnReadTableRollback.java   |   2 +-
 .../hudi/testutils/HoodieClientTestBase.java       |   5 +
 .../testutils/HoodieSparkClientTestHarness.java    |   5 -
 .../apache/hudi/common/model/HoodieFileFormat.java |   9 +
 .../hudi/common/table/HoodieTableConfig.java       |  10 +
 .../hudi/common/table/HoodieTableMetaClient.java   |  19 +-
 .../org/apache/hudi/common/util/BaseFileUtils.java |   5 -
 .../org/apache/hudi/common/fs/TestFSUtils.java     |  27 ++
 .../hudi/common/testutils/HoodieTestTable.java     |   3 +-
 .../org/apache/hudi/BaseFileOnlyRelation.scala     |   4 +-
 .../main/scala/org/apache/hudi/DefaultSource.scala |  52 ++--
 .../scala/org/apache/hudi/HoodieBaseRelation.scala | 107 +++-
 ...tils.scala => HoodieSparkFileFormatUtils.scala} |  35 +--
 .../scala/org/apache/hudi/HoodieWriterUtils.scala  |   9 +-
 .../hudi/MergeOnReadIncrementalRelation.scala      |   4 +-
 .../apache/hudi/MergeOnReadSnapshotRelation.scala  |  92 ---
 .../sql/catalyst/catalog/HoodieCatalogTable.scala  |   4 +-
 .../datasources/HoodieMultipleBaseFileFormat.scala | 278 +
 .../spark/sql/hudi/ProvidesHoodieConfig.scala      |   2 +-
 .../RepairMigratePartitionMetaProcedure.scala      |   2 +-
 .../org/apache/hudi/functional/TestBootstrap.java  |   8 +-
 .../apache/hudi/functional/TestOrcBootstrap.java   |   8 +-
 .../apache/hudi/testutils/DataSourceTestUtils.java |  20 +-
 .../TestHoodieMultipleBaseFileFormat.scala         | 123 +
 .../datasources/Spark32NestedSchemaPruning.scala   |   3 +-
 .../hudi/utilities/streamer/HoodieStreamer.java    |  10 +-
 43 files changed, 712 insertions(+), 241 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index cc3876338cc..5ae7ab25fbd 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -219,7 +219,7 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "the timeline as an immutable log relying only on atomic writes for object storage.");
 
   public static final ConfigProperty BASE_FILE_FORMAT = ConfigProperty
-      .key("hoodie.table.base.file.format")
+      .key("hoodie.base.file.format")
       .defaultValue(HoodieFileFormat.PARQUET)
       .withValidValues(HoodieFileFormat.PARQUET.name(), HoodieFileFormat.ORC.name(), HoodieFileFormat.HFILE.name())
       .withAlternatives("hoodie.table.ro.file.format")
@@ -1198,6 +1198,10 @@ public class HoodieWriteConfig extends HoodieConfig {
     return getString(BASE_PATH);
   }
 
+  public HoodieFileFormat getBaseFileFormat() {
+    return HoodieFileFormat.valueOf(getStringOrDefault(BASE_FILE_FORMAT));
+  }
+
   public HoodieRecordMerger getRecordMerger() {
     List mergers = StringUtils.split(getStringOrDefault(RECORD_MERGER_IMPLS), ",").stream()
         .map(String::trim)
@@ -2705,6 +2709,11 @@ public class HoodieWriteConfig extends HoodieConfig {
       return this;
     }
 
+    public Builder withBaseFileFormat(String baseFileFormat) {
+      writeConfig.setValue(BASE_FILE_FORMAT, HoodieFileFormat.valueOf(baseFileFormat).name());
+      return this;
+    }
+
     public Builder
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
codope merged PR #9761:
URL: https://github.com/apache/hudi/pull/9761
[jira] [Updated] (HUDI-6989) Stop handling more data if task is aborted & clean partial files if possible in task side
[ https://issues.apache.org/jira/browse/HUDI-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hui An updated HUDI-6989:
-------------------------

    Summary: Stop handling more data if task is aborted & clean partial files if possible in task side  (was: Stop ingesting more data if task is aborted & clean partial files if possible in task side)

> Stop handling more data if task is aborted & clean partial files if possible in task side
> ------------------------------------------------------------------------------------------
>
>                 Key: HUDI-6989
>                 URL: https://issues.apache.org/jira/browse/HUDI-6989
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Hui An
>            Priority: Major
>
> Spark would set interrupt status in TaskContext if the task is aborted, HUDI
> needs to respect that to stop immediately. Also, we can clean partial files
> at task side to ensure these files won't be left.
[jira] [Created] (HUDI-6989) Stop ingesting more data if task is aborted & clean partial files if possible in task side
Hui An created HUDI-6989:
-----------------------------

             Summary: Stop ingesting more data if task is aborted & clean partial files if possible in task side
                 Key: HUDI-6989
                 URL: https://issues.apache.org/jira/browse/HUDI-6989
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Hui An


Spark would set interrupt status in TaskContext if the task is aborted, HUDI needs to respect that to stop immediately. Also, we can clean partial files at task side to ensure these files won't be left.
Re: [PR] [HUDI-6949] Spark support non-blocking concurrency control [hudi]
hudi-bot commented on PR #9921:
URL: https://github.com/apache/hudi/pull/9921#issuecomment-1780371965

## CI report:

* 00152b4450f2453c6b37f26dde9cfc19fe865425 UNKNOWN
* 8401cb3b04bea4ac0388d33ed40ad5853a5b7090 UNKNOWN
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
nsivabalan commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1372547231

## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java:

@@ -173,7 +175,12 @@ private RecordIterator(Schema readerSchema, Schema writerSchema, byte[] content)
       this.totalRecords = this.dis.readInt();
     }
 
-    this.reader = new GenericDatumReader<>(writerSchema, readerSchema);
+    if (recordNeedsRewriteForExtendedAvroTypePromotion(writerSchema, readerSchema)) {

Review Comment:
   Again, let's take an informed decision on whether we want to do this in this patch or in a follow-up.
Re: [PR] [HUDI-6923] Fixing bug with sanitization for rowSource [hudi]
harsh1231 commented on code in PR #9834:
URL: https://github.com/apache/hudi/pull/9834#discussion_r1372547228

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/SanitizationUtils.java:

@@ -120,6 +120,11 @@ public static Dataset sanitizeColumnNamesForAvro(Dataset inputDataset,
     return targetDataset;
   }
 
+  public static Dataset sanitizeColumnNamesForAvro(Dataset inputDataset, TypedProperties props) {

Review Comment:
   This test covers the method above:
   https://github.com/apache/hudi/blob/de2b5c95028029ff06d1f360763ca3f83c661ff3/hudi-utilities/src/test/java/org/apache/hudi/utilities/deltastreamer/TestSourceFormatAdapter.java#L131
Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]
danny0405 commented on PR #9911:
URL: https://github.com/apache/hudi/pull/9911#issuecomment-1780367765

Retriggered the failed tests.
Re: [PR] [HUDI-6949] Spark support non-blocking concurrency control [hudi]
hudi-bot commented on PR #9921:
URL: https://github.com/apache/hudi/pull/9921#issuecomment-1780366795

## CI report:

* 00152b4450f2453c6b37f26dde9cfc19fe865425 UNKNOWN
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883:
URL: https://github.com/apache/hudi/pull/9883#discussion_r1372535469

## hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieReaderContext.java:

@@ -160,9 +169,11 @@ public Map generateMetadataForRecord(
    * @param schema The Avro schema of the record.
    * @return A mapping containing the metadata.
    */
-  public Map generateMetadataForRecord(T record, Schema schema) {
+  public Map generateMetadataForRecord(T record, Schema schema, boolean isPartial) {
     Map meta = new HashMap<>();
     meta.put(INTERNAL_META_RECORD_KEY, getRecordKey(record, schema));
+    meta.put(INTERNAL_META_SCHEMA, schema);
+    meta.put(INTERNAL_META_IS_PARTIAL, isPartial);

Review Comment:
   I'm wondering whether we can represent the metadata as a POJO to make the interface more explicit and clear.

## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/BaseSparkInternalRowReaderContext.java:

@@ -94,16 +94,17 @@ public Comparable getOrderingValue(Option rowOption,
   @Override
   public HoodieRecord constructHoodieRecord(Option rowOption,
-                                            Map metadataMap,
-                                            Schema schema) {
+                                            Map metadataMap) {
     if (!rowOption.isPresent()) {
       return new HoodieEmptyRecord<>(
           new HoodieKey((String) metadataMap.get(INTERNAL_META_RECORD_KEY), (String) metadataMap.get(INTERNAL_META_PARTITION_PATH)),
           HoodieRecord.HoodieRecordType.SPARK);
     }
+    Schema schema = (Schema) metadataMap.get(INTERNAL_META_SCHEMA);
     InternalRow row = rowOption.get();
+    boolean isPartial = (boolean) metadataMap.getOrDefault(INTERNAL_META_IS_PARTIAL, false);
     return new HoodieSparkRecord(row, HoodieInternalRowUtils.getCachedSchema(schema));

Review Comment:
   The `isPartial` flag is never used.

## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:

@@ -51,6 +61,28 @@ class SparkFileFormatInternalRowReaderContext(baseFileReader: PartitionedFile =>
                              requiredSchema: Schema,
                              conf: Configuration): ClosableIterator[InternalRow] = {
     val fileInfo = sparkAdapter.getSparkPartitionedFileUtils.createPartitionedFile(partitionValues, filePath, start, length)
-    new CloseableInternalRowIterator(baseFileReader.apply(fileInfo))
+    if (filePath.toString.contains(HoodieLogFile.DELTA_EXTENSION)) {

Review Comment:
   Use `FSUtils.isLogFile` instead.

## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java:

@@ -151,7 +151,7 @@ public HoodieFileGroupReader(HoodieReaderContext readerContext,
   public void initRecordIterators() {
     this.baseFileIterator = baseFilePath.isPresent()
         ? readerContext.getFileRecordIterator(
-        baseFilePath.get().getHadoopPath(), start, length, readerState.baseFileAvroSchema, readerState.baseFileAvroSchema, hadoopConf)
+            baseFilePath.get().getHadoopPath(), start, length, readerState.baseFileAvroSchema, readerState.baseFileAvroSchema, hadoopConf)

Review Comment:
   Unnecessary change?

## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieKeyBasedFileGroupRecordBuffer.java:

@@ -127,10 +124,12 @@ public boolean hasNext() throws IOException {
       String recordKey = readerContext.getRecordKey(baseRecord, baseFileSchema);
       Pair, Map> logRecordInfo = records.remove(recordKey);
+      Map metadata = readerContext.generateMetadataForRecord(
+          baseRecord, baseFileSchema, false);

Review Comment:
   Watch out for a performance regression from constructing the metadata per record.

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/HoodieSparkRecordMerger.java:

@@ -70,9 +71,11 @@ public Option> merge(HoodieRecord older, Schema oldSc
       }
     }
     if (older.getOrderingValue(oldSchema, props).compareTo(newer.getOrderingValue(newSchema, props)) > 0) {
-      return Option.of(Pair.of(older, oldSchema));
+      return Option.of(SparkPartialMergingUtils.mergePartialRecords(
+          (HoodieSparkRecord) newer, newSchema, (HoodieSparkRecord) older, oldSchema, props));

Review Comment:
   The partial merge may not happen, so maybe give the utility a better name.
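To make the first suggestion above concrete, a minimal sketch of what such a metadata POJO could look like. This is illustrative only, not the PR's actual design; the class and accessor names are assumptions:

```java
import org.apache.avro.Schema;

// Illustrative replacement for the untyped Map<String, Object> record metadata:
// each entry the reader context produces becomes an explicit, typed member.
public final class RecordMetadata {
  private final String recordKey;
  private final String partitionPath;
  private final Schema schema;
  private final boolean partial;

  public RecordMetadata(String recordKey, String partitionPath, Schema schema, boolean partial) {
    this.recordKey = recordKey;
    this.partitionPath = partitionPath;
    this.schema = schema;
    this.partial = partial;
  }

  public String getRecordKey() { return recordKey; }
  public String getPartitionPath() { return partitionPath; }
  public Schema getSchema() { return schema; }
  public boolean isPartial() { return partial; }
}
```

A typed holder like this would also sidestep the per-record `HashMap` allocation flagged in the performance comment above.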
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
nsivabalan commented on code in PR #9743:
URL: https://github.com/apache/hudi/pull/9743#discussion_r1372446787

## hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java:

@@ -116,9 +116,24 @@ public static String getAvroRecordQualifiedName(String tableName) {
     return "hoodie." + sanitizedTableName + "." + sanitizedTableName + "_record";
   }
 
+  /**
+   * Validate whether the {@code targetSchema} is a valid evolution of {@code sourceSchema}.
+   * Basically {@link #isCompatibleProjectionOf(Schema, Schema)} but type promotion in the
+   * opposite direction
+   */
+  public static boolean isValidEvolutionOf(Schema sourceSchema, Schema targetSchema) {
+    return (sourceSchema.getType() == Schema.Type.NULL) || isProjectionOfInternal(sourceSchema, targetSchema,
+        AvroSchemaUtils::isAtomicSchemasCompatibleEvolution);
+  }
+
+  private static boolean isAtomicSchemasCompatibleEvolution(Schema oneAtomicType, Schema anotherAtomicType) {

Review Comment:
   Can we write extensive docs on these methods? In general we have not been very comfortable touching this part of the code; maybe Meng Tao and a few others are, but the rest of the PMCs have generally been very cautious. Can you add more docs around these methods so they are easier to maintain going forward?

## hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:

@@ -79,6 +81,14 @@ public class HoodieCommonConfig extends HoodieConfig {
       + " operation will fail schema compatibility check. Set this option to true will make the newly added "
       + " column nullable to successfully complete the write operation.");
 
+  public static final ConfigProperty ADD_NULL_FOR_DELETED_COLUMNS = ConfigProperty
+      .key("hoodie.datasource.add.null.for.deleted.columns")

Review Comment:
   "hoodie.datasource.set.null.for.missing.columns"

## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java:

@@ -173,7 +175,12 @@ private RecordIterator(Schema readerSchema, Schema writerSchema, byte[] content)
       this.totalRecords = this.dis.readInt();
     }
 
-    this.reader = new GenericDatumReader<>(writerSchema, readerSchema);
+    if (recordNeedsRewriteForExtendedAvroTypePromotion(writerSchema, readerSchema)) {

Review Comment:
   We should try to unify our conventions across the code base. We use reader and writer schema here, table schema and source schema in the outer layers, prevSchema in some places, and sourceSchema and targetSchema in a few others. We should align these and use a standard terminology throughout: maybe reader/writer in the write handle classes, and sourceSchema/targetSchema elsewhere.

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java:

@@ -359,7 +360,8 @@ public Option> getLastCommitMetadataWi
     return Option.fromJavaOptional(
         getCommitMetadataStream()
             .filter(instantCommitMetadataPair ->
-                !StringUtils.isNullOrEmpty(instantCommitMetadataPair.getValue().getMetadata(HoodieCommitMetadata.SCHEMA_KEY)))
+                !StringUtils.isNullOrEmpty(instantCommitMetadataPair.getValue().getMetadata(HoodieCommitMetadata.SCHEMA_KEY))
+                    && !WriteOperationType.schemaCantChange(instantCommitMetadataPair.getRight().getOperationType()))

Review Comment:
   Minor: can you switch the order of the conditions? Let's first check the operation type, and then check for SCHEMA_KEY in the extra metadata.

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSchemaUtils.scala:

@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.config.HoodieConfig
+import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.internal.schema.InternalSchema
+
+/**
+ * Util methods for Schema evolution in Hudi
+ */
+object HoodieSchemaUtils {
+  /**
+   * get latest internalSchema from table
+   *
+   * @param config
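For readers following the type-promotion discussion, a small self-contained sketch of the standard Avro reader/writer compatibility check that `isValidEvolutionOf` builds on. This uses plain Avro APIs, not the PR's helpers, so treat it as background illustration rather than Hudi behavior:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.SchemaCompatibility;

public class AvroEvolutionCheck {
  public static void main(String[] args) {
    // Writer wrote an int field; reader asks for long: a legal type promotion.
    Schema writer = SchemaBuilder.record("rec").fields().requiredInt("f").endRecord();
    Schema reader = SchemaBuilder.record("rec").fields().requiredLong("f").endRecord();

    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);

    // Prints COMPATIBLE: int -> long is allowed by Avro schema resolution.
    System.out.println(result.getType());

    // The reverse direction (long written, int requested) is INCOMPATIBLE.
    System.out.println(
        SchemaCompatibility.checkReaderWriterCompatibility(writer, reader).getType());
  }
}
```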
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
codope commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1372536310

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/HoodieMultipleBaseFileFormat.scala:

@@ -0,0 +1,278 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+import org.apache.hudi.DataSourceReadOptions.{REALTIME_PAYLOAD_COMBINE_OPT_VAL, REALTIME_SKIP_MERGE_OPT_VAL}
+import org.apache.hudi.MergeOnReadSnapshotRelation.createPartitionedFile
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieLogFile}
+import org.apache.hudi.{HoodieBaseRelation, HoodieTableSchema, HoodieTableState, LogFileIterator, MergeOnReadSnapshotRelation, PartitionFileSliceMapping, RecordMergingFileIterator, SparkAdapterSupport}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.sql.HoodieCatalystExpressionUtils.generateUnsafeProjection
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.JoinedRow
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.{StructField, StructType}
+import org.apache.spark.util.SerializableConfiguration
+
+import scala.collection.mutable
+import scala.jdk.CollectionConverters.asScalaIteratorConverter
+
+/**
+ * File format that supports reading multiple base file formats in a table.
+ */
+class HoodieMultipleBaseFileFormat(tableState: Broadcast[HoodieTableState],
+                                   tableSchema: Broadcast[HoodieTableSchema],
+                                   tableName: String,
+                                   mergeType: String,
+                                   mandatoryFields: Seq[String],
+                                   isMOR: Boolean) extends FileFormat with SparkAdapterSupport {
+  private val parquetFormat = new ParquetFileFormat()
+  private val orcFormat = new OrcFileFormat()
+
+  override def inferSchema(sparkSession: SparkSession,
+                           options: Map[String, String],
+                           files: Seq[FileStatus]): Option[StructType] = {
+    // This is a simple heuristic assuming all files have the same extension.
+    val fileFormat = detectFileFormat(files.head.getPath.toString)
+
+    fileFormat match {
+      case "parquet" => parquetFormat.inferSchema(sparkSession, options, files)
+      case "orc" => orcFormat.inferSchema(sparkSession, options, files)
+      case _ => throw new UnsupportedOperationException(s"File format $fileFormat is not supported.")
+    }
+  }
+
+  override def isSplitable(sparkSession: SparkSession, options: Map[String, String], path: Path): Boolean = {
+    false
+  }
+
+  // Used so that the planner only projects once and does not stack overflow
+  var isProjected = false
+
+  /**
+   * Support batch needs to remain consistent, even if one side of a bootstrap merge can support
+   * while the other side can't
+   */
+  private var supportBatchCalled = false
+  private var supportBatchResult = false
+
+  override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
+    if (!supportBatchCalled) {
+      supportBatchCalled = true
+      supportBatchResult =
+        !isMOR && parquetFormat.supportBatch(sparkSession, schema) && orcFormat.supportBatch(sparkSession, schema)
+    }
+    supportBatchResult
+  }
+
+  override def prepareWrite(sparkSession: SparkSession,
+                            job: Job,
+                            options: Map[String, String],
+                            dataSchema: StructType): OutputWriterFactory = {
+    throw new UnsupportedOperationException("Write operations are not supported in this example.")
+  }
+
+  override def buildReaderWithPartitionValues(sparkSession: SparkSession,
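The quoted diff references a private `detectFileFormat` helper whose body falls outside the quoted hunk. An extension-based heuristic like the following would match how it is used in `inferSchema` above; this is an illustrative guess, not the committed implementation:

```java
public final class FileFormatDetection {
  // Illustrative only: infer the base file format from the file extension,
  // mirroring how detectFileFormat is used in the quoted inferSchema method.
  static String detectFileFormat(String path) {
    int dot = path.lastIndexOf('.');
    return dot < 0 ? "" : path.substring(dot + 1).toLowerCase();
  }

  public static void main(String[] args) {
    System.out.println(detectFileFormat("/tbl/p1/abc123_1-0-1_20231026.parquet")); // parquet
    System.out.println(detectFileFormat("/tbl/p1/def456_1-0-1_20231026.orc"));     // orc
  }
}
```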
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
codope commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1372534307

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/HoodieMultipleBaseFileFormat.scala:

@@ -0,0 +1,278 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.Job
+import org.apache.hudi.DataSourceReadOptions.{REALTIME_PAYLOAD_COMBINE_OPT_VAL, REALTIME_SKIP_MERGE_OPT_VAL}
+import org.apache.hudi.MergeOnReadSnapshotRelation.createPartitionedFile
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieLogFile}
+import org.apache.hudi.{HoodieBaseRelation, HoodieTableSchema, HoodieTableState, LogFileIterator, MergeOnReadSnapshotRelation, PartitionFileSliceMapping, RecordMergingFileIterator, SparkAdapterSupport}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.sql.HoodieCatalystExpressionUtils.generateUnsafeProjection
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.JoinedRow
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
+import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.{StructField, StructType}
+import org.apache.spark.util.SerializableConfiguration
+
+import scala.collection.mutable
+import scala.jdk.CollectionConverters.asScalaIteratorConverter
+
+/**
+ * File format that supports reading multiple base file formats in a table.
+ */
+class HoodieMultipleBaseFileFormat(tableState: Broadcast[HoodieTableState],
+                                   tableSchema: Broadcast[HoodieTableSchema],
+                                   tableName: String,
+                                   mergeType: String,
+                                   mandatoryFields: Seq[String],
+                                   isMOR: Boolean) extends FileFormat with SparkAdapterSupport {
+  private val parquetFormat = new ParquetFileFormat()
+  private val orcFormat = new OrcFileFormat()
+
+  override def inferSchema(sparkSession: SparkSession,
+                           options: Map[String, String],
+                           files: Seq[FileStatus]): Option[StructType] = {
+    // This is a simple heuristic assuming all files have the same extension.
+    val fileFormat = detectFileFormat(files.head.getPath.toString)
+
+    fileFormat match {
+      case "parquet" => parquetFormat.inferSchema(sparkSession, options, files)
+      case "orc" => orcFormat.inferSchema(sparkSession, options, files)
+      case _ => throw new UnsupportedOperationException(s"File format $fileFormat is not supported.")
+    }
+  }
+
+  override def isSplitable(sparkSession: SparkSession, options: Map[String, String], path: Path): Boolean = {
+    false
+  }
+
+  // Used so that the planner only projects once and does not stack overflow
+  var isProjected = false
+
+  /**
+   * Support batch needs to remain consistent, even if one side of a bootstrap merge can support
+   * while the other side can't
+   */
+  private var supportBatchCalled = false
+  private var supportBatchResult = false
+
+  override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
+    if (!supportBatchCalled) {
+      supportBatchCalled = true
+      supportBatchResult =
+        !isMOR && parquetFormat.supportBatch(sparkSession, schema) && orcFormat.supportBatch(sparkSession, schema)
+    }
+    supportBatchResult
+  }
+
+  override def prepareWrite(sparkSession: SparkSession,
+                            job: Job,
+                            options: Map[String, String],
+                            dataSchema: StructType): OutputWriterFactory = {
+    throw new UnsupportedOperationException("Write operations are not supported in this example.")
+  }
+
+  override def buildReaderWithPartitionValues(sparkSession: SparkSession,
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
codope commented on code in PR #9761:
URL: https://github.com/apache/hudi/pull/9761#discussion_r1372535019

## hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java:

@@ -2763,10 +2762,6 @@ private void validateMetadata(HoodieJavaWriteClient testClient, Option i
     // Metadata table is MOR
     assertEquals(metadataMetaClient.getTableType(), HoodieTableType.MERGE_ON_READ, "Metadata Table should be MOR");
 
-    // Metadata table is HFile format
-    assertEquals(metadataMetaClient.getTableConfig().getBaseFileFormat(), HoodieFileFormat.HFILE,
-        "Metadata Table base file format should be HFile");
-
     // Metadata table has a fixed number of partitions

Review Comment:
   Going forward we'll have to remove this check, as we can have multiple file formats even in the metadata table once we support certain secondary indexes in formats other than HFile. This check also did not add much value anyway.
Re: [I] [SUPPORT] HoodieCompaction with schema parse NullPointerException [hudi]
zyclove commented on issue #9902:
URL: https://github.com/apache/hudi/issues/9902#issuecomment-1780347759

In addition, when submitting a task with spark-submit, besides adding the configuration in code or specifying a configuration file, can the configuration be added dynamically at submission time? @ad1happy2go
Re: [I] [SUPPORT] HoodieCompaction with schema parse NullPointerException [hudi]
zyclove commented on issue #9902:
URL: https://github.com/apache/hudi/issues/9902#issuecomment-1780345033

[hoodie.avro.schema.external.transformation](https://hudi.apache.org/docs/configurations#hoodieavroschemaexternaltransformation)

Check the Hudi code to see whether you can set this configuration to true:

```java
public static final ConfigProperty AVRO_EXTERNAL_SCHEMA_TRANSFORMATION_ENABLE = ConfigProperty
    .key(AVRO_SCHEMA_STRING.key() + ".external.transformation")
    .defaultValue("false")
    .withAlternatives(AVRO_SCHEMA_STRING.key() + ".externalTransformation")
    .markAdvanced()
    .withDocumentation("When enabled, records in older schema are rewritten into newer schema during upsert, delete and background"
        + " compaction, clustering operations.");
```
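If it helps, one way to try that config out is via a standard Spark datasource write; a sketch under assumptions, not a verified fix for this issue. The table path, record key field, and input source are placeholders, and the option key is derived from the config definition quoted above (`"hoodie.avro.schema" + ".external.transformation"`):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class ExternalSchemaTransformationExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hudi-example").getOrCreate();
    Dataset<Row> df = spark.read().format("parquet").load("/tmp/source");  // placeholder input

    df.write().format("hudi")
        // Rewrite records in the older schema into the newer one during
        // upsert/delete and background compaction/clustering operations.
        .option("hoodie.avro.schema.external.transformation", "true")
        .option("hoodie.table.name", "my_table")                  // placeholder
        .option("hoodie.datasource.write.recordkey.field", "id")  // placeholder
        .mode(SaveMode.Append)
        .save("/tmp/hudi/my_table");                              // placeholder path
  }
}
```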
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
danny0405 commented on code in PR #9761: URL: https://github.com/apache/hudi/pull/9761#discussion_r1372503591 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java: ## @@ -602,10 +601,6 @@ private void runFullValidation(HoodieMetadataConfig metadataConfig, // Metadata table is MOR assertEquals(metadataMetaClient.getTableType(), HoodieTableType.MERGE_ON_READ, "Metadata Table should be MOR"); Review Comment: Why remove the check? ## hudi-client/hudi-java-client/src/test/java/org/apache/hudi/client/TestJavaHoodieBackedMetadata.java: ## @@ -2763,10 +2762,6 @@ private void validateMetadata(HoodieJavaWriteClient testClient, Option i // Metadata table is MOR assertEquals(metadataMetaClient.getTableType(), HoodieTableType.MERGE_ON_READ, "Metadata Table should be MOR"); - // Metadata table is HFile format - assertEquals(metadataMetaClient.getTableConfig().getBaseFileFormat(), HoodieFileFormat.HFILE, - "Metadata Table base file format should be HFile"); - // Metadata table has a fixed number of partitions Review Comment: Why remove this check ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala: ## @@ -456,7 +456,7 @@ trait ProvidesHoodieConfig extends Logging { hiveSyncConfig.setValue(HiveSyncConfigHolder.HIVE_SYNC_ENABLED.key, enableHive.toString) hiveSyncConfig.setValue(HiveSyncConfigHolder.HIVE_SYNC_MODE.key, props.getString(HiveSyncConfigHolder.HIVE_SYNC_MODE.key, HiveSyncMode.HMS.name())) hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_BASE_PATH, hoodieCatalogTable.tableLocation) -hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT, hoodieCatalogTable.baseFileFormat) +hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT, props.getString(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT.key, HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT.defaultValue)) hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_DATABASE_NAME, hoodieCatalogTable.table.identifier.database.getOrElse("default")) Review Comment: Do we have function regression if user does not provide the option `HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT`? ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/HoodieMultipleBaseFileFormat.scala: ## @@ -0,0 +1,278 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.mapreduce.Job +import org.apache.hudi.DataSourceReadOptions.{REALTIME_PAYLOAD_COMBINE_OPT_VAL, REALTIME_SKIP_MERGE_OPT_VAL} +import org.apache.hudi.MergeOnReadSnapshotRelation.createPartitionedFile +import org.apache.hudi.common.fs.FSUtils +import org.apache.hudi.common.model.{FileSlice, HoodieLogFile} +import org.apache.hudi.{HoodieBaseRelation, HoodieTableSchema, HoodieTableState, LogFileIterator, MergeOnReadSnapshotRelation, PartitionFileSliceMapping, RecordMergingFileIterator, SparkAdapterSupport} +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.sql.HoodieCatalystExpressionUtils.generateUnsafeProjection +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.JoinedRow +import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat +import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat +import org.apache.spark.sql.sources.Filter +import org.apache.spark.sql.types.{StructField, StructType} +import org.apache.spark.util.SerializableConfiguration + +import scala.collection.mutable +import scala.jdk.CollectionConverters.asScalaIteratorConverter + +/** + * File format that supports reading multiple base file formats in a table. + */ +class HoodieMultipleBaseFileFormat(tableState: Broadcast[HoodieTableState], + tableSchema: Broadcast[HoodieTableSchema], +
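To make the `META_SYNC_BASE_FILE_FORMAT` question above concrete, here is a standalone sketch of the behavioral difference when the user leaves the option unset; the key string and format names are stand-ins for illustration, not Hudi APIs.
```java
import java.util.HashMap;
import java.util.Map;

public class SyncFormatFallbackSketch {
  public static void main(String[] args) {
    String tableBaseFileFormat = "ORC";          // what the catalog table actually uses
    Map<String, String> props = new HashMap<>(); // user did not set the sync option

    // Old behavior in the diff: always propagate the table's own format.
    String oldBehavior = tableBaseFileFormat;
    // New behavior: take the configured value, else the config default.
    String newBehavior = props.getOrDefault("meta.sync.base.file.format", "PARQUET");

    // Old: ORC. New: PARQUET. This divergence when the option is unset is
    // the potential regression the review comment is asking about.
    System.out.println(oldBehavior + " vs " + newBehavior);
  }
}
```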
Re: [I] [SUPPORT] HoodieCompaction with schema parse NullPointerException [hudi]
zyclove commented on issue #9902: URL: https://github.com/apache/hudi/issues/9902#issuecomment-1780338605 @ad1happy2go In another task, after upgrading to version 0.14, field incompatibility issues were reported. Can the table be restored without rebuilding it? For example, through the Schema Evolution feature?
```
Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: org.apache.avro.AvroRuntimeException: cannot support rewrite value for schema type: "long" since the old schema type is: "string"
    at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:387)
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:369)
    at org.apache.hudi.table.action.deltacommit.BaseSparkDeltaCommitActionExecutor.handleUpdate(BaseSparkDeltaCommitActionExecutor.java:79)
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:335)
    ... 28 more
Caused by: org.apache.hudi.exception.HoodieException: org.apache.avro.AvroRuntimeException: cannot support rewrite value for schema type: "long" since the old schema type is: "string"
    at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:75)
    at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:147)
    ... 32 more
Caused by: org.apache.avro.AvroRuntimeException: cannot support rewrite value for schema type: "long" since the old schema type is: "string"
    at org.apache.hudi.avro.HoodieAvroUtils.rewritePrimaryTypeWithDiffSchemaType(HoodieAvroUtils.java:1083)
    at org.apache.hudi.avro.HoodieAvroUtils.rewritePrimaryType(HoodieAvroUtils.java:1001)
    at org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchemaInternal(HoodieAvroUtils.java:946)
    at org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:873)
    at org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchemaInternal(HoodieAvroUtils.java:944)
    at org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:873)
    at org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchemaInternal(HoodieAvroUtils.java:902)
    at org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:873)
    at org.apache.hudi.avro.HoodieAvroUtils.rewriteRecordWithNewSchema(HoodieAvroUtils.java:843)
    at org.apache.hudi.common.model.HoodieAvroIndexedRecord.rewriteRecordWithNewSchema(HoodieAvroIndexedRecord.java:123)
    at org.apache.hudi.table.action.commit.HoodieMergeHelper.lambda$composeSchemaEvolutionTransformer$2(HoodieMergeHelper.java:209)
    at org.apache.hudi.table.action.commit.HoodieMergeHelper.lambda$runMerge$0(HoodieMergeHelper.java:134)
    at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:68)
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
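For reference only, and not a confirmed fix for the `long`-vs-`string` incompatibility above: Hudi exposes a schema-on-read switch, `hoodie.schema.on.read.enable`, for its comprehensive schema evolution. A minimal sketch of toggling it, with a hypothetical table name:
```java
import org.apache.spark.sql.SparkSession;

public class SchemaOnReadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("schema-on-read-sketch")
        .master("local[*]")
        .getOrCreate();

    // Enable comprehensive schema evolution on read for this session.
    spark.sql("set hoodie.schema.on.read.enable=true");
    // Whether this resolves the specific type incompatibility reported
    // above is not verified here; the table name is hypothetical.
    spark.sql("select * from hudi_table_changed").show();
  }
}
```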
[jira] [Updated] (HUDI-6949) Spark support non-blocking concurrency control
[ https://issues.apache.org/jira/browse/HUDI-6949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6949: - Labels: pull-request-available (was: ) > Spark support non-blocking concurrency control > -- > > Key: HUDI-6949 > URL: https://issues.apache.org/jira/browse/HUDI-6949 > Project: Apache Hudi > Issue Type: New Feature > Components: spark, spark-sql >Reporter: Jing Zhang >Assignee: Jing Zhang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-6949] Spark support non-blocking concurrency control [hudi]
beyond1920 opened a new pull request, #9921: URL: https://github.com/apache/hudi/pull/9921 ### Change Logs This PR aims to support non-blocking concurrency control for Spark jobs. ### Impact NA ### Risk level (write none, low medium or high below) NA ### Documentation Update NA ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
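The PR body does not state any user-facing configuration, so the following sketch is an assumption based on Hudi's existing `hoodie.write.concurrency.mode` setting; the mode value, table name, and paths are unverified for the Spark support added here.
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class NonBlockingConcurrencySketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("nbcc-sketch")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> df = spark.read().format("parquet").load("/tmp/updates"); // hypothetical

    df.write()
        .format("hudi")
        .option("hoodie.table.name", "nbcc_demo")                          // hypothetical
        .option("hoodie.datasource.write.operation", "upsert")
        // Assumed knob: the concurrency-mode config with a non-blocking value;
        // check the PR/docs for the exact Spark requirements (e.g. index type).
        .option("hoodie.write.concurrency.mode", "NON_BLOCKING_CONCURRENCY_CONTROL")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/nbcc_demo");                                      // hypothetical
  }
}
```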
[jira] [Created] (HUDI-6988) Query failure for 0.14.0
Lin Liu created HUDI-6988: - Summary: Query failure for 0.14.0 Key: HUDI-6988 URL: https://issues.apache.org/jira/browse/HUDI-6988 Project: Apache Hudi Issue Type: Sub-task Reporter: Lin Liu {code:java} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 1054 (run at AccessController.java:0) has failed the maximum allowable number of times: 4. Most recent failure reason:org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 259 partition 31 at org.apache.spark.MapOutputTracker$.validateStatus(MapOutputTracker.scala:1705) at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$10(MapOutputTracker.scala:1652) at org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$10$adapted(MapOutputTracker.scala:1651) at scala.collection.Iterator.foreach(Iterator.scala:943)at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1651) at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorIdImpl(MapOutputTracker.scala:1294) at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:1256) at org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:140) at org.apache.spark.shuffle.ShuffleManager.getReader(ShuffleManager.scala:63) at org.apache.spark.shuffle.ShuffleManager.getReader$(ShuffleManager.scala:57) at org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:73) at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:208) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:138) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2798) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2798) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1995) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3048) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2993) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2982) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.checkNoFailures(AdaptiveExecutor.scala:154) at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.doRun(AdaptiveExecutor.scala:88) at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.tryRunningAndGetFuture(AdaptiveExecutor.scala:66) at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.execute(AdaptiveExecutor.scala:57) at
Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]
ksmou commented on PR #9911: URL: https://github.com/apache/hudi/pull/9911#issuecomment-1780305248 > I see some test failures: > > ```java > testUpsertsCOWContinuousMode{HoodieRecordType}[1] Time elapsed: 396.414 s <<< ERROR! > ``` > > Not sure whether it is related. It's not related; I tested it locally and it passed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Schema evolution copy [hudi]
hudi-bot commented on PR #9920: URL: https://github.com/apache/hudi/pull/9920#issuecomment-1780282165 ## CI report: * f98cbcb16737a88891703baeee15f5a6bd73e784 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20493) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780282051 ## CI report: * 6972591365be4bde76c7b41dc5122c63ffd18c79 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20495) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]
danny0405 commented on PR #9911: URL: https://github.com/apache/hudi/pull/9911#issuecomment-1780282030 I see some test failures: ```java testUpsertsCOWContinuousMode{HoodieRecordType}[1] Time elapsed: 396.414 s <<< ERROR! ``` Not sure whether it is related. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6975] Optimize the code of DayBasedCompactionStrategy [hudi]
danny0405 commented on code in PR #9911: URL: https://github.com/apache/hudi/pull/9911#discussion_r1371363530 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/strategy/DayBasedCompactionStrategy.java: ## @@ -63,21 +60,9 @@ public Comparator getComparator() { return comparator; } - @Override - public List orderAndFilter(HoodieWriteConfig writeConfig, - List operations, List pendingCompactionPlans) { -// Iterate through the operations and accept operations as long as we are within the configured target partitions -// limit -return operations.stream() - .collect(Collectors.groupingBy(HoodieCompactionOperation::getPartitionPath)).entrySet().stream() - .sorted(Map.Entry.comparingByKey(comparator)).limit(writeConfig.getTargetPartitionsPerDayBasedCompaction()) -.flatMap(e -> e.getValue().stream()).collect(Collectors.toList()); - } - @Override public List filterPartitionPaths(HoodieWriteConfig writeConfig, List allPartitionPaths) { -return allPartitionPaths.stream().map(partition -> partition.replace("/", "-")) -.sorted(Comparator.reverseOrder()).map(partitionPath -> partitionPath.replace("-", "/")) +return allPartitionPaths.stream().sorted(comparator) .collect(Collectors.toList()).subList(0, Math.min(allPartitionPaths.size(), Review Comment: Okay, got it. Caution: you have changed the comparator of the partitions; does that introduce any potential regressions? Can we add some tests to cover it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
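To illustrate what this question is probing, a self-contained sketch comparing the removed ordering with a direct sort, under the assumption that the strategy's comparator behaves like reverse ordering on the raw path; the partition values are made up.
```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class PartitionOrderingSketch {
  public static void main(String[] args) {
    List<String> partitions = Arrays.asList("2023/10/25", "2023/10/24", "2023/09/30");

    // Old behavior from the removed code: swap '/' for '-', reverse-sort, swap back.
    List<String> oldOrder = partitions.stream()
        .map(p -> p.replace("/", "-"))
        .sorted(Comparator.reverseOrder())
        .map(p -> p.replace("-", "/"))
        .collect(Collectors.toList());

    // New behavior: sort directly with the strategy's comparator
    // (assumed here to be reverse ordering on the raw path).
    List<String> newOrder = partitions.stream()
        .sorted(Comparator.<String>reverseOrder())
        .collect(Collectors.toList());

    // For date-shaped paths the two orders agree. Paths containing characters
    // that compare differently relative to '/' (0x2F) vs '-' (0x2D), e.g. '.'
    // (0x2E), could order differently, which is the kind of regression a test
    // would pin down.
    System.out.println(oldOrder);
    System.out.println(newOrder);
  }
}
```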
[jira] [Updated] (HUDI-6987) Support partition pruning with functional index
[ https://issues.apache.org/jira/browse/HUDI-6987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6987: -- Fix Version/s: 1.0.0 > Support partition pruning with functional index > --- > > Key: HUDI-6987 > URL: https://issues.apache.org/jira/browse/HUDI-6987 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Priority: Major > Fix For: 1.0.0 > > > Current implementation can do data skipping if functional index exists. The > same can be leveraged for partition pruning if the function is on partition > field. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6987) Support partition pruning with functional index
Sagar Sumit created HUDI-6987: - Summary: Support partition pruning with functional index Key: HUDI-6987 URL: https://issues.apache.org/jira/browse/HUDI-6987 Project: Apache Hudi Issue Type: Task Reporter: Sagar Sumit Current implementation can do data skipping if functional index exists. The same can be leveraged for partition pruning if the function is on partition field. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6986) Refactor new FileFormat implementations
Sagar Sumit created HUDI-6986: - Summary: Refactor new FileFormat implementations Key: HUDI-6986 URL: https://issues.apache.org/jira/browse/HUDI-6986 Project: Apache Hudi Issue Type: Improvement Reporter: Sagar Sumit Fix For: 1.0.0 * Rename `NewHoodieParquetFileFormat` * Remove duplication between `NewHoodieParquetFileFormat`, `HoodieFileGroupReaderBasedFileFormat` and `HoodieMultipleBaseFileFormat` * `HoodieSparkFormatUtils` should be usable irrespective of whether one is using new file format or not. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [I] [SUPPORT] Executor executes action [commits the instant 20230916074105355] error [hudi]
gtk96 commented on issue #9732: URL: https://github.com/apache/hudi/issues/9732#issuecomment-1780260311 > @gtk96 Were you able to confirm. Can we close this issue. Hi @ad1happy2go, our current version is 0.13 and has not been upgraded, so I can't verify this. If you have confirmed the fix, please close the issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1372442062 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java: ## @@ -126,12 +128,13 @@ protected Option doProcessNextDataRecord(T record, // Merge and store the combined record // Note that the incoming `record` is from an older commit, so it should be put as // the `older` in the merge API + HoodieRecord combinedRecord = (HoodieRecord) recordMerger.merge( - readerContext.constructHoodieRecord(Option.of(record), metadata, readerSchema), - readerSchema, + readerContext.constructHoodieRecord(Option.of(record), metadata), + (Schema) metadata.get(INTERNAL_META_SCHEMA), readerContext.constructHoodieRecord( - existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight(), readerSchema), - readerSchema, + existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight()), + (Schema) existingRecordMetadataPair.getRight().get(INTERNAL_META_SCHEMA), payloadProps).get().getLeft(); Review Comment: To clarify, for reading log files, the reader schema is fetched from the header. Here we're doing record-level merging. Depending on the log file from which the records come, the schema could be different. However, the reference to the schema is the same, as the schema instance is passed from the log reader. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780214298 ## CI report: * 085a8583eb56ff4b8d3afa3636c657b11d0db92f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20491) * 6972591365be4bde76c7b41dc5122c63ffd18c79 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20495) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1780214192 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 661b16906d31d259be3fac4707478bd71eb6f9a4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20494) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-6985) Cannot find complete timestamp
[ https://issues.apache.org/jira/browse/HUDI-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu reassigned HUDI-6985: - Assignee: Lin Liu > Cannot find complete timestamp > -- > > Key: HUDI-6985 > URL: https://issues.apache.org/jira/browse/HUDI-6985 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > > {code:java} > Caused by: java.lang.IllegalArgumentException: Completion time should not be > empty at > org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42) > at > org.apache.hudi.common.table.timeline.HoodieInstant.getCompleteFileName(HoodieInstant.java:263) > at > org.apache.hudi.common.table.timeline.HoodieInstant.getFileName(HoodieInstant.java:297) > at > org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantFileName(HoodieActiveTimeline.java:344) > at > org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:351) > at > org.apache.hudi.metadata.HoodieTableMetadataUtil.getRollbackedCommits(HoodieTableMetadataUtil.java:1372) > at > org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$getValidInstantTimestamps$38(HoodieTableMetadataUtil.java:1300) > at > java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) > at > java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177) > at > java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) > at > java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > at > java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > at > java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) > at > java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) > at > java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497) > at > org.apache.hudi.metadata.HoodieTableMetadataUtil.getValidInstantTimestamps(HoodieTableMetadataUtil.java:1299) > at > org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:476) > at > org.apache.hudi.metadata.HoodieBackedTableMetadata.openReaders(HoodieBackedTableMetadata.java:432) > at > org.apache.hudi.metadata.HoodieBackedTableMetadata.getOrCreateReaders(HoodieBackedTableMetadata.java:417) > at > org.apache.hudi.metadata.HoodieBackedTableMetadata.lookupKeysFromFileSlice(HoodieBackedTableMetadata.java:294) > at > org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:258) > at > org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:148) > at > org.apache.hudi.metadata.BaseTableMetadata.fetchAllPartitionPaths(BaseTableMetadata.java:316) > at > org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:125) > ... 61 more > at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?] > at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?] > at > com.microsoft.lst_bench.common.LSTBenchmarkExecutor.checkResults(LSTBenchmarkExecutor.java:165) > [lst-bench-0.1-SNAPSHOT.jar:?] at > com.microsoft.lst_bench.common.LSTBenchmarkExecutor.execute(LSTBenchmarkExecutor.java:121) > [lst-bench-0.1-SNAPSHOT.jar:?] 
at > com.microsoft.lst_bench.Driver.main(Driver.java:147) > [lst-bench-0.1-SNAPSHOT.jar:?]Caused by: java.sql.SQLException: > org.apache.hive.service.cli.HiveSQLException: Error running query: > org.apache.hudi.exception.HoodieException: Error fetching partition paths > from metadata table at > org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:44) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:325) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:230) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79) > at >
[jira] [Created] (HUDI-6985) Cannot find complete timestamp
Lin Liu created HUDI-6985: - Summary: Cannot find complete timestamp Key: HUDI-6985 URL: https://issues.apache.org/jira/browse/HUDI-6985 Project: Apache Hudi Issue Type: Sub-task Reporter: Lin Liu {code:java} Caused by: java.lang.IllegalArgumentException: Completion time should not be empty at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42) at org.apache.hudi.common.table.timeline.HoodieInstant.getCompleteFileName(HoodieInstant.java:263) at org.apache.hudi.common.table.timeline.HoodieInstant.getFileName(HoodieInstant.java:297) at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantFileName(HoodieActiveTimeline.java:344) at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.getInstantDetails(HoodieActiveTimeline.java:351) at org.apache.hudi.metadata.HoodieTableMetadataUtil.getRollbackedCommits(HoodieTableMetadataUtil.java:1372) at org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$getValidInstantTimestamps$38(HoodieTableMetadataUtil.java:1300) at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177) at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497) at org.apache.hudi.metadata.HoodieTableMetadataUtil.getValidInstantTimestamps(HoodieTableMetadataUtil.java:1299) at org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:476) at org.apache.hudi.metadata.HoodieBackedTableMetadata.openReaders(HoodieBackedTableMetadata.java:432) at org.apache.hudi.metadata.HoodieBackedTableMetadata.getOrCreateReaders(HoodieBackedTableMetadata.java:417) at org.apache.hudi.metadata.HoodieBackedTableMetadata.lookupKeysFromFileSlice(HoodieBackedTableMetadata.java:294) at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:258) at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:148) at org.apache.hudi.metadata.BaseTableMetadata.fetchAllPartitionPaths(BaseTableMetadata.java:316) at org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:125) ... 61 more at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:?] at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?] at com.microsoft.lst_bench.common.LSTBenchmarkExecutor.checkResults(LSTBenchmarkExecutor.java:165) [lst-bench-0.1-SNAPSHOT.jar:?] at com.microsoft.lst_bench.common.LSTBenchmarkExecutor.execute(LSTBenchmarkExecutor.java:121) [lst-bench-0.1-SNAPSHOT.jar:?] 
at com.microsoft.lst_bench.Driver.main(Driver.java:147) [lst-bench-0.1-SNAPSHOT.jar:?]Caused by: java.sql.SQLException: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.hudi.exception.HoodieException: Error fetching partition paths from metadata table at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:44) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:325) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:230) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79) at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:230) at
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780208856 ## CI report: * 085a8583eb56ff4b8d3afa3636c657b11d0db92f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20491) * 6972591365be4bde76c7b41dc5122c63ffd18c79 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1780208689 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * f98cbcb16737a88891703baeee15f5a6bd73e784 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20485) * 661b16906d31d259be3fac4707478bd71eb6f9a4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (HUDI-6793) Support time-travel read in engine-agnostic FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779694#comment-17779694 ] Ethan Guo edited comment on HUDI-6793 at 10/25/23 11:11 PM: This works after adding MOR snapshot query support with the new Hoodie parquet file format using new file group reader: HUDI-6786. was (Author: guoyihua): This works after adding MOR snapshot query support with the new Hoodie parquet file format using new file group reader. > Support time-travel read in engine-agnostic FileGroupReader > --- > > Key: HUDI-6793 > URL: https://issues.apache.org/jira/browse/HUDI-6793 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-6793) Support time-travel read in engine-agnostic FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo resolved HUDI-6793. - > Support time-travel read in engine-agnostic FileGroupReader > --- > > Key: HUDI-6793 > URL: https://issues.apache.org/jira/browse/HUDI-6793 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6793) Support time-travel read in engine-agnostic FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-6793. --- Resolution: Fixed > Support time-travel read in engine-agnostic FileGroupReader > --- > > Key: HUDI-6793 > URL: https://issues.apache.org/jira/browse/HUDI-6793 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6793) Support time-travel read in engine-agnostic FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779694#comment-17779694 ] Ethan Guo commented on HUDI-6793: - This works after adding MOR snapshot query support with the new Hoodie parquet file format using new file group reader. > Support time-travel read in engine-agnostic FileGroupReader > --- > > Key: HUDI-6793 > URL: https://issues.apache.org/jira/browse/HUDI-6793 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Lin Liu >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6973) Instantiate HoodieFileGroupRecordBuffer inside new file group reader
[ https://issues.apache.org/jira/browse/HUDI-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-6973. --- Resolution: Fixed > Instantiate HoodieFileGroupRecordBuffer inside new file group reader > > > Key: HUDI-6973 > URL: https://issues.apache.org/jira/browse/HUDI-6973 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6800) Implement log writing with partial updates on the write path
[ https://issues.apache.org/jira/browse/HUDI-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-6800. --- Resolution: Fixed > Implement log writing with partial updates on the write path > > > Key: HUDI-6800 > URL: https://issues.apache.org/jira/browse/HUDI-6800 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] Schema evolution copy [hudi]
hudi-bot commented on PR #9920: URL: https://github.com/apache/hudi/pull/9920#issuecomment-1780169741 ## CI report: * f98cbcb16737a88891703baeee15f5a6bd73e784 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20493) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Schema evolution copy [hudi]
hudi-bot commented on PR #9920: URL: https://github.com/apache/hudi/pull/9920#issuecomment-1780162703 ## CI report: * f98cbcb16737a88891703baeee15f5a6bd73e784 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [HUDI-6800] Support writing partial updates to the data blocks in MOR tables (#9876)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 0ad4560f2a4 [HUDI-6800] Support writing partial updates to the data blocks in MOR tables (#9876) 0ad4560f2a4 is described below commit 0ad4560f2a4de00e43814b0d6cef2886a8a38155 Author: Y Ethan Guo AuthorDate: Wed Oct 25 15:24:26 2023 -0700 [HUDI-6800] Support writing partial updates to the data blocks in MOR tables (#9876) This commit adds the functionality to write partial updates to the data blocks in MOR tables, for Spark SQL MERGE INTO. --- .../org/apache/hudi/config/HoodieWriteConfig.java | 18 ++- .../org/apache/hudi/io/HoodieAppendHandle.java | 18 ++- .../java/org/apache/hudi/io/HoodieWriteHandle.java | 2 +- .../common/table/log/block/HoodieLogBlock.java | 2 +- .../org/apache/hudi/common/util/ConfigUtils.java | 20 +-- .../scala/org/apache/hudi/DataSourceOptions.scala | 9 ++ .../hudi/command/MergeIntoHoodieTableCommand.scala | 147 +++-- .../hudi/command/payload/ExpressionPayload.scala | 20 ++- .../apache/spark/sql/hudi/TestMergeIntoTable.scala | 12 +- .../spark/sql/hudi/TestMergeIntoTable2.scala | 6 + .../sql/hudi/TestPartialUpdateForMergeInto.scala | 83 ++-- 11 files changed, 268 insertions(+), 69 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java index 8c08beaaef9..cc3876338cc 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java @@ -33,6 +33,7 @@ import org.apache.hudi.common.config.HoodieMetaserverConfig; import org.apache.hudi.common.config.HoodieReaderConfig; import org.apache.hudi.common.config.HoodieStorageConfig; import org.apache.hudi.common.config.HoodieTableServiceManagerConfig; +import org.apache.hudi.common.config.HoodieTimeGeneratorConfig; import org.apache.hudi.common.config.TypedProperties; import org.apache.hudi.common.engine.EngineType; import org.apache.hudi.common.fs.ConsistencyGuardConfig; @@ -50,7 +51,6 @@ import org.apache.hudi.common.model.WriteOperationType; import org.apache.hudi.common.table.HoodieTableConfig; import org.apache.hudi.common.table.log.block.HoodieLogBlock; import org.apache.hudi.common.table.marker.MarkerType; -import org.apache.hudi.common.config.HoodieTimeGeneratorConfig; import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion; import org.apache.hudi.common.table.view.FileSystemViewStorageConfig; import org.apache.hudi.common.util.ConfigUtils; @@ -756,6 +756,14 @@ public class HoodieWriteConfig extends HoodieConfig { .withDocumentation("Whether to write record positions to the block header for data blocks containing updates and delete blocks. " + "The record positions can be used to improve the performance of merging records from base and log files."); + public static final ConfigProperty WRITE_PARTIAL_UPDATE_SCHEMA = ConfigProperty + .key("hoodie.write.partial.update.schema") + .defaultValue("") + .markAdvanced() + .sinceVersion("1.0.0") + .withDocumentation("Avro schema of the partial updates. 
This is automatically set by the " + + "Hudi write client and user is not expected to manually change the value."); + /** * Config key with boolean value that indicates whether record being written during MERGE INTO Spark SQL * operation are already prepped. @@ -2072,6 +2080,14 @@ public class HoodieWriteConfig extends HoodieConfig { return getBoolean(WRITE_RECORD_POSITIONS); } + public boolean shouldWritePartialUpdates() { +return !StringUtils.isNullOrEmpty(getString(WRITE_PARTIAL_UPDATE_SCHEMA)); + } + + public String getPartialUpdateSchema() { +return getString(WRITE_PARTIAL_UPDATE_SCHEMA); + } + public double getParquetCompressionRatio() { return getDouble(HoodieStorageConfig.PARQUET_COMPRESSION_RATIO_FRACTION); } diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java index 4075541a750..cc1932ce27f 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java @@ -149,7 +149,14 @@ public class HoodieAppendHandle extends HoodieWriteHandle hoodieTable, String partitionPath, String fileId, Iterator> recordItr, TaskContextSupplier taskContextSupplier) { -super(config, instantTime,
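As context for this commit, a hedged sketch of the statement shape the partial-update write path serves: a Spark SQL MERGE INTO whose UPDATE clause assigns only a subset of columns. The table and column names below are hypothetical, not taken from the patch.
```java
import org.apache.spark.sql.SparkSession;

public class PartialUpdateMergeSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("partial-update-merge")
        .master("local[*]")
        .getOrCreate();

    spark.sql(
        "MERGE INTO hudi_mor_target t "
            + "USING price_updates s ON t.id = s.id "
            // Only `price` and `ts` are updated; the other columns of the
            // matched row are untouched, which is what makes the update partial.
            + "WHEN MATCHED THEN UPDATE SET t.price = s.price, t.ts = s.ts");
  }
}
```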
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua merged PR #9876: URL: https://github.com/apache/hudi/pull/9876 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] org.apache.hudi.exception.HoodieRollbackException: Failed to rollback [hudi]
Armelabdelkbir commented on issue #9213: URL: https://github.com/apache/hudi/issues/9213#issuecomment-1780128455 @ad1happy2go it has been working well for me over the last few months; sorry for the late response. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] Schema evolution copy [hudi]
jonvex opened a new pull request, #9920: URL: https://github.com/apache/hudi/pull/9920 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6984) query64 failed.
Lin Liu created HUDI-6984: - Summary: query64 failed. Key: HUDI-6984 URL: https://issues.apache.org/jira/browse/HUDI-6984 Project: Apache Hudi Issue Type: Sub-task Reporter: Lin Liu Assignee: Lin Liu {code:java} [hadoop@ip-10-0-112-196 lst-bench]$ 2023-10-25T21:52:19,829 ERROR [pool-2-thread-1] common.LSTBenchmarkExecutor: Exception executing statement: query64.sql_02023-10-25T21:52:19,829 ERROR [pool-2-thread-1] common.LSTBenchmarkExecutor: Exception executing file: query64.sql2023-10-25T21:52:19,830 ERROR [pool-2-thread-1] common.LSTBenchmarkExecutor: Exception executing task: single_user_02023-10-25T21:52:19,834 ERROR [pool-2-thread-1] common.LSTBenchmarkExecutor: Exception executing session: 02023-10-25T21:52:19,834 WARN [main] common.LSTBenchmarkExecutor: Thread did not finish correctlyjava.util.concurrent.ExecutionException: java.sql.SQLException: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 851 in stage 3093.0 failed 4 times, most recent failure: Lost task 851.3 in stage 3093.0 (TID 666996) (ip-10-0-103-0.us-west-2.compute.internal executor 7): ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Executor Process LostDriver stacktrace: at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:43) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:325) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:230) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79) at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:230) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:225) at java.base/java.security.AccessController.doPrivileged(Native Method) at java.base/javax.security.auth.Subject.doAs(Subject.java:423) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:239) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829)Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 851 in stage 3093.0 failed 4 times, most recent failure: Lost task 851.3 in stage 3093.0 (TID 666996) (ip-10-0-103-0.us-west-2.compute.internal executor 7): ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Executor Process LostDriver stacktrace:at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2798) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2798) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1239) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1239) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1239) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3051) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2993) at
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780111072 ## CI report: * 085a8583eb56ff4b8d3afa3636c657b11d0db92f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20491) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780101745 ## CI report: * 57481f626caf8864def8394c57316535fa490b90 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20490) * 085a8583eb56ff4b8d3afa3636c657b11d0db92f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1780092114 ## CI report: * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * eb5b62e94807c1b2b6942402b117fe9dc57d425b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20487) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]
yihua commented on code in PR #9894: URL: https://github.com/apache/hudi/pull/9894#discussion_r1372349415 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java: ## @@ -119,21 +124,36 @@ protected Option doProcessNextDataRecord(T record, Map metadata, Pair, Map> existingRecordMetadataPair) throws IOException { if (existingRecordMetadataPair != null) { - // Merge and store the combined record - // Note that the incoming `record` is from an older commit, so it should be put as - // the `older` in the merge API - HoodieRecord combinedRecord = (HoodieRecord) recordMerger.merge( - readerContext.constructHoodieRecord(Option.of(record), metadata, readerSchema), - readerSchema, - readerContext.constructHoodieRecord( - existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight(), readerSchema), - readerSchema, - payloadProps).get().getLeft(); - // If pre-combine returns existing record, no need to update it - if (combinedRecord.getData() != existingRecordMetadataPair.getLeft().get()) { -return Option.of(combinedRecord.getData()); + switch (recordMergeMode) { +case OVERWRITE_WITH_LATEST: + return Option.empty(); +case EVENT_TIME_ORDERING: + Comparable incomingOrderingValue = readerContext.getOrderingValue( + Option.of(record), metadata, readerSchema, payloadProps); + Comparable existingOrderingValue = readerContext.getOrderingValue( + existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight(), readerSchema, payloadProps); + if (incomingOrderingValue.compareTo(existingOrderingValue) > 0) { +return Option.of(record); + } + return Option.empty(); Review Comment: Yes, `existingRecordMetadataPair` should be in the log record mapping. The convention here is that, if `Option.empty()` is returned from this method, the log record of the same record key in the mapping should not be updated, to avoid the `readerContext.seal` call:
```
@Override
public void processNextDataRecord(T record, Map<String, Object> metadata, Object recordKey) throws IOException {
  Pair<Option<T>, Map<String, Object>> existingRecordMetadataPair = records.get(recordKey);
  Option<T> mergedRecord = doProcessNextDataRecord(record, metadata, existingRecordMetadataPair);
  if (mergedRecord.isPresent()) {
    records.put(recordKey, Pair.of(Option.ofNullable(readerContext.seal(mergedRecord.get())), metadata));
  }
}
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
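A standalone restatement of that convention (a sketch with stand-in types, not the Hudi classes): an empty merge result leaves the buffered entry untouched, so the seal/copy is skipped for the losing record.
```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class RecordBufferSketch {
  static final Map<Object, String> RECORDS = new HashMap<>();

  static void processNextDataRecord(String record, Object key) {
    String existing = RECORDS.get(key);
    Optional<String> merged = doMerge(record, existing);
    // Only touch the buffer when the merge actually produced a new winner.
    merged.ifPresent(r -> RECORDS.put(key, seal(r)));
  }

  static Optional<String> doMerge(String incoming, String existing) {
    // Stand-in merge policy: keep the existing record when present.
    return existing == null ? Optional.of(incoming) : Optional.empty();
  }

  static String seal(String record) {
    return record; // stand-in for readerContext.seal(...)
  }

  public static void main(String[] args) {
    processNextDataRecord("a", 1);
    processNextDataRecord("b", 1); // merge returns empty -> buffer unchanged
    System.out.println(RECORDS);   // prints {1=a}
  }
}
```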
Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]
yihua commented on code in PR #9894: URL: https://github.com/apache/hudi/pull/9894#discussion_r1372348212 ## hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java: ## @@ -1276,5 +1291,35 @@ public HoodieTableMetaClient initTable(Configuration configuration, String baseP throws IOException { return HoodieTableMetaClient.initTableAndGetMetaClient(configuration, basePath, build()); } + +private void validateMergeConfigs() { + boolean payloadClassNameSet = null != payloadClassName; + boolean payloadTypeSet = null != payloadType; + boolean recordMergerStrategySet = null != recordMergerStrategy; + boolean recordMergeModeSet = null != recordMergeMode; + + checkArgument(recordMergeModeSet, + "Record merge mode " + HoodieTableConfig.RECORD_MERGE_MODE.key() + " should be set"); Review Comment: This is mandatory in the table config and during table upgrade, the merge mode should inferred from either the payload class name / type or record merger strategy. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]
codope commented on code in PR #9894: URL: https://github.com/apache/hudi/pull/9894#discussion_r1372317061

## hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:

```diff
@@ -1276,5 +1291,35 @@
 public HoodieTableMetaClient initTable(Configuration configuration, String basePath)
     throws IOException {
   return HoodieTableMetaClient.initTableAndGetMetaClient(configuration, basePath, build());
 }
+
+private void validateMergeConfigs() {
+  boolean payloadClassNameSet = null != payloadClassName;
+  boolean payloadTypeSet = null != payloadType;
+  boolean recordMergerStrategySet = null != recordMergerStrategy;
+  boolean recordMergeModeSet = null != recordMergeMode;
+
+  checkArgument(recordMergeModeSet,
+      "Record merge mode " + HoodieTableConfig.RECORD_MERGE_MODE.key() + " should be set");
```

Review Comment:
Is it a mandatory config? How will it affect users upgrading to the new version?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]
codope commented on code in PR #9894: URL: https://github.com/apache/hudi/pull/9894#discussion_r1372317061

## hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:

```diff
@@ -1276,5 +1291,35 @@
 public HoodieTableMetaClient initTable(Configuration configuration, String basePath)
     throws IOException {
   return HoodieTableMetaClient.initTableAndGetMetaClient(configuration, basePath, build());
 }
+
+private void validateMergeConfigs() {
+  boolean payloadClassNameSet = null != payloadClassName;
+  boolean payloadTypeSet = null != payloadType;
+  boolean recordMergerStrategySet = null != recordMergerStrategy;
+  boolean recordMergeModeSet = null != recordMergeMode;
+
+  checkArgument(recordMergeModeSet,
+      "Record merge mode " + HoodieTableConfig.RECORD_MERGE_MODE.key() + " should be set");
```

Review Comment:
Is it a mandatory config? How will it affect users upgrading to the new version?

## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java:

```diff
@@ -119,21 +124,36 @@
 protected Option<T> doProcessNextDataRecord(T record,
                                             Map<String, Object> metadata,
                                             Pair<Option<T>, Map<String, Object>> existingRecordMetadataPair) throws IOException {
   if (existingRecordMetadataPair != null) {
-    // Merge and store the combined record
-    // Note that the incoming `record` is from an older commit, so it should be put as
-    // the `older` in the merge API
-    HoodieRecord combinedRecord = (HoodieRecord) recordMerger.merge(
-        readerContext.constructHoodieRecord(Option.of(record), metadata, readerSchema),
-        readerSchema,
-        readerContext.constructHoodieRecord(
-            existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight(), readerSchema),
-        readerSchema,
-        payloadProps).get().getLeft();
-    // If pre-combine returns existing record, no need to update it
-    if (combinedRecord.getData() != existingRecordMetadataPair.getLeft().get()) {
-      return Option.of(combinedRecord.getData());
+    switch (recordMergeMode) {
+      case OVERWRITE_WITH_LATEST:
+        return Option.empty();
+      case EVENT_TIME_ORDERING:
+        Comparable incomingOrderingValue = readerContext.getOrderingValue(
+            Option.of(record), metadata, readerSchema, payloadProps);
+        Comparable existingOrderingValue = readerContext.getOrderingValue(
+            existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight(), readerSchema, payloadProps);
+        if (incomingOrderingValue.compareTo(existingOrderingValue) > 0) {
+          return Option.of(record);
+        }
+        return Option.empty();
```

Review Comment:
Why empty? Should it not be from `existingRecordMetadataPair`?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780038520 ## CI report: * 57481f626caf8864def8394c57316535fa490b90 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20490) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1372307509

## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecord.java:

```diff
@@ -195,6 +206,10 @@
 public HoodieKey getKey() {
   return key;
 }
+
+public boolean isPartial() {
+  return isPartial;
```

Review Comment:
I removed all changes to `HoodieRecord` and its subclasses. Now whether a record is partial or not is determined by the schema attached, which is per log file. Checking whether a schema is partial or not also leverages a cache (see `SparkPartialMergingUtils`), so there is no overhead.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
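A minimal sketch of the kind of cached partial-schema check described above (the class name, cache key, and field-list representation are assumptions for illustration; this is not the actual `SparkPartialMergingUtils` code):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: memoizes "is this log-file schema missing fields of the table schema?"
class PartialSchemaCheckSketch {
  private static final Map<String, Boolean> IS_PARTIAL_CACHE = new ConcurrentHashMap<>();

  // A schema is "partial" if it lacks some of the full table schema's fields.
  static boolean isPartial(List<String> fileSchemaFields, List<String> tableSchemaFields) {
    // Cache key derived from the field list; real code could key on the schema's hash or full name.
    String key = String.join(",", fileSchemaFields);
    return IS_PARTIAL_CACHE.computeIfAbsent(key,
        k -> !fileSchemaFields.containsAll(tableSchemaFields));
  }

  public static void main(String[] args) {
    List<String> table = List.of("id", "ts", "name", "city");
    System.out.println(isPartial(List.of("id", "ts", "city"), table)); // true: "name" missing
    System.out.println(isPartial(List.of("id", "ts", "name", "city"), table)); // false: full schema
  }
}
```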
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
hudi-bot commented on PR #9883: URL: https://github.com/apache/hudi/pull/9883#issuecomment-1780026429 ## CI report: * 985e9f099aff341d7d0cec4384ef82b7dcdd4de8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20469) * 57481f626caf8864def8394c57316535fa490b90 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1372304952

## hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieReaderContext.java:

```diff
@@ -67,6 +70,7 @@ public abstract class HoodieReaderContext {
  * file.
  *
  * @param filePath  {@link Path} instance of a file.
+ * @param isLogFile Whether this is a log file.
  * @param start     Starting byte to start reading.
```

Review Comment:
Fixed.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]
hudi-bot commented on PR #9894: URL: https://github.com/apache/hudi/pull/9894#issuecomment-1779951472 ## CI report: * be208c2f40cdf7e82abc2d1627bf21f7ad509f71 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20489) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]
hudi-bot commented on PR #9894: URL: https://github.com/apache/hudi/pull/9894#issuecomment-1779872587 ## CI report: * 75e98fe81be61e02f30d41d798ea86b733a26e2a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20448) * be208c2f40cdf7e82abc2d1627bf21f7ad509f71 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20489) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
hudi-bot commented on PR #9761: URL: https://github.com/apache/hudi/pull/9761#issuecomment-1779872241 ## CI report: * 4ec731d4168128cc93e3be5d7f6c444aceacb970 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20484) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]
hudi-bot commented on PR #9894: URL: https://github.com/apache/hudi/pull/9894#issuecomment-1779858774 ## CI report: * 75e98fe81be61e02f30d41d798ea86b733a26e2a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20448) * be208c2f40cdf7e82abc2d1627bf21f7ad509f71 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1779844736 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * f98cbcb16737a88891703baeee15f5a6bd73e784 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20485) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] Add table name and range msg for streaming reads logs [hudi]
yihua merged PR #9912: URL: https://github.com/apache/hudi/pull/9912 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [MINOR] Add table name and range msg for streaming reads logs (#9912)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 250456f3fba  [MINOR] Add table name and range msg for streaming reads logs (#9912)
250456f3fba is described below

commit 250456f3fba70d35a0cc8445d143d187bd3abd7e
Author: zhuanshenbsj1 <34104400+zhuanshenb...@users.noreply.github.com>
AuthorDate: Thu Oct 26 02:06:24 2023 +0800

    [MINOR] Add table name and range msg for streaming reads logs (#9912)
---
 .../main/java/org/apache/hudi/common/table/log/InstantRange.java | 9 +++++++++
 .../org/apache/hudi/source/StreamReadMonitoringFunction.java     | 3 ++-
 2 files changed, 11 insertions(+), 1 deletion(-)

```diff
diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java b/hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java
index 6609ad085ef..96c7b0c0ddf 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/log/InstantRange.java
@@ -57,6 +57,15 @@ public abstract class InstantRange implements Serializable {
 
   public abstract boolean isInRange(String instant);
 
+  @Override
+  public String toString() {
+    return "InstantRange{"
+        + "startInstant='" + startInstant == null ? "null" : startInstant + '\''
+        + ", endInstant='" + endInstant == null ? "null" : endInstant + '\''
+        + ", rangeType='" + this.getClass().getSimpleName() + '\''
+        + '}';
+  }
+
   // -------------------------------------------------------------------------
   //  Inner Class
   // -------------------------------------------------------------------------
diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java
index 6f0fd9253e2..86e32fe5a0a 100644
--- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java
+++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java
@@ -226,9 +226,10 @@ public class StreamReadMonitoringFunction
     this.issuedOffset = result.getOffset();
     LOG.info("\n"
         + "------------------------------------------------\n"
+        + "-- table: {}\n"
         + "-- consumed to instant: {}\n"
         + "------------------------------------------------",
-        this.issuedInstant);
+        conf.getString(FlinkOptions.TABLE_NAME), this.issuedInstant);
   }
```

@Override
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1779778741 ## CI report: * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467) * eb5b62e94807c1b2b6942402b117fe9dc57d425b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20487) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
hudi-bot commented on PR #9888: URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779765638 ## CI report: * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN * 955944c19aa182a5231741fbf20888e517f6dafd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20486) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1779765466 ## CI report: * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467) * eb5b62e94807c1b2b6942402b117fe9dc57d425b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Control file sizing during FULL_RECORD bootstrap mode [hudi]
fenil25 commented on issue #9915: URL: https://github.com/apache/hudi/issues/9915#issuecomment-1779758399 Got it. Thanks @ad1happy2go. Are the bulk_insert and FULL_RECORD bootstrap modes the same then? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
hudi-bot commented on PR #9888: URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779751539 ## CI report: * 43fcb4679d5e5dd9dfa92390c7408a1797f5a7fb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20483) * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN * 955944c19aa182a5231741fbf20888e517f6dafd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20486) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1779751137 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 7c353cd134d555bf0adfb50a64f012b609e75308 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20463) * f98cbcb16737a88891703baeee15f5a6bd73e784 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20485) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Async Cleaner OOM / slowdown after creating a large Savepoint [hudi]
ehurheap commented on issue #9747: URL: https://github.com/apache/hudi/issues/9747#issuecomment-1779720173 Interesting. Thanks for the update @ad1happy2go. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
hudi-bot commented on PR #9888: URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779680860 ## CI report: * 43fcb4679d5e5dd9dfa92390c7408a1797f5a7fb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20483) * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN * 955944c19aa182a5231741fbf20888e517f6dafd UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6872] Simplify Out Of Box Schema Evolution Functionality [hudi]
hudi-bot commented on PR #9743: URL: https://github.com/apache/hudi/pull/9743#issuecomment-1779680453 ## CI report: * 097ef6176650413eef2a4c3581ca6e48ea43788f UNKNOWN * e32b58f7ce1880568566be0c8a6940ae2f3a1016 UNKNOWN * 7c353cd134d555bf0adfb50a64f012b609e75308 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20463) * f98cbcb16737a88891703baeee15f5a6bd73e784 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6790] Support incremental/CDC queries using HadoopFsRelation [hudi]
hudi-bot commented on PR #9888: URL: https://github.com/apache/hudi/pull/9888#issuecomment-1779667511 ## CI report: * 43fcb4679d5e5dd9dfa92390c7408a1797f5a7fb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20483) * 2501f4ca40591cd9b2d94b5c4daa360aa6454cef UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT]: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file [hudi]
Armelabdelkbir commented on issue #9918: URL: https://github.com/apache/hudi/issues/9918#issuecomment-1779660398 By "missing column", do you mean schema evolution? We sometimes have schema evolution, but not for this use case. Also, what is the impact of the upgrade on production? I have hundreds of tables and billions of rows; do I just need to upgrade the Hudi version and keep the same metadata folders? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT]: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file [hudi]
ad1happy2go commented on issue #9918: URL: https://github.com/apache/hudi/issues/9918#issuecomment-1779633050 @Armelabdelkbir I recommend you upgrade your Hudi version to 0.12.3, 0.13.1, or 0.14.0. It may happen due to a missing column in later records compared to previous ones. Do you have any such scenario? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
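For context, the failure mode described (a reader schema that requires a column the older files never wrote, with no default value) can be reproduced with a small self-contained Avro sketch; this is illustrative and not Hudi's exact code path:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class MissingColumnDemo {
  public static void main(String[] args) throws Exception {
    // Older records were written without the "city" column...
    Schema writerSchema = SchemaBuilder.record("rec").fields()
        .requiredString("id").endRecord();
    // ...but the current schema requires it, with no default value.
    Schema readerSchema = SchemaBuilder.record("rec").fields()
        .requiredString("id").requiredString("city").endRecord();

    GenericRecord rec = new GenericData.Record(writerSchema);
    rec.put("id", "1");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writerSchema).write(rec, enc);
    enc.flush();

    // Schema resolution fails here with an AvroTypeException:
    // "city" has no default, so Avro cannot fill it in for the old record.
    new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
        .read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
  }
}
```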
Re: [I] Hudi 0.13.1 compatibility issues with EMR-6.7.0 and EMR-6.11.1 [hudi]
ad1happy2go commented on issue #9919: URL: https://github.com/apache/hudi/issues/9919#issuecomment-1779622641 @Shubham21k I think you are using the wrong utilities bundle jar. There are two utilities jars: hudi-utilities-bundle (which also contains the hudi-spark-bundle classes) and hudi-utilities-slim-bundle. Can you try using hudi-utilities-slim-bundle together with the spark3.3/spark3.2 bundle jar? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
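For reference, an illustrative spark-submit invocation combining the two jars; the jar versions, paths, and source settings are placeholders to adapt to your Spark and Hudi versions:

```sh
# Slim utilities bundle as the application jar, matching Spark bundle on the classpath.
# Versions/paths below are examples only -- substitute your own.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars /path/to/hudi-spark3.3-bundle_2.12-0.13.1.jar \
  /path/to/hudi-utilities-slim-bundle_2.12-0.13.1.jar \
  --table-type COPY_ON_WRITE \
  --target-base-path s3://bucket/path \
  --target-table my_table
```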