[I] [SUPPORT] HoodieMultiTableDeltaStreamer does not work as expected [hudi]
nttq1sub opened a new issue, #10246: URL: https://github.com/apache/hudi/issues/10246 **_Tips before filing an issue_** - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? - Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org. - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly. **Describe the problem you faced** When I run HoodieMultiTableDeltaStreamer with spark-on-k8s-operator, I see it run, but only one table has data filled in. How should I configure it to run correctly with more than two tables? How does it work when run on a k8s cluster? Does one driver handle multiple tables, or one driver per table? Does it process the tables sequentially or in parallel on the driver? Thanks so much if anyone can explain these points for me. **To Reproduce** Steps to reproduce the behavior: 1. 2. 3. 4. **Expected behavior** A clear and concise description of what you expected to happen. **Environment Description** * Hudi version : 0.13.0 * Spark version : 3.2 * Hive version : 3.1.1 * Hadoop version : 2.3.0 * Storage (HDFS/S3/GCS..) : HDFS * Running on Docker? (yes/no) : Kubernetes **Additional context** Add any other context about the problem here. **Stacktrace** ```Add the stacktrace of the error.``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Error when overwrite and synchronize hive metastore with 0.14.0 [hudi]
xicm commented on issue #10170: URL: https://github.com/apache/hudi/issues/10170#issuecomment-1840178727 The infer function has been fixed, https://github.com/apache/hudi/pull/9816 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7078) Re-enable one test in TestNestedSchemaPruningOptimization
[ https://issues.apache.org/jira/browse/HUDI-7078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7078: - Labels: pull-request-available (was: ) > Re-enable one test in TestNestedSchemaPruningOptimization > - > > Key: HUDI-7078 > URL: https://issues.apache.org/jira/browse/HUDI-7078 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Labels: pull-request-available > > Currently "Test NestedSchemaPruning optimization unsuccessful" is disabled. > We need to triage the issue with new file format and file group reader and > re-enable it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7078] Re-enable TestNestedSchemaPruningOptimization [hudi]
linliu-code opened a new pull request, #10245: URL: https://github.com/apache/hudi/pull/10245 ### Change Logs Just try to re-enable the test. ### Impact Fixing the bugs. ### Risk level (write none, low medium or high below) LOW ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Error when overwrite and synchronize hive metastore with 0.14.0 [hudi]
xicm commented on issue #10170: URL: https://github.com/apache/hudi/issues/10170#issuecomment-1840157102 `META_SYNC_DATABASE_NAME` is inferred from `hoodie.database.name`. You can check whether the property `hoodie.table.database` in hoodie.properties is empty. Nice avatar. :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
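To illustrate the check xicm suggests, here is a minimal sketch (not Hudi code — a plain properties parser written for this example) that reads a `.hoodie/hoodie.properties` file and tests whether `hoodie.table.database` is empty, which is the condition under which the database name cannot be inferred for meta sync:

```python
import os
import tempfile

def read_table_config(path):
    """Parse a Java-style .properties file into a dict (skips comments and blanks)."""
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

# Sample hoodie.properties content where the database name was never set.
sample = """\
#Properties saved on Mon Dec 04 2023
hoodie.table.name=my_table
hoodie.table.type=COPY_ON_WRITE
hoodie.table.database=
"""

with tempfile.NamedTemporaryFile("w", suffix=".properties", delete=False) as f:
    f.write(sample)
    path = f.name

config = read_table_config(path)
# An empty hoodie.table.database means the meta sync database name cannot be
# inferred from it, matching the symptom discussed in this issue.
print(config.get("hoodie.table.database") == "")  # → True
os.remove(path)
```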
Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]
hudi-bot commented on PR #10167: URL: https://github.com/apache/hudi/pull/10167#issuecomment-1840147988 ## CI report: * 25ee0036413d10722d804e0c935162f2175d7934 UNKNOWN * bd88c58c4ca36bd39363637420cc018deb8bb056 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21197) * b36d6db6ce9750942c8265898f2a6d0e0fd0be6d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21302) * 82bf22c1772449bf32bdd9c98d72c273cd938487 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21306) * ce28d60f139e8baac53710b32774d38677abf37f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]
hudi-bot commented on PR #10167: URL: https://github.com/apache/hudi/pull/10167#issuecomment-1840139680 ## CI report: * 25ee0036413d10722d804e0c935162f2175d7934 UNKNOWN * bd88c58c4ca36bd39363637420cc018deb8bb056 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21197) * b36d6db6ce9750942c8265898f2a6d0e0fd0be6d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21302) * 82bf22c1772449bf32bdd9c98d72c273cd938487 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21306) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7172] Fix the timeline archiver to support concurrent writer [hudi]
hudi-bot commented on PR #10244: URL: https://github.com/apache/hudi/pull/10244#issuecomment-1840140053 ## CI report: * 2a078bb478c90005df6e65793b969a4a8765f13f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21303) * 3763acfffaf1ac4760865f11edeb1cf91a91942c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21308) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7171] Fix 'show partitions' not display rewritten partitions [hudi]
hudi-bot commented on PR #10242: URL: https://github.com/apache/hudi/pull/10242#issuecomment-1840139973 ## CI report: * ced3f383b16e16a2593c4c4cdd288e9a5de21a99 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21298) * b23c47f2029f3c9cbeecc3608c6dc00c2af684e9 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7140] [DNM] Trial Patch to test CI run [hudi]
hudi-bot commented on PR #10176: URL: https://github.com/apache/hudi/pull/10176#issuecomment-1840139738 ## CI report: * 5e6fbb9988501485e6b66964f8dff66e8f0d4e50 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21264) * 574d9561fdf35a76412a1f1d968b0588be2454f9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21307) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7172] Fix the timeline archiver to support concurrent writer [hudi]
hudi-bot commented on PR #10244: URL: https://github.com/apache/hudi/pull/10244#issuecomment-1840131352 ## CI report: * 2a078bb478c90005df6e65793b969a4a8765f13f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21303) * 3763acfffaf1ac4760865f11edeb1cf91a91942c UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7140] [DNM] Trial Patch to test CI run [hudi]
hudi-bot commented on PR #10176: URL: https://github.com/apache/hudi/pull/10176#issuecomment-1840131072 ## CI report: * 5e6fbb9988501485e6b66964f8dff66e8f0d4e50 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21264) * 574d9561fdf35a76412a1f1d968b0588be2454f9 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]
hudi-bot commented on PR #10167: URL: https://github.com/apache/hudi/pull/10167#issuecomment-1840131015 ## CI report: * 25ee0036413d10722d804e0c935162f2175d7934 UNKNOWN * bd88c58c4ca36bd39363637420cc018deb8bb056 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21197) * b36d6db6ce9750942c8265898f2a6d0e0fd0be6d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21302) * 82bf22c1772449bf32bdd9c98d72c273cd938487 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]
hudi-bot commented on PR #10144: URL: https://github.com/apache/hudi/pull/10144#issuecomment-1840130915 ## CI report: * 363470311395f04bdd0462bc058a9b25bd94bc9f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21294) * 3252779dc0eecf1c7b455125f6ca116d540efed9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21305) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7171] Fix 'show partitions' not display rewritten partitions [hudi]
wecharyu commented on PR #10242: URL: https://github.com/apache/hudi/pull/10242#issuecomment-1840126819 cc: @boneanxs @danny0405 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]
hudi-bot commented on PR #10167: URL: https://github.com/apache/hudi/pull/10167#issuecomment-1840123143 ## CI report: * 25ee0036413d10722d804e0c935162f2175d7934 UNKNOWN * bd88c58c4ca36bd39363637420cc018deb8bb056 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21197) * b36d6db6ce9750942c8265898f2a6d0e0fd0be6d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21302) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]
hudi-bot commented on PR #10144: URL: https://github.com/apache/hudi/pull/10144#issuecomment-1840123043 ## CI report: * 363470311395f04bdd0462bc058a9b25bd94bc9f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21294) * 3252779dc0eecf1c7b455125f6ca116d540efed9 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7159]Check the table type between hoodie.properies and table options [hudi]
hehuiyuan commented on code in PR #10209: URL: https://github.com/apache/hudi/pull/10209#discussion_r1414953701 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java: ## @@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception { // case2: empty table without data files Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath()); +conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ"); Review Comment: > Hmm, maybe we just fix the table type as to be in line with the hoodie.properties when there is inconsistency instead of throwing, WDYT ? Hi @danny0405 , it's OK. But when an inconsistency occurs, users may not be aware of it. If you recommend this approach, I will fix the table type to be in line with hoodie.properties. ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java: ## @@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception { // case2: empty table without data files Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath()); +conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ"); Review Comment: It's OK. But when an inconsistency occurs, users may not be aware of it. If you recommend this approach, I will fix the table type to be in line with hoodie.properties. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch master updated: [MINOR] Fixing view manager reuse with Embedded timeline server (#10240)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 70a2064525a [MINOR] Fixing view manager reuse with Embedded timeline server (#10240) 70a2064525a is described below commit 70a2064525a26abc57a33c019da2ccb520182ef5 Author: Sivabalan Narayanan AuthorDate: Mon Dec 4 22:45:39 2023 -0800 [MINOR] Fixing view manager reuse with Embedded timeline server (#10240) --- .../java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java index 5432e9b34ef..b89b5cdfa11 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java @@ -182,7 +182,7 @@ public class EmbeddedTimelineService { this.serviceConfig = timelineServiceConfBuilder.build(); server = timelineServiceCreator.create(context, hadoopConf.newCopy(), serviceConfig, -FSUtils.getFs(writeConfig.getBasePath(), hadoopConf.newCopy()), createViewManager()); +FSUtils.getFs(writeConfig.getBasePath(), hadoopConf.newCopy()), viewManager); serverPort = server.startService(); LOG.info("Started embedded timeline server at " + hostAddr + ":" + serverPort); }
Re: [PR] [MINOR] Fixing view manager reuse with Embedded timeline server [hudi]
codope merged PR #10240: URL: https://github.com/apache/hudi/pull/10240 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7159]Check the table type between hoodie.properies and table options [hudi]
hehuiyuan commented on code in PR #10209: URL: https://github.com/apache/hudi/pull/10209#discussion_r1414949932 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java: ## @@ -1020,6 +1020,7 @@ void testStreamReadEmptyTablePath() throws Exception { // case2: empty table without data files Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath()); +conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ"); Review Comment: It's OK. But when an inconsistency occurs, users may not be aware of it. If you recommend this approach, I will fix the table type to be in line with hoodie.properties. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
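The behavior discussed in this review thread — when the table type in the Flink table options disagrees with `hoodie.table.type` persisted in hoodie.properties, silently prefer hoodie.properties and warn rather than throw — can be sketched as below. This is an illustrative sketch, not the actual PR code; the function name is made up for this example, and only `hoodie.table.type` is a real Hudi property key:

```python
def resolve_table_type(option_type, hoodie_properties):
    """Resolve the effective table type: if the TABLE_TYPE option disagrees
    with the type persisted in hoodie.properties, prefer hoodie.properties
    and emit a warning instead of throwing."""
    persisted = hoodie_properties.get("hoodie.table.type")
    if persisted and persisted != option_type:
        print(f"WARN: table type '{option_type}' from table options is "
              f"overridden by '{persisted}' from hoodie.properties")
        return persisted
    return option_type

# Inconsistent: the persisted type wins and a warning is printed.
print(resolve_table_type("COPY_ON_WRITE", {"hoodie.table.type": "MERGE_ON_READ"}))
# → MERGE_ON_READ
```

The trade-off raised in the review still applies: fixing up the type silently keeps jobs running, but users may not notice the inconsistency unless they read the warning log.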
Re: [PR] [HUDI-7172] Fix the timeline archiver to support concurrent writer [hudi]
hudi-bot commented on PR #10244: URL: https://github.com/apache/hudi/pull/10244#issuecomment-1840083004 ## CI report: * 2a078bb478c90005df6e65793b969a4a8765f13f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21303) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7059] Hudi filter pushdown for positional merging [hudi]
hudi-bot commented on PR #10167: URL: https://github.com/apache/hudi/pull/10167#issuecomment-1840082759 ## CI report: * 25ee0036413d10722d804e0c935162f2175d7934 UNKNOWN * bd88c58c4ca36bd39363637420cc018deb8bb056 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21197) * b36d6db6ce9750942c8265898f2a6d0e0fd0be6d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7172] Fix the timeline archiver to support concurrent writer [hudi]
hudi-bot commented on PR #10244: URL: https://github.com/apache/hudi/pull/10244#issuecomment-1840076440 ## CI report: * 2a078bb478c90005df6e65793b969a4a8765f13f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]
linliu-code commented on code in PR #10144: URL: https://github.com/apache/hudi/pull/10144#discussion_r1414927204 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala: ## @@ -310,6 +317,15 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState, _: PartitionedFile => Iterator.empty } +// Note that for CDC reader, the underlying data schema is stored in the 'options' to separate from the CDC schema. +val rawDataSchemaStr = options.getOrElse(rawDataSchema, "") Review Comment: Changed the code to use table schema instead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] `CREATE TABLE ... USING hudi` DDL does not preserve partitioning order when syncing to AWS Glue [hudi]
ad1happy2go commented on issue #10182: URL: https://github.com/apache/hudi/issues/10182#issuecomment-1840059697 @sayanpaul-plaid I will look into it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore [hudi]
ad1happy2go commented on issue #10231: URL: https://github.com/apache/hudi/issues/10231#issuecomment-1840058081 Sure @soumilshah1995. let's connect tomorrow morning on the same. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Large gap between stages on read [hudi]
ad1happy2go commented on issue #10239: URL: https://github.com/apache/hudi/issues/10239#issuecomment-1840055732 @noahtaite Are you setting 'hoodie.metadata.enable' explicitly for the readers. It is by default disabled for the readers. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
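As ad1happy2go notes, `hoodie.metadata.enable` is disabled by default on the read path, so a reader must opt in explicitly. A minimal sketch of the reader options (the Spark call in the comment is shown but not executed here, since a Spark session and a real base path are not assumed):

```python
# Reader options for a Hudi query; 'hoodie.metadata.enable' must be set
# explicitly because it defaults to false for readers.
read_options = {
    "hoodie.metadata.enable": "true",  # opt in to metadata-table-based file listing
}

# With Spark these options would be applied roughly as:
#   spark.read.format("hudi").options(**read_options).load(base_path)
print(read_options["hoodie.metadata.enable"])
```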
[jira] [Updated] (HUDI-7172) Fix the timeline archiver to support concurrent writer
[ https://issues.apache.org/jira/browse/HUDI-7172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7172: - Labels: pull-request-available (was: ) > Fix the timeline archiver to support concurrent writer > -- > > Key: HUDI-7172 > URL: https://issues.apache.org/jira/browse/HUDI-7172 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7172] Fix the timeline archiver to support concurrent writer [hudi]
danny0405 opened a new pull request, #10244: URL: https://github.com/apache/hudi/pull/10244 ### Change Logs This is a regression of https://github.com/apache/hudi/pull/9209. ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-7172) Fix the timeline archiver to support concurrent writer
Danny Chen created HUDI-7172: Summary: Fix the timeline archiver to support concurrent writer Key: HUDI-7172 URL: https://issues.apache.org/jira/browse/HUDI-7172 Project: Apache Hudi Issue Type: Improvement Components: writer-core Reporter: Danny Chen Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [MINOR] Fixing integ test writer for commit time generation [hudi]
hudi-bot commented on PR #10243: URL: https://github.com/apache/hudi/pull/10243#issuecomment-1840022044 ## CI report: * 9492a5b78c6a28dde8b43c6a8e4053020cb11414 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21301) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] Fixing integ test writer for commit time generation [hudi]
hudi-bot commented on PR #10243: URL: https://github.com/apache/hudi/pull/10243#issuecomment-1840015651 ## CI report: * 9492a5b78c6a28dde8b43c6a8e4053020cb11414 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7171] Fix 'show partitions' not display rewritten partitions [hudi]
hudi-bot commented on PR #10242: URL: https://github.com/apache/hudi/pull/10242#issuecomment-1840015610 ## CI report: * ced3f383b16e16a2593c4c4cdd288e9a5de21a99 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21298) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6979) support EventTimeBasedCompactionStrategy
[ https://issues.apache.org/jira/browse/HUDI-6979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kong Wei updated HUDI-6979: --- Status: In Progress (was: Open) > support EventTimeBasedCompactionStrategy > > > Key: HUDI-6979 > URL: https://issues.apache.org/jira/browse/HUDI-6979 > Project: Apache Hudi > Issue Type: New Feature > Components: compaction >Reporter: Kong Wei >Assignee: Kong Wei >Priority: Major > > The current compaction strategies are based on the logfile size, the number > of logfiles, etc. The data time of the RO table generated by these > strategies is uncontrollable. Hudi also has a DayBased strategy, but it > relies on a day-based partition path and the time granularity is coarse. > The *EventTimeBasedCompactionStrategy* strategy can generate event > time-friendly RO tables, whether the table uses day-based partitioning or not. For > example, the strategy can select all logfiles whose data time is before 3 am > for compaction, so that the generated RO table data is before 3 am. If we > just want to query data before 3 am, we can query only the RO table, which is > much faster. > With this strategy, I think we can expand the application scenarios of RO > tables. -- This message was sent by Atlassian Jira (v8.20.10#820010)
(hudi) branch master updated: [HUDI-6980] Fixing closing of write client on failure scenarios (#10224)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 0ccd621b258 [HUDI-6980] Fixing closing of write client on failure scenarios (#10224) 0ccd621b258 is described below commit 0ccd621b2582e3d40811dd8b803f072747ffa5c9 Author: Sivabalan Narayanan AuthorDate: Mon Dec 4 20:20:34 2023 -0800 [HUDI-6980] Fixing closing of write client on failure scenarios (#10224) --- .../org/apache/hudi/HoodieSparkSqlWriter.scala | 33 ++ .../timeline/service/handlers/MarkerHandler.java | 4 +-- 2 files changed, 24 insertions(+), 13 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala index 7c4ec8a71e7..bab0448642c 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala @@ -365,7 +365,7 @@ class HoodieSparkSqlWriterInternal { } } - val (writeResult, writeClient: SparkRDDWriteClient[_]) = + val (writeResult: HoodieWriteResult, writeClient: SparkRDDWriteClient[_]) = operation match { case WriteOperationType.DELETE | WriteOperationType.DELETE_PREPPED => mayBeValidateParamsForAutoGenerationOfRecordKeys(parameters, hoodieConfig) @@ -509,9 +509,16 @@ class HoodieSparkSqlWriterInternal { hoodieRecords } client.startCommitWithTime(instantTime, commitActionType) -val writeResult = DataSourceUtils.doWriteOperation(client, dedupedHoodieRecords, instantTime, operation, - preppedSparkSqlWrites || preppedWriteOperation) -(writeResult, client) +try { + val writeResult = DataSourceUtils.doWriteOperation(client, dedupedHoodieRecords, instantTime, operation, +preppedSparkSqlWrites || 
preppedWriteOperation) + (writeResult, client) +} catch { + case e: HoodieException => +// close the write client in all cases +handleWriteClientClosure(client, tableConfig, parameters, jsc.hadoopConfiguration()) +throw e +} } // Check for errors and commit the write. @@ -524,17 +531,21 @@ class HoodieSparkSqlWriterInternal { (writeSuccessful, common.util.Option.ofNullable(instantTime), compactionInstant, clusteringInstant, writeClient, tableConfig) } finally { -// close the write client in all cases -val asyncCompactionEnabled = isAsyncCompactionEnabled(writeClient, tableConfig, parameters, jsc.hadoopConfiguration()) -val asyncClusteringEnabled = isAsyncClusteringEnabled(writeClient, parameters) -if (!asyncCompactionEnabled && !asyncClusteringEnabled) { - log.info("Closing write client") - writeClient.close() -} +handleWriteClientClosure(writeClient, tableConfig, parameters, jsc.hadoopConfiguration()) } } } + private def handleWriteClientClosure(writeClient: SparkRDDWriteClient[_], tableConfig : HoodieTableConfig, parameters: Map[String, String], configuration: Configuration): Unit = { +// close the write client in all cases +val asyncCompactionEnabled = isAsyncCompactionEnabled(writeClient, tableConfig, parameters, configuration) +val asyncClusteringEnabled = isAsyncClusteringEnabled(writeClient, parameters) +if (!asyncCompactionEnabled && !asyncClusteringEnabled) { + log.warn("Closing write client") + writeClient.close() +} + } + def deduceOperation(hoodieConfig: HoodieConfig, paramsWithoutDefaults : Map[String, String], df: Dataset[Row]): WriteOperationType = { var operation = WriteOperationType.fromValue(hoodieConfig.getString(OPERATION)) // TODO clean up diff --git a/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java b/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java index 390a4e2184f..42e2f40e629 100644 --- 
a/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java +++ b/hudi-timeline-service/src/main/java/org/apache/hudi/timeline/service/handlers/MarkerHandler.java @@ -126,8 +126,8 @@ public class MarkerHandler extends Handler { if (dispatchingThreadFuture != null) { dispatchingThreadFuture.cancel(true); } -dispatchingExecutorService.shutdown(); -batchingExecutorService.shutdown(); +dispatchingExecutorService.shutdownNow(); +batchingExecutorService.shutdownNow(); } /**
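The last hunk above replaces `shutdown()` with `shutdownNow()` on the MarkerHandler's executor services. The difference is standard `java.util.concurrent` semantics: `shutdown()` lets already-submitted tasks run to completion, while `shutdownNow()` interrupts running tasks and drains the queue, which matters when a long-running dispatching task would otherwise block teardown. A minimal, self-contained illustration (not Hudi code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Demonstrates why shutdownNow() terminates promptly where shutdown() would
// wait for a long-running task to finish on its own.
public class ShutdownNowDemo {

    static boolean terminatesQuickly() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> {
            try {
                Thread.sleep(60_000); // simulates a long-running dispatching task
            } catch (InterruptedException e) {
                // shutdownNow() delivers this interrupt, letting the task exit promptly
                Thread.currentThread().interrupt();
            }
        });
        pool.shutdownNow(); // interrupts the sleeping task instead of waiting a minute
        try {
            return pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(terminatesQuickly()); // true: pool stops within seconds
    }
}
```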
Re: [PR] [HUDI-6980] Fixing closing of write client on failure scenarios [hudi]
nsivabalan merged PR #10224: URL: https://github.com/apache/hudi/pull/10224
(hudi) branch master updated (315924a3b6e -> e2b695abbdf)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 315924a3b6e [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree (#10226) add e2b695abbdf [HUDI-7100] Fixing insert overwrite operations with drop dups config (#10222) No new revisions were added by this update. Summary of changes: .../org/apache/hudi/HoodieSparkSqlWriter.scala | 2 +- .../apache/hudi/functional/TestCOWDataSource.scala | 78 ++ 2 files changed, 79 insertions(+), 1 deletion(-)
Re: [PR] [HUDI-7171] Fix 'show partitions' not display rewritten partitions [hudi]
hudi-bot commented on PR #10242: URL: https://github.com/apache/hudi/pull/10242#issuecomment-1839978754 ## CI report: * ced3f383b16e16a2593c4c4cdd288e9a5de21a99 UNKNOWN
Re: [PR] [HUDI-7100] Fixing insert overwrite operations with drop dups config [hudi]
nsivabalan merged PR #10222: URL: https://github.com/apache/hudi/pull/10222
Re: [PR] [MINOR] Fixing view manager reuse with Embedded timeline server [hudi]
hudi-bot commented on PR #10240: URL: https://github.com/apache/hudi/pull/10240#issuecomment-1839978716 ## CI report: * aa4b0228d2c9dba581dba3c5ec01f2893aa0b6ed Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21295)
Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]
hudi-bot commented on PR #10144: URL: https://github.com/apache/hudi/pull/10144#issuecomment-1839978473 ## CI report: * 363470311395f04bdd0462bc058a9b25bd94bc9f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21294)
[PR] [MINOR] Fixing integ test writer for commit time generation [hudi]
nsivabalan opened a new pull request, #10243: URL: https://github.com/apache/hudi/pull/10243 ### Change Logs Fixing integ test writer for commit time generation ### Impact Fixing integ test writer for commit time generation ### Risk level (write none, low medium or high below) low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Updated] (HUDI-7171) Fix 'show partitions' not display rewritten partitions
[ https://issues.apache.org/jira/browse/HUDI-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7171: - Labels: pull-request-available (was: ) > Fix 'show partitions' not display rewritten partitions > -- > > Key: HUDI-7171 > URL: https://issues.apache.org/jira/browse/HUDI-7171 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wechar >Assignee: Wechar >Priority: Major > Labels: pull-request-available > > The `show partitions` SQL cannot return the correct result in the following two cases: > # the dropped partitions should be displayed after they were recreated. > # after `insert overwrite` on a partitioned table, the partitions should be > marked as `dropped`
[PR] [HUDI-7171] Fix 'show partitions' not display rewritten partitions [hudi]
wecharyu opened a new pull request, #10242: URL: https://github.com/apache/hudi/pull/10242 ### Change Logs The `show partitions` SQL cannot return the correct result in the following two cases: 1. the dropped partitions should be displayed after they were recreated. 2. after `insert overwrite` on a partitioned table, the partitions should be marked as `dropped` ### Impact bug fix. ### Risk level (write none, low medium or high below) None. ### Documentation Update None. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
Re: [PR] [HUDI-5823][RFC-65] RFC for Partition Lifecycle Management [hudi]
stream2000 commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1414833910 ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,209 @@ +## Proposers + +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao + +## Approvers + +## Status + +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) + +## Abstract + +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period +of time. The outdated data is useless and costly, so we need a TTL (Time-To-Live) management mechanism to prevent the +dataset from growing infinitely. +This proposal introduces Partition TTL Management strategies to hudi; users can configure the strategies by table config +directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. + + +This proposal introduces a Partition TTL Management service to hudi. TTL management is like other table services such as Clean/Compaction/Clustering. +Users can configure their TTL strategies through write configs and Hudi will help users find expired partitions and delete them automatically. + +## Background + +A TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to +delete outdated partitions. However, users still need to detect which partitions are outdated and +call `delete_partition` manually, which means that users need to define and implement some kind of TTL strategy, find expired partitions and call `delete_partition` by themselves. As the scale of installations grows, it becomes increasingly important to implement a user-friendly TTL management mechanism for hudi. + +## Implementation + +Our main goals are as follows: + +* Provide an extensible framework for partition TTL management. +* Implement a simple KEEP_BY_TIME strategy, which can be executed through an independent Spark job, or synchronous or asynchronous table services.
+ +### Strategy Definition + +The TTL strategies are similar to existing table service strategies. We can define TTL strategies like defining a clustering/clean/compaction strategy: + +```properties +hoodie.partition.ttl.management.strategy=KEEP_BY_TIME +hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy +hoodie.partition.ttl.days.retain=10 +``` + +The config `hoodie.partition.ttl.management.strategy.class` provides a strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired partition paths to delete. And `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy`, which means that we will expire partitions that haven't been modified for that many days. We will cover the `KeepByTimeTTLManagementStrategy` strategy in detail in the next section. + +The core definition of `PartitionTTLManagementStrategy` looks like this: + +```java +/** + * Strategy for partition-level TTL management. + */ +public abstract class PartitionTTLManagementStrategy { + /** + * Get expired partition paths for a specific partition ttl management strategy. + * + * @return Expired partition paths. + */ + public abstract List<String> getExpiredPartitionPaths(); +} +``` + +Users can provide their own implementation of `PartitionTTLManagementStrategy` and Hudi will help delete the expired partitions. + +### KeepByTimeTTLManagementStrategy + +We will provide a strategy called `KeepByTimePartitionTTLManagementStrategy` in the first version of the partition TTL management implementation. + +The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partition. If the duration between now and the partition's `lastModifiedTime` is larger than the configured `hoodie.partition.ttl.days.retain`, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as an expired partition.
We use day as the unit of expired time since it is very commonly used for data lakes. Open to ideas for this. + +We will use the largest commit time of committed file groups in the partition as the partition's +`lastModifiedTime`. So any write (including normal DMLs, clustering etc.) with a larger instant time will change the partition's `lastModifiedTime`. Review Comment: Again, leveraging `.hoodie_partition_metadata` will bring a format change, and it doesn't support any kind of transaction currently. As discussed with @danny0405, in 1.0.0 and later versions, which support efficient completion time queries on the timeline (#9565), we will have a more elegant way to get the `lastCommitTime`. You can see the updated RFC for details. ## rfc/rfc-65/rfc-65.md: ## @@ -0,0 +1,209 @@ +## Proposers + +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao + +## Approvers + +## Status + +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-582
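The KEEP_BY_TIME behavior described in the RFC excerpt above can be sketched in a few lines. This is a hypothetical illustration — the class and method names are made up for the example and this is not the RFC's actual implementation:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

// Hypothetical sketch of the KEEP_BY_TIME idea: a partition expires when
// (now - lastModifiedTime) exceeds the configured
// hoodie.partition.ttl.days.retain window.
public class KeepByTimeTtlSketch {

    static List<String> getExpiredPartitionPaths(Map<String, Long> lastModifiedMillisByPartition,
                                                 long nowMillis,
                                                 int daysRetain) {
        long retainMillis = TimeUnit.DAYS.toMillis(daysRetain);
        return lastModifiedMillisByPartition.entrySet().stream()
                .filter(e -> nowMillis - e.getValue() > retainMillis) // older than the window
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long now = TimeUnit.DAYS.toMillis(100); // pretend "now" is day 100
        Map<String, Long> partitions = new LinkedHashMap<>();
        partitions.put("dt=day-80", TimeUnit.DAYS.toMillis(80));  // 20 days old -> expired
        partitions.put("dt=day-95", TimeUnit.DAYS.toMillis(95));  // 5 days old -> kept
        System.out.println(getExpiredPartitionPaths(partitions, now, 10)); // [dt=day-80]
    }
}
```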
[jira] [Created] (HUDI-7171) Fix 'show partitions' not display rewritten partitions
Wechar created HUDI-7171: Summary: Fix 'show partitions' not display rewritten partitions Key: HUDI-7171 URL: https://issues.apache.org/jira/browse/HUDI-7171 Project: Apache Hudi Issue Type: Bug Reporter: Wechar Assignee: Wechar The `show partitions` SQL cannot return the correct result in the following two cases: # the dropped partitions should be displayed after they were recreated. # after `insert overwrite` on a partitioned table, the partitions should be marked as `dropped`
Re: [I] [SUPPORT] `CREATE TABLE ... USING hudi` DDL does not preserve partitioning order when syncing to AWS Glue [hudi]
nfarah86 commented on issue #10182: URL: https://github.com/apache/hudi/issues/10182#issuecomment-1839954297 tagging @ad1happy2go
Re: [PR] [HUDI-7100] Fixing insert overwrite operations with drop dups config [hudi]
hudi-bot commented on PR #10222: URL: https://github.com/apache/hudi/pull/10222#issuecomment-1839922168 ## CI report: * 159b36a0f851c729e3ac7d690f2e0963dd17f85d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21293)
[jira] [Updated] (HUDI-7170) Implement HFile reader independent of HBase
[ https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7170: - Labels: pull-request-available (was: ) > Implement HFile reader independent of HBase > --- > > Key: HUDI-7170 > URL: https://issues.apache.org/jira/browse/HUDI-7170 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > > We'd like to provide our own implementation of an HFile reader which does not use > HBase dependencies. In the long term, we should also decouple the HFile > reader from Hadoop FileSystem abstractions.
[jira] [Updated] (HUDI-7170) Implement HFile reader independent of HBase
[ https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7170: Summary: Implement HFile reader independent of HBase (was: Add HFile reader independent of HBase) > Implement HFile reader independent of HBase > --- > > Key: HUDI-7170 > URL: https://issues.apache.org/jira/browse/HUDI-7170 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > > > We'd like to provide our own implementation of an HFile reader which does not use > HBase dependencies. In the long term, we should also decouple the HFile > reader from Hadoop FileSystem abstractions.
[PR] [HUDI-7170][WIP] Implement HFile reader independent of HBase [hudi]
yihua opened a new pull request, #10241: URL: https://github.com/apache/hudi/pull/10241 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-7170) Add HFile reader independent of HBase
Ethan Guo created HUDI-7170: --- Summary: Add HFile reader independent of HBase Key: HUDI-7170 URL: https://issues.apache.org/jira/browse/HUDI-7170 Project: Apache Hudi Issue Type: New Feature Reporter: Ethan Guo
[jira] [Updated] (HUDI-7170) Add HFile reader independent of HBase
[ https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7170: Description: We'd like to provide our own implementation of an HFile reader which does not use HBase dependencies. In the long term, we should also decouple the HFile reader from Hadoop FileSystem abstractions. > Add HFile reader independent of HBase > - > > Key: HUDI-7170 > URL: https://issues.apache.org/jira/browse/HUDI-7170 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > > > We'd like to provide our own implementation of an HFile reader which does not use > HBase dependencies. In the long term, we should also decouple the HFile > reader from Hadoop FileSystem abstractions.
[jira] [Assigned] (HUDI-7170) Add HFile reader independent of HBase
[ https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-7170: --- Assignee: Ethan Guo > Add HFile reader independent of HBase > - > > Key: HUDI-7170 > URL: https://issues.apache.org/jira/browse/HUDI-7170 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major >
[jira] [Updated] (HUDI-7170) Add HFile reader independent of HBase
[ https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7170: Fix Version/s: 1.0.0 > Add HFile reader independent of HBase > - > > Key: HUDI-7170 > URL: https://issues.apache.org/jira/browse/HUDI-7170 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > >
[jira] [Updated] (HUDI-7170) Add HFile reader independent of HBase
[ https://issues.apache.org/jira/browse/HUDI-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7170: Priority: Blocker (was: Major) > Add HFile reader independent of HBase > - > > Key: HUDI-7170 > URL: https://issues.apache.org/jira/browse/HUDI-7170 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > >
[jira] [Closed] (HUDI-7166) Provide a Procedure to Calculate Column Stats Overlap Degree
[ https://issues.apache.org/jira/browse/HUDI-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-7166. Resolution: Fixed Fixed via master branch: 315924a3b6e2430be1c5662bacb696c8deae > Provide a Procedure to Calculate Column Stats Overlap Degree > > > Key: HUDI-7166 > URL: https://issues.apache.org/jira/browse/HUDI-7166 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ma Jian >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > In HUDI-7110 , a tool has been made available to display column stats. > However, this tool is not very user-friendly for manual observation when > dealing with large data volumes. For instance, with tens of thousands of > parquet files, the number of rows in column stats could be of the order of > hundreds of thousands. This renders the data virtually unreadable to humans, > necessitating further processing by code. Yet, if an administrator simply > wishes to directly observe the data layout based on column stats under such > circumstances, a more intuitive display tool is required. Here, we offer a > tool that calculates the overlap degree of column stats based on partition > and column name. > > Overlap degree refers to the extent to which the min-max ranges of different > files intersect with each other. This directly affects the effectiveness of > data skipping. > > In fact, a similar concept is also provided by Snowflake to aid their > clustering process. > [https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions] > Our implementation here is not overly complex. > > It yields output similar to the following: > |Partition path|Field name|Average overlap|Maximum file overlap|Total file > number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number| | > |path|c8|1.33|2|2|1|1|1|1|3| | > This content provides a straightforward representation of the relevant > statistics. 
> > For example, consider three files: a.parquet, b.parquet, and c.parquet. > Taking an integer-type column 'id' as an example, the range (min-max) for 'a' > is 1–5, for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap > within the ranges 3–5 and 7. > If the filter conditions for 'id' during data skipping include these values, > multiple files will be filtered out. For a simpler case, if it's an equality > query, 2 files will be filtered within these ranges, and no more than one > file will be filtered in other cases (possibly outside of the range).
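The a/b/c example above can be checked mechanically. The sketch below is illustrative only (invented names, not the `ShowColumnStatsOverlapProcedure` code): it counts how many files' min-max ranges contain a given value, which is the quantity that determines how many files an equality filter fails to skip.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the "overlap degree" idea: for integer min-max ranges
// from column stats, count how many files' ranges contain a given point value.
public class ColumnStatsOverlapSketch {

    // A file's [min, max] range for one column, as recorded in column stats.
    static final class Range {
        final long min, max;
        Range(long min, long max) { this.min = min; this.max = max; }
        boolean contains(long v) { return min <= v && v <= max; }
    }

    // Number of files whose [min, max] range contains the point value.
    static int overlapAt(List<Range> ranges, long value) {
        int n = 0;
        for (Range r : ranges) {
            if (r.contains(value)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // The a/b/c parquet example from the issue: 'id' ranges 1-5, 3-7, 7-8.
        List<Range> ranges = Arrays.asList(new Range(1, 5), new Range(3, 7), new Range(7, 8));
        System.out.println(overlapAt(ranges, 4)); // 2: files a and b
        System.out.println(overlapAt(ranges, 7)); // 2: files b and c
        System.out.println(overlapAt(ranges, 2)); // 1: only file a
    }
}
```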
[jira] [Updated] (HUDI-7166) Provide a Procedure to Calculate Column Stats Overlap Degree
[ https://issues.apache.org/jira/browse/HUDI-7166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7166: - Fix Version/s: 1.0.0 > Provide a Procedure to Calculate Column Stats Overlap Degree > > > Key: HUDI-7166 > URL: https://issues.apache.org/jira/browse/HUDI-7166 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ma Jian >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > In HUDI-7110 , a tool has been made available to display column stats. > However, this tool is not very user-friendly for manual observation when > dealing with large data volumes. For instance, with tens of thousands of > parquet files, the number of rows in column stats could be of the order of > hundreds of thousands. This renders the data virtually unreadable to humans, > necessitating further processing by code. Yet, if an administrator simply > wishes to directly observe the data layout based on column stats under such > circumstances, a more intuitive display tool is required. Here, we offer a > tool that calculates the overlap degree of column stats based on partition > and column name. > > Overlap degree refers to the extent to which the min-max ranges of different > files intersect with each other. This directly affects the effectiveness of > data skipping. > > In fact, a similar concept is also provided by Snowflake to aid their > clustering process. > [https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions] > Our implementation here is not overly complex. > > It yields output similar to the following: > |Partition path|Field name|Average overlap|Maximum file overlap|Total file > number|50% overlap|75% overlap|95% overlap|99% overlap|Total value number| | > |path|c8|1.33|2|2|1|1|1|1|3| | > This content provides a straightforward representation of the relevant > statistics. > > For example, consider three files: a.parquet, b.parquet, and c.parquet. 
> Taking an integer-type column 'id' as an example, the range (min-max) for 'a' > is 1–5, for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap > within the ranges 3–5 and 7. > If the filter conditions for 'id' during data skipping include these values, > multiple files will be filtered out. For a simpler case, if it's an equality > query, 2 files will be filtered within these ranges, and no more than one > file will be filtered in other cases (possibly outside of the range).
(hudi) branch master updated (92fc0c09192 -> 315924a3b6e)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 92fc0c09192 [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events (#10230) add 315924a3b6e [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree (#10226) No new revisions were added by this update. Summary of changes: .../hudi/metadata/HoodieTableMetadataUtil.java | 28 ++ .../hudi/command/procedures/HoodieProcedures.scala | 1 + .../ShowColumnStatsOverlapProcedure.scala | 338 + .../sql/hudi/procedure/TestMetadataProcedure.scala | 57 4 files changed, 424 insertions(+) create mode 100644 hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
danny0405 merged PR #10226: URL: https://github.com/apache/hudi/pull/10226
(hudi) branch master updated: [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events (#10230)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 92fc0c09192 [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events (#10230)

92fc0c09192 is described below

commit 92fc0c0919278b6e43a7c45b92c80be7a39525ec
Author: ksmou <135721692+ks...@users.noreply.github.com>
AuthorDate: Tue Dec 5 10:29:29 2023 +0800

    [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events (#10230)
---
 .../sink/TestStreamWriteOperatorCoordinator.java | 38 ++
 1 file changed, 38 insertions(+)

```diff
diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java
index 0f3d1947128..5cbe9899b8d 100644
--- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java
+++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java
@@ -19,7 +19,9 @@ package org.apache.hudi.sink;
 
 import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.heartbeat.HoodieHeartbeatClient;
 import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.fs.HoodieWrapperFileSystem;
 import org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy;
 import org.apache.hudi.common.model.HoodieWriteStat;
 import org.apache.hudi.common.model.WriteConcurrencyMode;
@@ -65,7 +67,9 @@ import static org.hamcrest.CoreMatchers.is;
 import static org.hamcrest.CoreMatchers.startsWith;
 import static org.hamcrest.MatcherAssert.assertThat;
 import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
+import static org.junit.jupiter.api.Assertions.assertFalse;
 import static org.junit.jupiter.api.Assertions.assertNotEquals;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
 import static org.junit.jupiter.api.Assertions.assertNull;
 import static org.junit.jupiter.api.Assertions.assertTrue;
@@ -185,6 +189,40 @@ public class TestStreamWriteOperatorCoordinator {
     assertThat("Recommits the instant with partial uncommitted events", lastCompleted, is(instant));
   }
 
+  @Test
+  public void testStopHeartbeatForUncommittedEventWithLazyCleanPolicy() throws Exception {
+    // reset
+    reset();
+    // override the default configuration
+    Configuration conf = TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+    conf.setString(HoodieCleanConfig.FAILED_WRITES_CLEANER_POLICY.key(), HoodieFailedWritesCleaningPolicy.LAZY.name());
+    OperatorCoordinator.Context context = new MockOperatorCoordinatorContext(new OperatorID(), 1);
+    coordinator = new StreamWriteOperatorCoordinator(conf, context);
+    coordinator.start();
+    coordinator.setExecutor(new MockCoordinatorExecutor(context));
+
+    assertTrue(coordinator.getWriteClient().getConfig().getFailedWritesCleanPolicy().isLazy());
+
+    final WriteMetadataEvent event0 = WriteMetadataEvent.emptyBootstrap(0);
+
+    // start one instant and not commit it
+    coordinator.handleEventFromOperator(0, event0);
+    String instant = coordinator.getInstant();
+    HoodieHeartbeatClient heartbeatClient = coordinator.getWriteClient().getHeartbeatClient();
+    assertNotNull(heartbeatClient.getHeartbeat(instant), "Heartbeat is missing");
+
+    String basePath = tempFile.getAbsolutePath();
+    HoodieWrapperFileSystem fs = coordinator.getWriteClient().getHoodieTable().getMetaClient().getFs();
+
+    assertTrue(HoodieHeartbeatClient.heartbeatExists(fs, basePath, instant), "Heartbeat is existed");
+
+    // send bootstrap event to stop the heartbeat for this instant
+    WriteMetadataEvent event1 = WriteMetadataEvent.emptyBootstrap(0);
+    coordinator.handleEventFromOperator(0, event1);
+
+    assertFalse(HoodieHeartbeatClient.heartbeatExists(fs, basePath, instant),
+        "Heartbeat is stopped and cleared");
+  }
+
   @Test
   public void testRecommitWithLazyFailedWritesCleanPolicy() {
     coordinator.getWriteClient().getConfig().setValue(HoodieCleanConfig.FAILED_WRITES_CLEANER_POLICY, HoodieFailedWritesCleaningPolicy.LAZY.name());
```
Re: [PR] [HUDI-7165][FOLLOW-UP] Add test case for stopping heartbeat for un-committed events [hudi]
danny0405 merged PR #10230: URL: https://github.com/apache/hudi/pull/10230 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [MINOR] Fixing view manager reuse with Embedded timeline server [hudi]
hudi-bot commented on PR #10240: URL: https://github.com/apache/hudi/pull/10240#issuecomment-1839837203

## CI report:

* aa4b0228d2c9dba581dba3c5ec01f2893aa0b6ed Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21295)

Bot commands:
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]
hudi-bot commented on PR #10144: URL: https://github.com/apache/hudi/pull/10144#issuecomment-1839836771

## CI report:

* 1027df0a1aa63ef976ce5ba4494af252ff8faed0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21281)
* 363470311395f04bdd0462bc058a9b25bd94bc9f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21294)
Re: [PR] [MINOR] Fixing view manager reuse with Embedded timeline server [hudi]
hudi-bot commented on PR #10240: URL: https://github.com/apache/hudi/pull/10240#issuecomment-1839829542

## CI report:

* aa4b0228d2c9dba581dba3c5ec01f2893aa0b6ed UNKNOWN
Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]
hudi-bot commented on PR #10144: URL: https://github.com/apache/hudi/pull/10144#issuecomment-1839829264

## CI report:

* 1027df0a1aa63ef976ce5ba4494af252ff8faed0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21281)
* 363470311395f04bdd0462bc058a9b25bd94bc9f UNKNOWN
Re: [PR] [HUDI-7100] Fixing insert overwrite operations with drop dups config [hudi]
hudi-bot commented on PR #10222: URL: https://github.com/apache/hudi/pull/10222#issuecomment-1839823227

## CI report:

* b53a22922751b4744c96c07666bcb5ba13e2cb60 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21256)
* 159b36a0f851c729e3ac7d690f2e0963dd17f85d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21293)
[PR] [MINOR] Fixing view manager reuse with Embedded timeline server [hudi]
nsivabalan opened a new pull request, #10240: URL: https://github.com/apache/hudi/pull/10240

### Change Logs

Fixing view manager reuse with Embedded timeline server

### Impact

Fixing view manager reuse with Embedded timeline server

### Risk level (write none, low medium or high below)

low

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_
- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]
linliu-code commented on code in PR #10144: URL: https://github.com/apache/hudi/pull/10144#discussion_r1414693685

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:

```diff
@@ -310,6 +317,15 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
       _: PartitionedFile => Iterator.empty
     }
 
+    // Note that for CDC reader, the underlying data schema is stored in the 'options' to separate from the CDC schema.
+    val rawDataSchemaStr = options.getOrElse(rawDataSchema, "")
```

Review Comment: Good question, we can read the schema from here directly.
Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]
linliu-code commented on code in PR #10144: URL: https://github.com/apache/hudi/pull/10144#discussion_r1414691276

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCDCFileIndex.scala:

```diff
@@ -42,29 +42,34 @@ class HoodieCDCFileIndex (override val spark: SparkSession,
   extends HoodieIncrementalFileIndex(
     spark, metaClient, schemaSpec, options, fileStatusCache, includeLogFiles, shouldEmbedFileSlices
   ) with FileIndex {
+  private val emptyPartitionPath: String = "empty_partition_path";
   val cdcRelation: CDCRelation = CDCRelation.getCDCRelation(spark.sqlContext, metaClient, options)
   val cdcExtractor: HoodieCDCExtractor = cdcRelation.cdcExtractor
 
   override def listFiles(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
-    val partitionToFileGroups = cdcExtractor.extractCDCFileSplits().asScala.groupBy(_._1.getPartitionPath).toSeq
-    partitionToFileGroups.map {
-      case (partitionPath, fileGroups) =>
-        val fileGroupIds: List[FileStatus] = fileGroups.map { fileGroup => {
-          // We create a fake FileStatus to wrap the information of HoodieFileGroupId, which are used
-          // later to retrieve the corresponding CDC file group splits.
-          val fileGroupId: HoodieFileGroupId = fileGroup._1
-          new FileStatus(0, true, 0, 0, 0,
-            0, null, "", "", null,
-            new Path(fileGroupId.getPartitionPath, fileGroupId.getFileId))
-        }}.toList
-        val partitionValues: InternalRow = new GenericInternalRow(doParsePartitionColumnValues(
-          metaClient.getTableConfig.getPartitionFields.get(), partitionPath).asInstanceOf[Array[Any]])
+    cdcExtractor.extractCDCFileSplits().asScala.map {
+      case (fileGroupId, fileSplits) =>
+        val partitionPath = if (fileGroupId.getPartitionPath.isEmpty) emptyPartitionPath else fileGroupId.getPartitionPath
```

Review Comment: Here we cannot use empty string since Line 63 requires the partition_path to be not empty.
Re: [PR] [HUDI-7100] Fixing insert overwrite operations with drop dups config [hudi]
hudi-bot commented on PR #10222: URL: https://github.com/apache/hudi/pull/10222#issuecomment-1839787728

## CI report:

* b53a22922751b4744c96c07666bcb5ba13e2cb60 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21256)
* 159b36a0f851c729e3ac7d690f2e0963dd17f85d UNKNOWN
Re: [PR] [HUDI-6980] Fixing closing of write client on failure scenarios [hudi]
hudi-bot commented on PR #10224: URL: https://github.com/apache/hudi/pull/10224#issuecomment-1839680195

## CI report:

* c537ff2d4e35cdb4e0f1086ee991d2f9cf53cfef Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21292)
Re: [I] [SUPPORT] Data loss in MOR table after clustering partition [hudi]
mzheng-plaid commented on issue #9977: URL: https://github.com/apache/hudi/issues/9977#issuecomment-1839667223

@ad1happy2go @codope I was able to reproduce with the following Spark code (5-row dataset). It seems the problem is related to handling of array fields in structs. Could you confirm if you're able to reproduce using this code?

```python
from pyspark.sql.types import StringType
from pyspark.sql import functions as F
from pyspark.sql import types as T
import uuid
from pyspark.sql import Row
import random

hudi_options = {
    "hoodie.table.name": "clustering_bug_test",
    "hoodie.datasource.write.recordkey.field": "id.value",
    "hoodie.datasource.write.partitionpath.field": "partition:SIMPLE",
    "hoodie.datasource.write.table.name": "clustering_bug_test",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "publishedAtUnixNano",
    "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.compaction.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    # Turn off small file optimizations
    "hoodie.parquet.small.file.limit": "0",
    # Turn off metadata table
    "hoodie.metadata.enable": "false",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.CustomKeyGenerator",
    # Hive style partitioning
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.cleaner.commits.retained": 1,
    "hoodie.bootstrap.index.enable": "false",
    "hoodie.commits.archival.batch": 5,
    # Bloom filter
    "hoodie.index.type": "BLOOM",
    "hoodie.bloom.index.prune.by.ranges": "false",
}

clustering_hudi_options = {
    **hudi_options,
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": 1,
    "hoodie.clustering.plan.strategy.small.file.limit": 256 * 1024 * 1024,
    "hoodie.clustering.plan.strategy.target.file.max.bytes": 512 * 1024 * 1024,
    "hoodie.clustering.plan.strategy.sort.columns": "id.value",
    "hoodie.clustering.plan.strategy.max.num.groups": 300,
}

random.seed(10)

dummy_data = [
    Row(
        id=Row(value=str(uuid.uuid4())),
        publishedAtUnixNano=i,
        partition="1",
        struct_array_column=Row(
            element=[str(random.randint(0, 100)) for i in range(random.randint(1, 100))],
        ),
        struct_column=Row(
            nested_array_column=Row(
                element=[str(random.randint(0, 100)) for i in range(random.randint(1, 100))],
            ),
        ),
        # This padding ensures files are large enough to reproduce the data loss
        **{f"col_{i}": str(uuid.uuid4()) for i in range(100)},
    )
    for i in range(5)
]

df_dummy = spark.createDataFrame(dummy_data)

# This was tested in S3
PATH = f"{OUTPUT_PATH}"
df_dummy.write.format("hudi").options(**hudi_options).mode("append").save(PATH)

read_df = spark.read.format("hudi").load(PATH)
data = read_df.take(1)
init_count = read_df.count()

# This upsert should be a no-op (re-writing 1 existing row)
upsert_df = spark.createDataFrame(data, read_df.schema)
upsert_df.write.format("hudi").options(**clustering_hudi_options).mode("append").save(PATH)

read_df = spark.read.format("hudi").load(PATH)
final_count = read_df.count()
print(f"{init_count}, {final_count}")
```

The schema is:

```
root
 |-- id: struct (nullable = true)
 |    |-- value: string (nullable = true)
 |-- publishedAtUnixNano: long (nullable = true)
 |-- partition: string (nullable = true)
 |-- struct_array_column: struct (nullable = true)
 |    |-- element: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- struct_column: struct (nullable = true)
 |    |-- nested_array_column: struct (nullable = true)
 |    |    |-- element: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |-- col_0: string (nullable = true)
 |-- col_1: string (nullable = true)
 |-- col_2: string (nullable = true)
 |-- col_3: string (nullable = true)
 ...
```

We expect init_count and final_count to be the same, but it's actually (may vary):

```
5, 48000
```
Re: [PR] [HUDI-7125] Fix bugs for CDC queries [hudi]
yihua commented on code in PR #10144: URL: https://github.com/apache/hudi/pull/10144#discussion_r1414533367

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:

```diff
@@ -310,6 +317,15 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
       _: PartitionedFile => Iterator.empty
     }
 
+    // Note that for CDC reader, the underlying data schema is stored in the 'options' to separate from the CDC schema.
+    val rawDataSchemaStr = options.getOrElse(rawDataSchema, "")
```

Review Comment: `rawDataSchemaStr` is the table schema. Can the table schema be directly read here instead of being passed in? Does the `tableSchema` represent the actual data schema?

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCDCFileIndex.scala:

```diff
@@ -42,29 +42,34 @@ class HoodieCDCFileIndex (override val spark: SparkSession,
   extends HoodieIncrementalFileIndex(
     spark, metaClient, schemaSpec, options, fileStatusCache, includeLogFiles, shouldEmbedFileSlices
   ) with FileIndex {
+  private val emptyPartitionPath: String = "empty_partition_path";
   val cdcRelation: CDCRelation = CDCRelation.getCDCRelation(spark.sqlContext, metaClient, options)
   val cdcExtractor: HoodieCDCExtractor = cdcRelation.cdcExtractor
 
   override def listFiles(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[PartitionDirectory] = {
-    val partitionToFileGroups = cdcExtractor.extractCDCFileSplits().asScala.groupBy(_._1.getPartitionPath).toSeq
-    partitionToFileGroups.map {
-      case (partitionPath, fileGroups) =>
-        val fileGroupIds: List[FileStatus] = fileGroups.map { fileGroup => {
-          // We create a fake FileStatus to wrap the information of HoodieFileGroupId, which are used
-          // later to retrieve the corresponding CDC file group splits.
-          val fileGroupId: HoodieFileGroupId = fileGroup._1
-          new FileStatus(0, true, 0, 0, 0,
-            0, null, "", "", null,
-            new Path(fileGroupId.getPartitionPath, fileGroupId.getFileId))
-        }}.toList
-        val partitionValues: InternalRow = new GenericInternalRow(doParsePartitionColumnValues(
-          metaClient.getTableConfig.getPartitionFields.get(), partitionPath).asInstanceOf[Array[Any]])
+    cdcExtractor.extractCDCFileSplits().asScala.map {
+      case (fileGroupId, fileSplits) =>
+        val partitionPath = if (fileGroupId.getPartitionPath.isEmpty) emptyPartitionPath else fileGroupId.getPartitionPath
```

Review Comment: using empty String instead for non-partitioned table?
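The review thread above turns on one detail: a non-partitioned table produces an empty partition path, so a sentinel string stands in wherever a non-empty path is required. A minimal, language-neutral sketch of that grouping logic (the helper name `group_splits_by_partition` and the tuple shape are illustrative, not Hudi's actual API):

```python
# Sketch: group CDC file splits by partition path, substituting a sentinel
# for the empty path of a non-partitioned table, as discussed in the review.
EMPTY_PARTITION_PATH = "empty_partition_path"

def group_splits_by_partition(splits):
    """splits: iterable of (partition_path, file_id) pairs."""
    grouped = {}
    for partition_path, file_id in splits:
        # An empty path would later fail the "non-empty partition path"
        # requirement, so map it to the sentinel instead.
        key = partition_path if partition_path else EMPTY_PARTITION_PATH
        grouped.setdefault(key, []).append(file_id)
    return grouped

grouped = group_splits_by_partition([("", "f1"), ("dt=2023-12-01", "f2"), ("", "f3")])
print(grouped)
```

The sentinel keeps downstream code that validates partition paths working; whether an empty string could serve instead is exactly the open question in the thread.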
Re: [PR] [MINOR] Allow removal of column stats from metadata table for externally created files [hudi]
hudi-bot commented on PR #10238: URL: https://github.com/apache/hudi/pull/10238#issuecomment-1839467433

## CI report:

* fce0e1eb204f4377fb9f307168b43017d3acf73d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21291)
Re: [PR] [HUDI-6980] Fixing closing of write client on failure scenarios [hudi]
hudi-bot commented on PR #10224: URL: https://github.com/apache/hudi/pull/10224#issuecomment-1839408514

## CI report:

* 05e298ae8c265de111cababf120f194d960f0472 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21261)
* c537ff2d4e35cdb4e0f1086ee991d2f9cf53cfef Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21292)
Re: [PR] [HUDI-6980] Fixing closing of write client on failure scenarios [hudi]
hudi-bot commented on PR #10224: URL: https://github.com/apache/hudi/pull/10224#issuecomment-1839397147

## CI report:

* 05e298ae8c265de111cababf120f194d960f0472 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21261)
* c537ff2d4e35cdb4e0f1086ee991d2f9cf53cfef UNKNOWN
Re: [I] [SUPPORT] Large gap between stages on read [hudi]
noahtaite commented on issue #10239: URL: https://github.com/apache/hudi/issues/10239#issuecomment-1839395374

Stage 1 stack trace:

```
org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:103)
org.apache.hudi.metadata.FileSystemBackedTableMetadata.getAllFilesInPartitions(FileSystemBackedTableMetadata.java:157)
org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPathFiles(BaseHoodieTableFileIndex.java:358)
org.apache.hudi.BaseHoodieTableFileIndex.loadFileSlicesForPartitions(BaseHoodieTableFileIndex.java:249)
org.apache.hudi.BaseHoodieTableFileIndex.ensurePreloadedPartitions(BaseHoodieTableFileIndex.java:241)
org.apache.hudi.BaseHoodieTableFileIndex.getInputFileSlices(BaseHoodieTableFileIndex.java:227)
org.apache.hudi.SparkHoodieTableFileIndex.listFileSlices(SparkHoodieTableFileIndex.scala:172)
org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:223)
org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:65)
org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:353)
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:365)
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$pruneFilterProject$1(DataSourceStrategy.scala:399)
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:478)
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:398)
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:365)
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
```

-- Large Gap --

Stage 2 stack trace:

```
org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
org.apache.hudi.client.common.HoodieSparkEngineContext.flatMap(HoodieSparkEngineContext.java:137)
org.apache.hudi.metadata.FileSystemBackedTableMetadata.getPartitionPathWithPathPrefix(FileSystemBackedTableMetadata.java:109)
org.apache.hudi.metadata.FileSystemBackedTableMetadata.lambda$getPartitionPathWithPathPrefixes$0(FileSystemBackedTableMetadata.java:91)
java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:269)
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
org.apache.hudi.metadata.FileSystemBackedTableMetadata.getPartitionPathWithPathPrefixes(FileSystemBackedTableMetadata.java:95)
org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPaths(BaseHoodieTableFileIndex.java:281)
org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:206)
org.apache.hudi.SparkHoodieTableFileIndex.listMatchingPartitionPaths(SparkHoodieTableFileIndex.scala:205)
org.apache.hudi.SparkHoodieTableFileIndex.listFileSlices(SparkHoodieTableFileIndex.scala:171)
org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:223)
org.apache.hudi.BaseMergeOnReadSnapshotRelation.collectFileSplits(MergeOnReadSnapshotRelation.scala:65)
org.apache.hudi.HoodieBaseRelation.buildScan(HoodieBaseRelation.scala:353)
org.apache.spark.sql.execution.datasources.DataSourceStrategy$.$anonfun$apply$4(DataSourceStrategy.scala:365)
```

Based on some digging in the code, I believe `FileSystemBackedTableMetadata` implies that my Hudi metadata table isn't being referenced correctly. Trying to dig into my metadata stats to confirm this. My readers should be using "hoodie.metadata.enable" by default.
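Since `FileSystemBackedTableMetadata` in the traces indicates a fallback to direct storage listing, one cheap thing to rule out is whether the reader is actually receiving `hoodie.metadata.enable`. A minimal sketch of setting it explicitly rather than relying on the engine default (the helper name `metadata_read_options` is made up for illustration; the option keys are Hudi config keys):

```python
# Sketch: build Hudi read options that explicitly enable metadata-table-based
# file listing. "hoodie.metadata.enable" and "hoodie.datasource.query.type"
# are Hudi config keys; the helper itself is hypothetical.

def metadata_read_options(extra=None):
    """Return Hudi read options with metadata-based listing switched on."""
    opts = {"hoodie.metadata.enable": "true"}
    if extra:
        opts.update(extra)
    return opts

opts = metadata_read_options({"hoodie.datasource.query.type": "snapshot"})
# With PySpark this would be used roughly as:
#   spark.read.format("hudi").options(**opts).load(base_path)
print(opts)
```

If the stack trace still shows `FileSystemBackedTableMetadata` with the option set explicitly, the fallback is happening for another reason (for example, a metadata table that is missing or out of sync on the table itself).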
Re: [I] [SUPPORT] Additional records in dataset after clustering [hudi]
noahtaite commented on issue #10172: URL: https://github.com/apache/hudi/issues/10172#issuecomment-1839189354

Bump... I think data inconsistency after clustering should be treated as a critical-priority investigation.
[I] [SUPPORT] Large gap between stages on read [hudi]
noahtaite opened a new issue, #10239: URL: https://github.com/apache/hudi/issues/10239

**Describe the problem you faced**

I have multiple applications reading our 120-table, 1PB+ Hudi OLAP data lake that are seeing gaps of 1hr+ in our application stages when collecting the data: https://github.com/apache/hudi/assets/24283126/a03dd51b-5f0e-4214-a731-2bf81da95926 (note a 1hr gap between stages 12 + 13)

I have been able to consistently reproduce this in my dev environment and see the following behaviour:
- Calling .load() on the table finishes quickly.
- Calling .count() on a specific partition has all jobs in the Spark History Server complete in under 10 minutes, but then a 1hr gap is observed before the output of the count is reported.
- During the gap, my cluster auto-scales down to 1 executor.

**To Reproduce**

Steps to reproduce the behavior:
1. 20TB+ Hudi table with ~250k partitions, metadata enabled.
2. Load + count a single partition.
3. Observe a large gap when just a single executor is running.
4. Slow read performance.

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version : 0.13.1
* Spark version : 3.4.0
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No

**Additional context**

**Stacktrace**

I'm just trying to gain a base-level understanding of where this time is going, or if someone can point me in the correct direction for troubleshooting. The runtime cost is quite low due to the scaling down, but analytics developers are not happy with their applications slowing down.
Re: [PR] [MINOR] Allow removal of column stats from metadata table for externally created files [hudi]
hudi-bot commented on PR #10238: URL: https://github.com/apache/hudi/pull/10238#issuecomment-1839112667

## CI report:

* fce0e1eb204f4377fb9f307168b43017d3acf73d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21291)
Re: [PR] [MINOR] Allow removal of column stats from metadata table for externally created files [hudi]
hudi-bot commented on PR #10238: URL: https://github.com/apache/hudi/pull/10238#issuecomment-1839098202

## CI report:

* fce0e1eb204f4377fb9f307168b43017d3acf73d UNKNOWN
Re: [I] [SUPPORT] Clean action failure triggers an exception while trying to check whether metadata is a table [hudi]
shubhamn21 commented on issue #10127: URL: https://github.com/apache/hudi/issues/10127#issuecomment-1839090525

```
23/12/04 08:00:23 WARN CleanActionExecutor: Failed to perform previous clean operation, instant: [==>20231204075005981__clean__INFLIGHT]
java.lang.IllegalArgumentException
    at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:31)
```

Hi @nsivabalan, tagging you here as I had seen you as an assignee for a similar [issue](https://github.com/apache/hudi/issues/6463). I am seeing the above clean-action warning, which prompts subsequent failures in the job. Has this got something to do with S3 performance?
Re: [I] [SUPPORT] Handling of DELETE operation using Debezium Kafka connector [hudi]
ad1happy2go commented on issue #10181: URL: https://github.com/apache/hudi/issues/10181#issuecomment-1839076172 @seethb Full details on a similar issue are in https://github.com/apache/hudi/issues/9143 Go over it and let us know if you have any doubts. Thanks.
[PR] [MINOR] Allow removal of column stats from metadata table for externally created files [hudi]
the-other-tim-brown opened a new pull request, #10238: URL: https://github.com/apache/hudi/pull/10238

### Change Logs

For files that are not created by Hudi but added to the table (the zero-copy bootstrap or OneTable case), we are unable to remove the column stats after these files are removed from the table view.

### Impact

Allows proper cleanup of the metadata table's stats partition.

### Risk level (write none, low medium or high below)

None

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_
- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Re: [I] [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore [hudi]
soumilshah1995 commented on issue #10231: URL: https://github.com/apache/hudi/issues/10231#issuecomment-1839051919 @ad1happy2go do you think you can help me set up the Derby metastore? I believe I already have one, but I am confused about the steps. I would appreciate it if we could catch up on Slack so you can help me understand this a bit. It would be a great opportunity for me to learn and pass the knowledge on.
Re: [I] [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore [hudi]
ad1happy2go commented on issue #10231: URL: https://github.com/apache/hudi/issues/10231#issuecomment-1839047567 You can find the hive scripts here - https://github.com/apache/hive/tree/master/metastore/scripts/upgrade
Re: [I] [SUPPORT] Issue with Hudi Hive Sync Tool with Hive MetaStore [hudi]
ad1happy2go commented on issue #10231: URL: https://github.com/apache/hudi/issues/10231#issuecomment-1839046110 @soumilshah1995 Have you configured the external metastore? We need to set up the hive metastore tables.
Re: [I] [SUPPORT] What is the priority for the parameter settings of hudi to take effect [hudi]
ad1happy2go commented on issue #10236: URL: https://github.com/apache/hudi/issues/10236#issuecomment-1839001438 @JoshuaZhuCN The order you listed is correct. Keep in mind that tblproperties are only used on the write path.
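Layered configuration like this generally resolves by overlaying settings in priority order, with the highest-priority layer winning on key collisions. A minimal sketch of that overlay logic follows; the three layer names below (session defaults, tblproperties, per-write options) are illustrative assumptions, not Hudi's documented resolution chain:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of layered config resolution: layers are applied low-to-high
// priority, so a later layer overrides an earlier one on key collisions.
class LayeredConfig {

    @SafeVarargs
    static Map<String, String> resolve(Map<String, String>... layersLowToHigh) {
        Map<String, String> merged = new HashMap<>();
        for (Map<String, String> layer : layersLowToHigh) {
            merged.putAll(layer); // higher-priority layer wins
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> sessionDefaults = new HashMap<>();
        sessionDefaults.put("hoodie.datasource.write.operation", "upsert");

        Map<String, String> tblProperties = new HashMap<>();
        tblProperties.put("hoodie.datasource.write.operation", "insert");

        Map<String, String> writeOptions = new HashMap<>();
        writeOptions.put("hoodie.datasource.write.operation", "bulk_insert");

        Map<String, String> effective =
            resolve(sessionDefaults, tblProperties, writeOptions);
        // The per-write option wins because it is the last (highest) layer.
        System.out.println(effective.get("hoodie.datasource.write.operation"));
    }
}
```

The point of the comment above follows from this model: since tblproperties are only consulted on the write path, read-side behavior is unaffected by them regardless of where they sit in the overlay order.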
[jira] [Assigned] (HUDI-7154) Hudi Streamer with row writer enabled hits NPE with empty batch
[ https://issues.apache.org/jira/browse/HUDI-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan reassigned HUDI-7154:
-----------------------------------------

    Assignee: sivabalan narayanan

> Hudi Streamer with row writer enabled hits NPE with empty batch
> ---------------------------------------------------------------
>
>                 Key: HUDI-7154
>                 URL: https://issues.apache.org/jira/browse/HUDI-7154
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Ethan Guo
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.14.1
>
> Hudi Streamer with row writer enabled hits NPE with empty batch (the
> checkpoint has advanced)
> {code:java}
> java.lang.NullPointerException
>     at org.apache.hudi.HoodieSparkSqlWriter$.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala:1190)
>     at org.apache.hudi.HoodieSparkSqlWriter.getBulkInsertRowConfig(HoodieSparkSqlWriter.scala)
>     at org.apache.hudi.utilities.streamer.StreamSync.prepareHoodieConfigForRowWriter(StreamSync.java:801)
>     at org.apache.hudi.utilities.streamer.StreamSync.writeToSink(StreamSync.java:939)
>     at org.apache.hudi.utilities.streamer.StreamSync.writeToSinkAndDoMetaSync(StreamSync.java:819)
>     at org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:458)
>     at org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.ingestOnce(HoodieStreamer.java:850)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) {code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7154) Hudi Streamer with row writer enabled hits NPE with empty batch
[ https://issues.apache.org/jira/browse/HUDI-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan closed HUDI-7154.
-------------------------------------
    Resolution: Fixed
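The fix for this NPE (PR #10198, in the commit below) wraps the writer schema in an Option so an empty batch, which has no schema, no longer dereferences null. The guard pattern can be sketched as follows; `buildConfig` and the `"avro.schema"` key are hypothetical stand-ins, not Hudi's real API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch of the null-guard pattern behind the HUDI-7154 fix: only apply
// schema-dependent settings when a schema is actually present, i.e. when
// the batch was non-empty. Names here are illustrative stand-ins.
class RowWriterConfigSketch {

    static Map<String, String> buildConfig(Optional<String> writerSchema,
                                           Map<String, String> baseOpts) {
        Map<String, String> opts = new HashMap<>(baseOpts);
        // Before the fix, the schema was dereferenced unconditionally,
        // which is exactly the NPE on an empty batch.
        writerSchema.ifPresent(schema -> opts.put("avro.schema", schema));
        return opts;
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("path", "/tmp/table");

        // Empty batch: no schema, no NPE; the schema key is simply absent.
        System.out.println(buildConfig(Optional.empty(), base).containsKey("avro.schema"));
        // Non-empty batch: the schema key is carried through.
        System.out.println(buildConfig(Optional.of("{\"type\":\"record\"}"), base).get("avro.schema"));
    }
}
```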
(hudi) branch master updated: [HUDI-6822] Fix deletes handling in hbase index when partition path is updated (#9630)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new b9fb9f616e6  [HUDI-6822] Fix deletes handling in hbase index when partition path is updated (#9630)
b9fb9f616e6 is described below

commit b9fb9f616e6585b5e92f796e50ef93747d38fb49
Author: flashJd
AuthorDate: Tue Dec 5 00:08:35 2023 +0800

    [HUDI-6822] Fix deletes handling in hbase index when partition path is updated (#9630)

    Co-authored-by: Balaji Varadarajan
---
 .../org/apache/hudi/index/HoodieIndexUtils.java    |  1 +
 .../metadata/HoodieBackedTableMetadataWriter.java  | 68 +---
 .../hudi/index/hbase/SparkHoodieHBaseIndex.java    |  4 +
 .../index/hbase/TestSparkHoodieHBaseIndex.java     | 95 ++
 .../org/apache/hudi/common/model/HoodieRecord.java | 23 +-
 .../hudi/common/model/HoodieRecordDelegate.java    | 32 ++--
 .../model/TestHoodieRecordSerialization.scala      | 12 +--
 7 files changed, 140 insertions(+), 95 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java
index 33e8d501943..de3d181ad06 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java
@@ -323,6 +323,7 @@ public class HoodieIndexUtils {
       } else {
         // merged record has a different partition: issue a delete to the old partition and insert the merged record to the new partition
         HoodieRecord deleteRecord = createDeleteRecord(config, existing.getKey());
+        deleteRecord.setIgnoreIndexUpdate(true);
         return Arrays.asList(tagRecord(deleteRecord, existing.getCurrentLocation()), merged).iterator();
       }
     });
diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
index ecdf93eda1d..781a9024117 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
@@ -29,10 +29,8 @@ import org.apache.hudi.client.WriteStatus;
 import org.apache.hudi.common.config.HoodieMetadataConfig;
 import org.apache.hudi.common.config.SerializableConfiguration;
 import org.apache.hudi.common.data.HoodieData;
-import org.apache.hudi.common.data.HoodiePairData;
 import org.apache.hudi.common.engine.HoodieEngineContext;
 import org.apache.hudi.common.fs.FSUtils;
-import org.apache.hudi.common.function.SerializableFunction;
 import org.apache.hudi.common.model.FileSlice;
 import org.apache.hudi.common.model.HoodieBaseFile;
 import org.apache.hudi.common.model.HoodieCommitMetadata;
@@ -89,17 +87,14 @@ import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Collections;
 import java.util.HashMap;
-import java.util.Iterator;
 import java.util.LinkedList;
 import java.util.List;
 import java.util.Locale;
 import java.util.Map;
-import java.util.Objects;
 import java.util.Set;
 import java.util.function.Function;
 import java.util.stream.Collectors;
 import java.util.stream.IntStream;
-import java.util.stream.Stream;

 import static org.apache.hudi.avro.HoodieAvroUtils.addMetadataFields;
 import static org.apache.hudi.common.config.HoodieMetadataConfig.DEFAULT_METADATA_POPULATE_META_FIELDS;
@@ -939,8 +934,8 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM
     // Updates for record index are created by parsing the WriteStatus which is a hudi-client object. Hence, we cannot yet move this code
     // to the HoodieTableMetadataUtil class in hudi-common.
-    HoodieData updatesFromWriteStatuses = getRecordIndexUpdates(writeStatus);
-    HoodieData additionalUpdates = getRecordIndexAdditionalUpdates(updatesFromWriteStatuses, commitMetadata);
+    HoodieData updatesFromWriteStatuses = getRecordIndexUpserts(writeStatus);
+    HoodieData additionalUpdates = getRecordIndexAdditionalUpserts(updatesFromWriteStatuses, commitMetadata);
     partitionToRecordMap.put(RECORD_INDEX, updatesFromWriteStatuses.union(additionalUpdates));
     updateFunctionalIndexIfPresent(commitMetadata, instantTime, partitionToRecordMap);
     return partitionToRecordMap;
@@ -953,7 +948,7 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableM
     processAndCommit(instantTime, () -> {
       Map> partitionToRecordMap = HoodieTableMetadataUtil.convertMetadataToRecor
Re: [PR] [HUDI-6822] Fix deletes handling in hbase index when partition path is updated [hudi]
nsivabalan merged PR #9630: URL: https://github.com/apache/hudi/pull/9630
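The core of the HUDI-6822 fix merged above, visible in the commit diff, is a single line: the delete issued to the record's old partition is flagged with `setIgnoreIndexUpdate(true)` so it does not feed back into the index. A sketch of that partition-change flow, with a simplified record type (not Hudi's `HoodieRecord`):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of handling a record whose partition path changed:
// emit a delete to the old partition (flagged so the index skips it)
// plus the merged record into the new partition. Types are illustrative.
class PartitionMoveSketch {

    static class Record {
        final String key;
        final String partition;
        final String op;
        boolean ignoreIndexUpdate;

        Record(String key, String partition, String op) {
            this.key = key;
            this.partition = partition;
            this.op = op;
        }
    }

    static List<Record> reconcile(Record existing, Record incoming) {
        List<Record> out = new ArrayList<>();
        if (!existing.partition.equals(incoming.partition)) {
            Record delete = new Record(existing.key, existing.partition, "DELETE");
            delete.ignoreIndexUpdate = true; // mirrors setIgnoreIndexUpdate(true) in the fix
            out.add(delete);
        }
        out.add(incoming);
        return out;
    }

    public static void main(String[] args) {
        Record oldRec = new Record("id1", "2023/12/01", "UPSERT");
        Record newRec = new Record("id1", "2023/12/04", "UPSERT");
        List<Record> plan = reconcile(oldRec, newRec);
        System.out.println(plan.size());                 // delete plus upsert
        System.out.println(plan.get(0).ignoreIndexUpdate);
    }
}
```

Without the flag, the old-partition delete would also be applied to the index, clobbering the entry that the new-partition upsert has just written, which is the bug the PR addresses.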
(hudi) branch master updated: [HUDI-7154] Fix NPE from empty batch with row writer enabled in Hudi Streamer (#10198)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new df4cca8aa56  [HUDI-7154] Fix NPE from empty batch with row writer enabled in Hudi Streamer (#10198)
df4cca8aa56 is described below

commit df4cca8aa560d21bde1bf4c1a4079d3d2f760c6f
Author: Y Ethan Guo
AuthorDate: Mon Dec 4 08:06:59 2023 -0800

    [HUDI-7154] Fix NPE from empty batch with row writer enabled in Hudi Streamer (#10198)

    Co-authored-by: sivabalan
---
 .../org/apache/hudi/HoodieSparkSqlWriter.scala     | 26 +++
 .../apache/hudi/utilities/streamer/StreamSync.java |  5 ++-
 .../deltastreamer/TestHoodieDeltaStreamer.java     | 51 ++
 3 files changed, 62 insertions(+), 20 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index b8dbb18287e..e925e2a5423 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -155,19 +155,27 @@ object HoodieSparkSqlWriter {
     Metrics.shutdownAllMetrics()
   }

-  def getBulkInsertRowConfig(writerSchema: Schema, hoodieConfig: HoodieConfig,
+  def getBulkInsertRowConfig(writerSchema: org.apache.hudi.common.util.Option[Schema], hoodieConfig: HoodieConfig,
                              basePath: String, tblName: String): HoodieWriteConfig = {
-    val writerSchemaStr = writerSchema.toString
-
+    var writerSchemaStr : String = null
+    if ( writerSchema.isPresent) {
+      writerSchemaStr = writerSchema.get().toString
+    }
     // Make opts mutable since it could be modified by tryOverrideParquetWriteLegacyFormatProperty
-    val opts = mutable.Map() ++ hoodieConfig.getProps.toMap ++
-      Map(HoodieWriteConfig.AVRO_SCHEMA_STRING.key -> writerSchemaStr)
+    val optsWithoutSchema = mutable.Map() ++ hoodieConfig.getProps.toMap
+    val opts = if (writerSchema.isPresent) {
+      optsWithoutSchema ++ Map(HoodieWriteConfig.AVRO_SCHEMA_STRING.key -> writerSchemaStr)
+    } else {
+      optsWithoutSchema
+    }
+
+    if (writerSchema.isPresent) {
+      // Auto set the value of "hoodie.parquet.writelegacyformat.enabled"
+      tryOverrideParquetWriteLegacyFormatProperty(opts, convertAvroSchemaToStructType(writerSchema.get))
+    }

-    // Auto set the value of "hoodie.parquet.writelegacyformat.enabled"
-    tryOverrideParquetWriteLegacyFormatProperty(opts, convertAvroSchemaToStructType(writerSchema))
     DataSourceUtils.createHoodieConfig(writerSchemaStr, basePath, tblName, opts)
   }
-
 }

 class HoodieSparkSqlWriterInternal {
@@ -779,7 +787,7 @@ class HoodieSparkSqlWriterInternal {
     val sqlContext = writeClient.getEngineContext.asInstanceOf[HoodieSparkEngineContext].getSqlContext
     val jsc = writeClient.getEngineContext.asInstanceOf[HoodieSparkEngineContext].getJavaSparkContext

-    val writeConfig = HoodieSparkSqlWriter.getBulkInsertRowConfig(writerSchema, hoodieConfig, basePath.toString, tblName)
+    val writeConfig = HoodieSparkSqlWriter.getBulkInsertRowConfig(org.apache.hudi.common.util.Option.of(writerSchema), hoodieConfig, basePath.toString, tblName)
     val overwriteOperationType = Option(hoodieConfig.getString(HoodieInternalConfig.BULKINSERT_OVERWRITE_OPERATION_TYPE))
       .map(WriteOperationType.fromValue)
       .orNull
diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
index 19289e650c4..ff2debc8dcc 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
@@ -757,7 +757,8 @@ public class StreamSync implements Serializable, Closeable {
     hoodieConfig.setValue(DataSourceWriteOptions.PAYLOAD_CLASS_NAME().key(), cfg.payloadClassName);
     hoodieConfig.setValue(HoodieWriteConfig.KEYGENERATOR_CLASS_NAME.key(), HoodieSparkKeyGeneratorFactory.getKeyGeneratorClassName(props));
     hoodieConfig.setValue("path", cfg.targetBasePath);
-    return HoodieSparkSqlWriter.getBulkInsertRowConfig(writerSchema, hoodieConfig, cfg.targetBasePath, cfg.targetTableName);
+    return HoodieSparkSqlWriter.getBulkInsertRowConfig(writerSchema != InputBatch.NULL_SCHEMA ? Option.of(writerSchema) : Option.empty(),
+        hoodieConfig, cfg.targetBasePath, cfg.targetTableName);
   }

   /**
@@ -899,7 +900,7 @@ public class StreamSync implements Serializable, Closeable {
     instantTime = startCommit(instantTime, !autoGenerateRecordKeys);

     if (useRow
Re: [PR] [HUDI-7154] Fix NPE from empty batch with row writer enabled in Hudi Streamer [hudi]
nsivabalan merged PR #10198: URL: https://github.com/apache/hudi/pull/10198
Re: [PR] [HUDI-7154] Fix NPE from empty batch with row writer enabled in Hudi Streamer [hudi]
nsivabalan commented on PR #10198: URL: https://github.com/apache/hudi/pull/10198#issuecomment-1838961926 https://github.com/apache/hudi/assets/513218/0c31514a-8f93-41a3-adbc-63ccdceb2e5e
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226: URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838828006 ## CI report: * 22f5d8a5c8f2719aa9602958913fef1e2ee969b9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21289)
Re: [I] Failed to create Marker file [hudi]
GergelyKalmar commented on issue #7909: URL: https://github.com/apache/hudi/issues/7909#issuecomment-1838811464 We're using Hudi `0.12.1` via AWS Glue and we also started facing the "Failed to create marker file" errors. We tried to change the configuration and use `hoodie.write.markers.type=DIRECT`, however, now we're seeing throttling errors:

```
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :20
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329)
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1525)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:138)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1517)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: xxx; S3 Extended Request ID: xxx; Proxy: null), S3 Extended Request ID: xxx
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(Amaz
[jira] [Updated] (HUDI-2857) HoodieTableMetaClient.TEMPFOLDER_NAME causes IllegalArgumentException in windows environment
[ https://issues.apache.org/jira/browse/HUDI-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wang fanming updated HUDI-2857:
-------------------------------
    Issue Type: Improvement  (was: Bug)
      Priority: Minor  (was: Major)

> HoodieTableMetaClient.TEMPFOLDER_NAME causes IllegalArgumentException in
> windows environment
> ------------------------------------------------------------------------
>
>                 Key: HUDI-2857
>                 URL: https://issues.apache.org/jira/browse/HUDI-2857
>             Project: Apache Hudi
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>         Environment: win10 spark2.4.4 hudi 0.9.0
>            Reporter: wang fanming
>            Priority: Minor
>              Labels: core-flow-ds, easyfix, sev:high
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> {code:java}
> val tableName = "cow_prices"
> val basePath = "hdfs://x:9000//tmp//cow_prices//"
> val dataGen = new DataGenerator
> // spark-shell
> val inserts = convertToStringList(dataGen.generateInserts(10))
> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
> df.write.format("hudi").
>   options(getQuickstartWriteConfigs).
>   option(PRECOMBINE_FIELD.key(), "ts").
>   option(RECORDKEY_FIELD.key(), "uuid").
>   option(PARTITIONPATH_FIELD.key(), "partitionpath").
>   option(TBL_NAME.key(), tableName).
>   mode(Overwrite).
>   save(basePath) {code}
> The above is the sample code provided by Hudi's official website. I plan to
> run the Spark program directly on the win10 environment and store the data on
> the remote HDFS. The following exception occurred:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Not in marker dir. Marker
> Path=hdfs://10.38.23.2:9000/tmp/cow_prices/.hoodie\.temp/20211125163531/asia/india/chennai/c9218a3b-f248-436b-b41f-4a0b968dfff2-0_2-27-29_20211125163531.parquet.marker.CREATE,
> Expected Marker Root=/tmp/cow_prices/.hoodie/.temp/20211125163531
>     at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
>     at org.apache.hudi.common.util.MarkerUtils.stripMarkerFolderPrefix(MarkerUtils.java:87)
>     at org.apache.hudi.common.util.MarkerUtils.stripMarkerFolderPrefix(MarkerUtils.java:75)
>     at org.apache.hudi.table.marker.DirectWriteMarkers.translateMarkerToDataPath(DirectWriteMarkers.java:153)
>     at org.apache.hudi.table.marker.DirectWriteMarkers.lambda$createdAndMergedDataPaths$69cdea3b$1(DirectWriteMarkers.java:142)
>     at org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:78)
>     at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
>     at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
>     at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
>     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1334)
>     at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
>     at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
>     at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>     at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:123)
>     at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> After investigation, it was found that the root cause of the abnormality was
> that
> {code:java}
> HoodieTableMetaClient.TEMPFOLDER_