[jira] [Updated] (HUDI-1739) Standardize usage of replacecommit files across the code base
[ https://issues.apache.org/jira/browse/HUDI-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-1739: -- Reviewers: Sagar Sumit > Standardize usage of replacecommit files across the code base > - > > Key: HUDI-1739 > URL: https://issues.apache.org/jira/browse/HUDI-1739 > Project: Apache Hudi > Issue Type: Improvement > Components: writer-core > Reporter: Jagmeet Bali > Assignee: Susu Dong > Priority: Critical > > Fixes can be to: > # Ignore empty replacecommit.requested files. > # Standardize the replacecommit.requested format across all invocations, be it from clustering or this use case. -- This message was sent by Atlassian Jira (v8.20.10#820010)
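Fix #1 could be sketched as a guard before deserializing the plan. This is a hypothetical helper for illustration only, not Hudi's actual timeline API; the class and method names are invented here:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch (not Hudi's actual timeline API) of fix #1 from the
// issue: treat a zero-length replacecommit.requested file as "no clustering
// plan" instead of failing while deserializing it.
public class ReplaceCommitGuard {

  // Pure check: an empty payload carries no plan to parse.
  public static boolean hasParseablePlan(byte[] content) {
    return content != null && content.length > 0;
  }

  // File-based convenience wrapper over the pure check.
  public static boolean hasParseablePlan(Path requestedFile) throws IOException {
    return Files.exists(requestedFile) && hasParseablePlan(Files.readAllBytes(requestedFile));
  }

  public static void main(String[] args) throws IOException {
    Path empty = Files.createTempFile("20230901120000", ".replacecommit.requested");
    System.out.println(hasParseablePlan(empty)); // empty file -> false, so the instant is skipped
  }
}
```

Fix #2 (a single serialization format for all invocations) would then let this guard be the only special case readers need.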
[GitHub] [hudi] hudi-bot commented on pull request #9558: [HUDI-6481] Support run multi tables services in a single spark job
hudi-bot commented on PR #9558: URL: https://github.com/apache/hudi/pull/9558#issuecomment-1702225241 ## CI report: * 1640805e55e219b1c512bde9650849613c03e0b9 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19598) * ffc02724376dc67f1d5426fc1d95cbf1725d0261 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19603) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] empcl commented on a diff in pull request #9592: automatically create a database when using the flink catalog dfs mode
empcl commented on code in PR #9592: URL: https://github.com/apache/hudi/pull/9592#discussion_r1312612204 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java: ## Review Comment: because there is already such a check in `org.apache.hudi.table.catalog.TestHoodieCatalog#testDatabaseExists`
[GitHub] [hudi] hudi-bot commented on pull request #9558: [HUDI-6481] Support run multi tables services in a single spark job
hudi-bot commented on PR #9558: URL: https://github.com/apache/hudi/pull/9558#issuecomment-1702217716 ## CI report: * d0a5621c43699e3cd636c99ef6cc048788f04459 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19573) * 1640805e55e219b1c512bde9650849613c03e0b9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19598) * ffc02724376dc67f1d5426fc1d95cbf1725d0261 UNKNOWN
[GitHub] [hudi] danny0405 commented on a diff in pull request #9592: automatically create a database when using the flink catalog dfs mode
danny0405 commented on code in PR #9592: URL: https://github.com/apache/hudi/pull/9592#discussion_r1312603674 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java: ## Review Comment: Can you add a test case where the default database is created?
[jira] [Updated] (HUDI-6732) Handle wildcards for partition paths passed in via spark-sql
[ https://issues.apache.org/jira/browse/HUDI-6732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6732: - Fix Version/s: 1.0.0 > Handle wildcards for partition paths passed in via spark-sql > > > Key: HUDI-6732 > URL: https://issues.apache.org/jira/browse/HUDI-6732 > Project: Apache Hudi > Issue Type: Bug > Reporter: voon > Assignee: voon > Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Attachments: image-2023-08-21-14-59-27-095.png > > > The drop partition DDL is not handling wildcards properly, specifically for > partitions with wildcards that are submitted via the Spark-SQL entry point. > > {code:java} > ALTER TABLE table_x DROP PARTITION(partition_col="*") {code} > > The Spark-SQL entrypoint will url-encode special characters, causing the * > character to be url-encoded to {*}%2A{*}, as such, we will need to handle > that too. > > !image-2023-08-21-14-59-27-095.png!
[jira] [Closed] (HUDI-6732) Handle wildcards for partition paths passed in via spark-sql
[ https://issues.apache.org/jira/browse/HUDI-6732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6732. Resolution: Fixed Fixed via master branch: 64a05bc0b874fd2f3ce01c669840bb619550f033 > Handle wildcards for partition paths passed in via spark-sql > > > Key: HUDI-6732 > URL: https://issues.apache.org/jira/browse/HUDI-6732 > Project: Apache Hudi > Issue Type: Bug > Reporter: voon > Assignee: voon > Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Attachments: image-2023-08-21-14-59-27-095.png > > > The drop partition DDL is not handling wildcards properly, specifically for > partitions with wildcards that are submitted via the Spark-SQL entry point. > > {code:java} > ALTER TABLE table_x DROP PARTITION(partition_col="*") {code} > > The Spark-SQL entrypoint will url-encode special characters, causing the * > character to be url-encoded to {*}%2A{*}, as such, we will need to handle > that too. > > !image-2023-08-21-14-59-27-095.png!
[hudi] branch master updated: [HUDI-6732] Allow wildcards from Spark-SQL entrypoints for drop partition DDL (#9491)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 64a05bc0b87 [HUDI-6732] Allow wildcards from Spark-SQL entrypoints for drop partition DDL (#9491)

64a05bc0b87 is described below

commit 64a05bc0b874fd2f3ce01c669840bb619550f033
Author: voonhous
AuthorDate: Fri Sep 1 13:54:27 2023 +0800

    [HUDI-6732] Allow wildcards from Spark-SQL entrypoints for drop partition DDL (#9491)
---
 .../org/apache/hudi/HoodieSparkSqlWriter.scala   |  6 ++--
 .../sql/hudi/TestAlterTableDropPartition.scala   | 36 ++
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index cf78e514dda..6d0ce7d16bf 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -606,7 +606,8 @@ object HoodieSparkSqlWriter {
    */
   private def resolvePartitionWildcards(partitions: List[String], jsc: JavaSparkContext, cfg: HoodieConfig, basePath: String): List[String] = {
     //find out if any of the input partitions have wildcards
-    var (wildcardPartitions, fullPartitions) = partitions.partition(partition => partition.contains("*"))
+    //note: spark-sql may url-encode special characters (* -> %2A)
+    var (wildcardPartitions, fullPartitions) = partitions.partition(partition => partition.matches(".*(\\*|%2A).*"))

     if (wildcardPartitions.nonEmpty) {
       //get list of all partitions
@@ -621,7 +622,8 @@
       //prevent that from happening. Any text inbetween \\Q and \\E is considered literal
       //So we start the string with \\Q and end with \\E and then whenever we find a * we add \\E before
       //and \\Q after so all other characters besides .* will be enclosed between a set of \\Q \\E
-      val regexPartition = "^\\Q" + partition.replace("*", "\\E.*\\Q") + "\\E$"
+      val wildcardToken: String = if (partition.contains("*")) "*" else "%2A"
+      val regexPartition = "^\\Q" + partition.replace(wildcardToken, "\\E.*\\Q") + "\\E$"

       //filter all partitions with the regex and append the result to the list of full partitions
       fullPartitions = List.concat(fullPartitions, allPartitions.filter(_.matches(regexPartition)))

diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTableDropPartition.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTableDropPartition.scala
index 2261e83f7f9..b421732d270 100644
--- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTableDropPartition.scala
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTableDropPartition.scala
@@ -620,4 +620,40 @@ class TestAlterTableDropPartition extends HoodieSparkSqlTestBase {
       checkExceptionContain(s"ALTER TABLE $tableName DROP PARTITION($partition)")(errMsg)
     }
   }
+
+  test("Test drop partition with wildcards") {
+    withRecordType()(withTempDir { tmp =>
+      Seq("cow", "mor").foreach { tableType =>
+        val tableName = generateTableName
+        spark.sql(
+          s"""
+             |create table $tableName (
+             |  id int,
+             |  name string,
+             |  price double,
+             |  ts long,
+             |  partition_date_col string
+             |) using hudi
+             | location '${tmp.getCanonicalPath}/$tableName'
+             | tblproperties (
+             |  primaryKey ='id',
+             |  type = '$tableType',
+             |  preCombineField = 'ts'
+             | ) partitioned by (partition_date_col)
+          """.stripMargin)
+        spark.sql(s"insert into $tableName values " +
+          s"(1, 'a1', 10, 1000, '2023-08-01'), (2, 'a2', 10, 1000, '2023-08-02'), (3, 'a3', 10, 1000, '2023-09-01')")
+        checkAnswer(s"show partitions $tableName")(
+          Seq("partition_date_col=2023-08-01"),
+          Seq("partition_date_col=2023-08-02"),
+          Seq("partition_date_col=2023-09-01")
+        )
+        spark.sql(s"alter table $tableName drop partition(partition_date_col='2023-08-*')")
+        // show partitions will still return all partitions for tests, use select distinct as a stop-gap
+        checkAnswer(s"select distinct partition_date_col from $tableName")(
+          Seq("2023-09-01")
+        )
+      }
+    })
+  }
 }
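The wildcard-resolution idea in the patch above can be sketched in Java (the real code is Scala in `HoodieSparkSqlWriter`; `PartitionWildcardMatcher` here is a hypothetical class for illustration): every literal character is quoted between `\Q` and `\E`, and each wildcard token, `*` or its url-encoded form `%2A`, becomes `.*`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PartitionWildcardMatcher {

  // Build a regex where everything is literal (quoted between \Q and \E)
  // except the wildcard token, which becomes ".*". Spark-SQL may url-encode
  // "*" to "%2A" before it reaches the writer, so both forms are handled.
  public static String toRegex(String partition) {
    String wildcardToken = partition.contains("*") ? "*" : "%2A";
    return "^\\Q" + partition.replace(wildcardToken, "\\E.*\\Q") + "\\E$";
  }

  // Keep only the existing partitions that match the wildcard pattern.
  public static List<String> resolve(String pattern, List<String> allPartitions) {
    String regex = toRegex(pattern);
    return allPartitions.stream().filter(p -> p.matches(regex)).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> parts = Arrays.asList("2023-08-01", "2023-08-02", "2023-09-01");
    System.out.println(resolve("2023-08-*", parts));   // [2023-08-01, 2023-08-02]
    System.out.println(resolve("2023-08-%2A", parts)); // url-encoded wildcard, same result
  }
}
```

The `\Q`/`\E` quoting prevents regex metacharacters inside real partition values (dots, dashes, equals signs) from being interpreted, which is exactly why the patch builds the pattern this way.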
[GitHub] [hudi] danny0405 merged pull request #9491: [HUDI-6732] Allow wildcards from Spark-SQL entrypoints for drop parti…
danny0405 merged PR #9491: URL: https://github.com/apache/hudi/pull/9491
[GitHub] [hudi] hudi-bot commented on pull request #9595: [MINOR] Catch EntityNotFoundException correctly
hudi-bot commented on PR #9595: URL: https://github.com/apache/hudi/pull/9595#issuecomment-1702170861 ## CI report: * 0cf80bdf054737a6f13bccc8250ce1b3686a0e8b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19601)
[GitHub] [hudi] hudi-bot commented on pull request #9592: automatically create a database when using the flink catalog dfs mode
hudi-bot commented on PR #9592: URL: https://github.com/apache/hudi/pull/9592#issuecomment-1702170818 ## CI report: * c961be19038e5600f418ef660b7ede740cef76c6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19581) * 702653a08249790e738497e49ddc9970613e2343 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19600)
[GitHub] [hudi] empcl commented on a diff in pull request #9592: automatically create a database when using the flink catalog dfs mode
empcl commented on code in PR #9592: URL: https://github.com/apache/hudi/pull/9592#discussion_r1312545914 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java: ## Review Comment: @danny0405 Hello, in the current test cases we should not manually create the catalog+db path; instead, the db directory should be created by calling the open() method.
[GitHub] [hudi] aajisaka commented on a diff in pull request #9577: [HUDI-6805] Print detailed error message in clustering
aajisaka commented on code in PR #9577: URL: https://github.com/apache/hudi/pull/9577#discussion_r1312545323 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java: ## @@ -241,6 +242,9 @@ public WriteStatus close() throws IOException { stat.setTotalWriteBytes(fileSizeInBytes); stat.setFileSizeInBytes(fileSizeInBytes); stat.setTotalWriteErrors(writeStatus.getTotalErrorRecords()); +for (Pair pair : writeStatus.getFailedRecords()) { + LOG.error("Failed to write {}", pair.getLeft(), pair.getRight()); +} Review Comment: There's little possibility of that, as Hudi doesn't store all the exceptions in `writeStatus.getFailedRecords()`. By default, 10% of the errors are stored, and the percentage is configurable via `hoodie.memory.writestatus.failure.fraction`. Note that the first error is always stored.
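The sampling behavior described in the comment above can be sketched as follows. This is a hypothetical stand-in, not Hudi's actual `WriteStatus` class; the comparison operator and seeding are assumptions made for the sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for Hudi's WriteStatus error tracking, illustrating
// the sampling described above: every failure is counted, but only the first
// error plus roughly a configurable fraction of the rest keep their details.
public class SampledFailureLog {

  private final double failureFraction;
  private final Random random;
  private final List<String> storedErrors = new ArrayList<>();
  private long totalErrorRecords = 0;

  public SampledFailureLog(double failureFraction, long seed) {
    this.failureFraction = failureFraction;
    this.random = new Random(seed); // seeded for reproducibility in this sketch
  }

  public void markFailure(String recordKey) {
    totalErrorRecords++;
    // the first error is always stored; later ones are sampled
    if (storedErrors.isEmpty() || random.nextDouble() < failureFraction) {
      storedErrors.add(recordKey);
    }
  }

  public long getTotalErrorRecords() { return totalErrorRecords; }

  public List<String> getStoredErrors() { return storedErrors; }

  // Convenience for demonstration: how many errors survive after n failures.
  public static int storedAfter(int n, double fraction) {
    SampledFailureLog log = new SampledFailureLog(fraction, 42L);
    for (int i = 0; i < n; i++) {
      log.markFailure("key-" + i);
    }
    return log.getStoredErrors().size();
  }
}
```

With a fraction of 0.1, logging only `getFailedRecords()` therefore surfaces a sample of the failures, which is the trade-off the reviewer is pointing at.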
[GitHub] [hudi] hudi-bot commented on pull request #9595: [MINOR] Catch EntityNotFoundException correctly
hudi-bot commented on PR #9595: URL: https://github.com/apache/hudi/pull/9595#issuecomment-1702164954 ## CI report: * 0cf80bdf054737a6f13bccc8250ce1b3686a0e8b UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9592: automatically create a database when using the flink catalog dfs mode
hudi-bot commented on PR #9592: URL: https://github.com/apache/hudi/pull/9592#issuecomment-1702164906 ## CI report: * c961be19038e5600f418ef660b7ede740cef76c6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19581) * 702653a08249790e738497e49ddc9970613e2343 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9585: [HUDI-6809] Optimizing the judgment of generating clustering plans
hudi-bot commented on PR #9585: URL: https://github.com/apache/hudi/pull/9585#issuecomment-1702164849 ## CI report: * 67e18f40f585f17a96068ca4737a0dd7d800354e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19593)
[GitHub] [hudi] hudi-bot commented on pull request #9594: [HUDI-6742] Remove the log file appending for multiple instants
hudi-bot commented on PR #9594: URL: https://github.com/apache/hudi/pull/9594#issuecomment-1702158693 ## CI report: * ac71c9982c1d47e3df2332671d1981d1bee51ab7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19599)
[GitHub] [hudi] CTTY opened a new pull request, #9595: [MINOR] Catch EntityNotFoundException correctly
CTTY opened a new pull request, #9595: URL: https://github.com/apache/hudi/pull/9595 ### Change Logs When a table/database is not found while syncing a table to Glue, Glue returns `EntityNotFoundException`. After upgrading to AWS SDK V2, Hudi uses `GlueAsyncClient` to get a `CompletableFuture`, which throws an `ExecutionException` with the `EntityNotFoundException` nested inside when the table/database doesn't exist. However, the existing Hudi code doesn't handle `ExecutionException` and fails the job. Sample exception:
```
org.apache.hudi.exception.HoodieMetaSyncException: Could not sync using the meta sync class org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
	at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:81)
	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:959)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:957)
	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1055)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:409)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:554)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:107)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:554)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:530)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:97)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:84)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.s
```
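The pattern this PR describes, an exception wrapped inside a `CompletableFuture`'s `ExecutionException`, can be sketched in a self-contained way. The `EntityNotFoundException` below is a stand-in class, not the AWS SDK v2 type, and the future is completed exceptionally by hand to simulate the async Glue call:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class CauseUnwrapper {

  // Stand-in for the Glue "not found" exception (not the real AWS SDK type).
  static class EntityNotFoundException extends RuntimeException {
    EntityNotFoundException(String msg) { super(msg); }
  }

  // Walks the cause chain looking for a specific exception type.
  public static <T extends Throwable> boolean hasCause(Throwable t, Class<T> type) {
    for (Throwable cur = t; cur != null; cur = cur.getCause()) {
      if (type.isInstance(cur)) {
        return true;
      }
    }
    return false;
  }

  // Simulates calling .get() on an async client's future for a missing table:
  // the future completes exceptionally and get() throws ExecutionException.
  public static boolean tableExists(String table) {
    CompletableFuture<Void> future = new CompletableFuture<>();
    future.completeExceptionally(new EntityNotFoundException("Table " + table + " not found"));
    try {
      future.get();
      return true;
    } catch (ExecutionException e) {
      if (hasCause(e, EntityNotFoundException.class)) {
        return false; // treat "not found" as absence, not as a job failure
      }
      throw new RuntimeException(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
  }
}
```

Inspecting the cause chain rather than catching the Glue exception directly is what lets the sync code distinguish "table missing" from a genuine failure.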
[GitHub] [hudi] voonhous commented on pull request #9491: [HUDI-6732] Allow wildcards from Spark-SQL entrypoints for drop parti…
voonhous commented on PR #9491: URL: https://github.com/apache/hudi/pull/9491#issuecomment-1702138452 @danny0405 Gentle reminder, CI is green.
[GitHub] [hudi] hudi-bot commented on pull request #9558: [HUDI-6481] Support run multi tables services in a single spark job
hudi-bot commented on PR #9558: URL: https://github.com/apache/hudi/pull/9558#issuecomment-1702134207 ## CI report: * d0a5621c43699e3cd636c99ef6cc048788f04459 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19573) * 1640805e55e219b1c512bde9650849613c03e0b9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19598)
[GitHub] [hudi] danny0405 commented on a diff in pull request #9590: [HUDI-6780] Introduce enums instead of classnames in table properties
danny0405 commented on code in PR #9590: URL: https://github.com/apache/hudi/pull/9590#discussion_r1312518092 ## hudi-common/src/main/java/org/apache/hudi/common/model/RecordPayloadType.java: ## @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.common.model; + +import org.apache.hudi.common.config.EnumDescription; +import org.apache.hudi.common.config.EnumFieldDescription; +import org.apache.hudi.common.model.debezium.MySqlDebeziumAvroPayload; +import org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload; + +/** + * Payload to use for record. + */ +@EnumDescription("Payload to use for merging records") +public enum RecordPayloadType { + @EnumFieldDescription("Provides support for seamlessly applying changes captured via Amazon Database Migration Service onto S3.") + AWS_DMS_AVRO(AWSDmsAvroPayload.class.getName()), + + @EnumFieldDescription("Honors ordering field in both preCombine and combineAndGetUpdateValue.") + HOODIE_AVRO_DEFAULT(DefaultHoodieRecordPayload.class.getName()), Review Comment: Are these options expected to be used by users? 
Then there might be an inconsistency between the table config and the write config; for the write config, do we still prefer the class name?
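For illustration, the enum-plus-class-name pattern under review, together with a resolution rule that would address the consistency concern raised in the comment, might look like the following sketch. `RecordPayloadTypeSketch` and its `resolveClassName` method are hypothetical, not the PR's actual code; only the two class names are taken from the diff above:

```java
// Sketch: table properties store the enum name, while write configs may still
// carry a fully-qualified class name; one resolver accepts either form.
public enum RecordPayloadTypeSketch {
  AWS_DMS_AVRO("org.apache.hudi.common.model.AWSDmsAvroPayload"),
  HOODIE_AVRO_DEFAULT("org.apache.hudi.common.model.DefaultHoodieRecordPayload");

  private final String className;

  RecordPayloadTypeSketch(String className) { this.className = className; }

  public String getClassName() { return className; }

  // Accepts either the enum name (table config) or the class name (write config).
  public static String resolveClassName(String value) {
    for (RecordPayloadTypeSketch t : values()) {
      if (t.name().equals(value) || t.className.equals(value)) {
        return t.className;
      }
    }
    return value; // unknown custom payload: assume it is already a class name
  }
}
```

Falling through to the raw value keeps user-supplied custom payload classes working even though they have no enum constant.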
[GitHub] [hudi] hudi-bot commented on pull request #9594: [HUDI-6742] Remove the log file appending for multiple instants
hudi-bot commented on PR #9594: URL: https://github.com/apache/hudi/pull/9594#issuecomment-1702129341 ## CI report: * ac71c9982c1d47e3df2332671d1981d1bee51ab7 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9558: [HUDI-6481] Support run multi tables services in a single spark job
hudi-bot commented on PR #9558: URL: https://github.com/apache/hudi/pull/9558#issuecomment-1702129218 ## CI report: * d0a5621c43699e3cd636c99ef6cc048788f04459 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19573) * 1640805e55e219b1c512bde9650849613c03e0b9 UNKNOWN
[GitHub] [hudi] danny0405 commented on pull request #9515: [HUDI-2141] Support flink compaction metrics
danny0405 commented on PR #9515: URL: https://github.com/apache/hudi/pull/9515#issuecomment-1702127608 > Tested locally and only these four metrics are useless. Remove them until we support coordinator metrics. @danny0405 What do you think? +1
[GitHub] [hudi] danny0405 closed pull request #9475: [HUDI-6766] Fixing mysql debezium data loss
danny0405 closed pull request #9475: [HUDI-6766] Fixing mysql debezium data loss URL: https://github.com/apache/hudi/pull/9475
[GitHub] [hudi] danny0405 commented on issue #9587: [SUPPORT] hoodie.datasource.write.keygenerator.class config not work in bulk_insert mode
danny0405 commented on issue #9587: URL: https://github.com/apache/hudi/issues/9587#issuecomment-1702126617 > Maybe I can fix it by making the simple key generator support multiple partition keys Makes sense to me.
[GitHub] [hudi] punish-yh commented on issue #9587: [SUPPORT] hoodie.datasource.write.keygenerator.class config not work in bulk_insert mode
punish-yh commented on issue #9587: URL: https://github.com/apache/hudi/issues/9587#issuecomment-1702123005 > You are right, because you only have one primary key field: `eid`, maybe you should set up the spark key generator as simple. Thank you for your reply. I set `hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleAvroKeyGenerator` and ran the job again. The bulk_insert job finished successfully, but in upsert mode the records were written to the `__HIVE_DEFAULT_PARTITION__` partition, because I configured the `_db` and `_table` fields as partition fields. The simple key generator does not split the partition field, so the fields do not match in the getPartitionPath function and it returns `__HIVE_DEFAULT_PARTITION__`. ![image](https://github.com/apache/hudi/assets/59658062/af0dfaff-3cc6-4758-b315-c3aaedfe0b14) ![image](https://github.com/apache/hudi/assets/59658062/fdc6590d-6c56-4d08-9a44-6725e3b48742) ![image](https://github.com/apache/hudi/assets/59658062/998a196e-6f81-4b82-aef8-0c440b7af297) For now I can use a custom key generator to work around the problem, but I would like to ask whether this aligns with the simple key generator's initial design. Maybe I can fix it by making the simple key generator support multiple partition keys.
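The mismatch described in the comment above can be illustrated with a sketch. These are hypothetical methods, not Hudi's actual `SimpleKeyGenerator`/`ComplexKeyGenerator`: a simple-style generator treats the whole partition config string as one field name, so a comma-separated value like `_db,_table` matches no real column and falls back to the default partition, while a complex-style generator splits it:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class PartitionPathSketch {

  public static final String DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__";

  // Simple-style: the whole config string is treated as a single field name,
  // so "_db,_table" looks up a non-existent column and falls back.
  public static String simplePartitionPath(Map<String, String> record, String partitionField) {
    return record.getOrDefault(partitionField, DEFAULT_PARTITION);
  }

  // Complex-style: split on commas and join each field's value with "/".
  public static String complexPartitionPath(Map<String, String> record, String partitionFields) {
    return Arrays.stream(partitionFields.split(","))
        .map(f -> record.getOrDefault(f, DEFAULT_PARTITION))
        .collect(Collectors.joining("/"));
  }
}
```

With a record `{_db=db1, _table=t1}` and the config string `_db,_table`, the simple-style lookup yields `__HIVE_DEFAULT_PARTITION__` while the complex-style one yields `db1/t1`, which mirrors the behavior reported in the issue.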
[GitHub] [hudi] danny0405 opened a new pull request, #9594: [HUDI-6742] Remove the log file appending for multiple instants
danny0405 opened a new pull request, #9594: URL: https://github.com/apache/hudi/pull/9594 ### Change Logs Remove the log file appending totally to simplify the log file rollback and exception handling for reader. ### Impact none ### Risk level (write none, low medium or high below) none ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] twlo-sandeep commented on pull request #9475: [HUDI-6766] Fixing mysql debezium data loss
twlo-sandeep commented on PR #9475: URL: https://github.com/apache/hudi/pull/9475#issuecomment-1702114672 > There are test failures in Travis. @danny0405 I don't see any failed tests in either of the failed suites. It looks like a timeout after running for 5hr+. Can you trigger a rerun of the tests?
[GitHub] [hudi] stream2000 commented on a diff in pull request #9558: [HUDI-6481] Support run multi tables services in a single spark job
stream2000 commented on code in PR #9558: URL: https://github.com/apache/hudi/pull/9558#discussion_r1312487436 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/multitable/MultiTableServiceUtils.java: ## @@ -0,0 +1,167 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.utilities.multitable; + +import org.apache.hudi.client.common.HoodieSparkEngineContext; +import org.apache.hudi.common.config.SerializableConfiguration; +import org.apache.hudi.common.config.TypedProperties; +import org.apache.hudi.common.util.collection.Pair; +import org.apache.hudi.exception.HoodieException; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.spark.api.java.JavaSparkContext; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import java.util.concurrent.CopyOnWriteArrayList; +import java.util.stream.Collectors; + +import static org.apache.hudi.common.table.HoodieTableMetaClient.METAFOLDER_NAME; + +/** + * Utils for executing multi-table services + */ +public class MultiTableServiceUtils { + + public static class Constants { +public static final String TABLES_TO_BE_SERVED_PROP = "hoodie.tableservice.tablesToServe"; + +public static final String COMMA_SEPARATOR = ","; + +private static final int DEFAULT_LISTING_PARALLELISM = 1500; + } + + public static List<String> getTablesToBeServedFromProps(TypedProperties properties) { +String combinedTablesString = properties.getString(Constants.TABLES_TO_BE_SERVED_PROP); +if (combinedTablesString == null) { + return new ArrayList<>(); +} +String[] tablesArray = combinedTablesString.split(Constants.COMMA_SEPARATOR); +return Arrays.asList(tablesArray); + } + + public static List<String> findHoodieTablesUnderPath(JavaSparkContext jsc, String pathStr) { +Path rootPath = new Path(pathStr); +SerializableConfiguration conf = new SerializableConfiguration(jsc.hadoopConfiguration()); +if (isHoodieTable(rootPath, conf.get())) { + return Collections.singletonList(pathStr); +} + +HoodieSparkEngineContext engineContext = new HoodieSparkEngineContext(jsc); +List<String> hoodieTablePaths = new CopyOnWriteArrayList<>(); +List<Path> pathsToList = new CopyOnWriteArrayList<>(); +pathsToList.add(rootPath); +int listingParallelism = Math.min(Constants.DEFAULT_LISTING_PARALLELISM, pathsToList.size()); + +while (!pathsToList.isEmpty()) { + // List all directories in parallel + List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> { +FileSystem fileSystem = path.getFileSystem(conf.get()); +return fileSystem.listStatus(path); + }, listingParallelism); + pathsToList.clear(); + + // if the current directory contains the meta folder (.hoodie), add it to the result. Otherwise, add it to the queue + List<FileStatus> dirs = dirToFileListing.stream().flatMap(Arrays::stream) + .filter(FileStatus::isDirectory) + .collect(Collectors.toList()); + + if (!dirs.isEmpty()) { +List> dirResults = engineContext.map(dirs, fileStatus -> { + if (isHoodieTable(fileStatus.getPath(), conf.get())) { Review Comment: Nice catch~ Using hard-coded magic numbers is not good design; I have updated them to meaningful enum constants. Updated to: ```java /** * Type of directories when searching hoodie tables under path */ enum DirType { HOODIE_TABLE, // previous 0 NORMAL_DIR, // previous 1 META_FOLDER // previous 2 } ``` ## hudi-utilities/src/main/java/org/apache/hudi/utilities/multitable/HoodieMultiTableServicesMain.java: ## @@ -0,0 +1,255 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except
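The discovery loop in the quoted `findHoodieTablesUnderPath` can be sketched as a plain single-process breadth-first search (the real code parallelizes each listing round with Spark). This is an illustrative sketch only, using `java.nio` instead of Hadoop's `FileSystem`: a directory is treated as a Hudi table iff it contains a `.hoodie` metafolder, and otherwise its subdirectories are queued for the next listing round.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.stream.Stream;

// Illustrative, single-process version of the table-discovery BFS above.
public class HoodieTableFinderSketch {
  static final String METAFOLDER_NAME = ".hoodie";

  public static List<Path> findHoodieTables(Path root) {
    List<Path> tables = new ArrayList<>();
    Deque<Path> toList = new ArrayDeque<>();
    toList.add(root);
    while (!toList.isEmpty()) {
      Path dir = toList.poll();
      if (Files.isDirectory(dir.resolve(METAFOLDER_NAME))) {
        tables.add(dir); // found a table; do not descend further
        continue;
      }
      try (Stream<Path> children = Files.list(dir)) {
        // queue subdirectories for the next listing round
        children.filter(Files::isDirectory).forEach(toList::add);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return tables;
  }
}
```

Stopping the descent once a `.hoodie` folder is found mirrors the production code's behavior of not searching for tables nested inside other tables.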
[GitHub] [hudi] hudi-bot commented on pull request #9553: [HUDI-1517][HUDI-6758][HUDI-6761] Adding support for per-logfile marker to track all log files added by a commit and to assist with rollbacks
hudi-bot commented on PR #9553: URL: https://github.com/apache/hudi/pull/9553#issuecomment-1702097156 ## CI report: * aeac327c3cad812fea5e2bc01c07c1314bbf1838 UNKNOWN * 2554ca28ddffba3e8ffb64db090daf85ffae187b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19555) * 835ac846b8de9a27eac4a1e2e3eb27fbdf55c9dd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19596) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9515: [HUDI-2141] Support flink compaction metrics
hudi-bot commented on PR #9515: URL: https://github.com/apache/hudi/pull/9515#issuecomment-1702097038 ## CI report: * 33ea8bad45355a5cfb69955f372f0e3a87540aae Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19466) * a11cc23103021a2916d2759bead59b61a80e50f7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19597)
[GitHub] [hudi] hudi-bot commented on pull request #9553: [HUDI-1517][HUDI-6758][HUDI-6761] Adding support for per-logfile marker to track all log files added by a commit and to assist with rollbacks
hudi-bot commented on PR #9553: URL: https://github.com/apache/hudi/pull/9553#issuecomment-1702091723 ## CI report: * aeac327c3cad812fea5e2bc01c07c1314bbf1838 UNKNOWN * 2554ca28ddffba3e8ffb64db090daf85ffae187b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19555) * 835ac846b8de9a27eac4a1e2e3eb27fbdf55c9dd UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9515: [HUDI-2141] Support flink compaction metrics
hudi-bot commented on PR #9515: URL: https://github.com/apache/hudi/pull/9515#issuecomment-1702091623 ## CI report: * 33ea8bad45355a5cfb69955f372f0e3a87540aae Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19466) * a11cc23103021a2916d2759bead59b61a80e50f7 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9584: [HUDI-6808] SkipCompaction Config should not affect the stream read of the cow table
hudi-bot commented on PR #9584: URL: https://github.com/apache/hudi/pull/9584#issuecomment-1702078216 ## CI report: * cd3a969fbe188f1bcf77047d898d5d05e3566caa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19580) * cba1cba13bbd6ae0fcd237c1bedbc99a626909f3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19594)
[hudi] branch master updated: [HUDI-6579] Fix streaming write when meta cols dropped (#9589)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 1450b1b04f7 [HUDI-6579] Fix streaming write when meta cols dropped (#9589) 1450b1b04f7 is described below commit 1450b1b04f7feef4e49dabdac3fb062e04a90c58 Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com> AuthorDate: Thu Aug 31 21:57:11 2023 -0500 [HUDI-6579] Fix streaming write when meta cols dropped (#9589) --- .../main/scala/org/apache/hudi/DefaultSource.scala | 36 +++--- .../org/apache/hudi/HoodieCreateRecordUtils.scala | 11 +++ .../org/apache/hudi/HoodieSparkSqlWriter.scala | 14 - 3 files changed, 29 insertions(+), 32 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala index 5a0b0a53d33..f982fb1e1c3 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala @@ -19,17 +19,17 @@ package org.apache.hudi import org.apache.hadoop.fs.Path import org.apache.hudi.DataSourceReadOptions._ -import org.apache.hudi.DataSourceWriteOptions.{BOOTSTRAP_OPERATION_OPT_VAL, OPERATION, RECORDKEY_FIELD, SPARK_SQL_WRITES_PREPPED_KEY, STREAMING_CHECKPOINT_IDENTIFIER} +import org.apache.hudi.DataSourceWriteOptions.{BOOTSTRAP_OPERATION_OPT_VAL, OPERATION, STREAMING_CHECKPOINT_IDENTIFIER} import org.apache.hudi.cdc.CDCRelation import org.apache.hudi.common.fs.FSUtils import org.apache.hudi.common.model.HoodieTableType.{COPY_ON_WRITE, MERGE_ON_READ} -import org.apache.hudi.common.model.{HoodieRecord, WriteConcurrencyMode} +import org.apache.hudi.common.model.WriteConcurrencyMode import org.apache.hudi.common.table.timeline.HoodieInstant import 
org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver} import org.apache.hudi.common.util.ConfigUtils import org.apache.hudi.common.util.ValidationUtils.checkState import org.apache.hudi.config.HoodieBootstrapConfig.DATA_QUERIES_ONLY -import org.apache.hudi.config.HoodieWriteConfig.{SPARK_SQL_MERGE_INTO_PREPPED_KEY, WRITE_CONCURRENCY_MODE} +import org.apache.hudi.config.HoodieWriteConfig.WRITE_CONCURRENCY_MODE import org.apache.hudi.exception.HoodieException import org.apache.hudi.util.PathUtils import org.apache.spark.sql.execution.streaming.{Sink, Source} @@ -124,21 +124,21 @@ class DefaultSource extends RelationProvider } /** -* This DataSource API is used for writing the DataFrame at the destination. For now, we are returning a dummy -* relation here because Spark does not really make use of the relation returned, and just returns an empty -* dataset at [[org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run()]]. This saves us the cost -* of creating and returning a parquet relation here. -* -* TODO: Revisit to return a concrete relation here when we support CREATE TABLE AS for Hudi with DataSource API. -* That is the only case where Spark seems to actually need a relation to be returned here -* [[org.apache.spark.sql.execution.datasources.DataSource.writeAndRead()]] -* -* @param sqlContext Spark SQL Context -* @param mode Mode for saving the DataFrame at the destination -* @param optParams Parameters passed as part of the DataFrame write operation -* @param rawDf Spark DataFrame to be written -* @return Spark Relation -*/ + * This DataSource API is used for writing the DataFrame at the destination. For now, we are returning a dummy + * relation here because Spark does not really make use of the relation returned, and just returns an empty + * dataset at [[org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run()]]. This saves us the cost + * of creating and returning a parquet relation here. 
+ * + * TODO: Revisit to return a concrete relation here when we support CREATE TABLE AS for Hudi with DataSource API. + * That is the only case where Spark seems to actually need a relation to be returned here + * [[org.apache.spark.sql.execution.datasources.DataSource.writeAndRead()]] + * + * @param sqlContext Spark SQL Context + * @param mode Mode for saving the DataFrame at the destination + * @param optParams Parameters passed as part of the DataFrame write operation + * @param df Spark DataFrame to be written + * @return Spark Relation + */ override def createRelation(sqlContext: SQLContext, mode: SaveMode, optParams: Map[String, String], diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apac
[GitHub] [hudi] xushiyan merged pull request #9589: [HUDI-6579] Fix streaming write when meta cols dropped
xushiyan merged PR #9589: URL: https://github.com/apache/hudi/pull/9589
[GitHub] [hudi] stream2000 commented on pull request #9515: [HUDI-2141] Support flink compaction metrics
stream2000 commented on PR #9515: URL: https://github.com/apache/hudi/pull/9515#issuecomment-1702051289 ![image](https://github.com/apache/hudi/assets/39240496/56dcc6ee-4045-4f52-acb2-1a5883a9f772) I tested locally, and only these four metrics are not useful. Let's remove them until we support coordinator metrics. @danny0405 What do you think?
[GitHub] [hudi] beyond1920 commented on a diff in pull request #7907: [HUDI-6495][RFC-66] Non-blocking multi writer support
beyond1920 commented on code in PR #7907: URL: https://github.com/apache/hudi/pull/7907#discussion_r1312463342 ## rfc/rfc-66/rfc-66.md: ## @@ -0,0 +1,124 @@ +# RFC-66: Lockless Multi Writer + +## Proposers +- @danny0405 +- @ForwardXu +- @SteNicholas + +## Approvers +- + +## Status + +JIRA: [Lockless multi writer support](https://issues.apache.org/jira/browse/HUDI-5672) + +## Abstract +As you know, Hudi already supports basic OCC with abundant lock providers. +But for multiple streaming ingestion writers, OCC does not work well because conflicts happen at very high frequency. +To expand on it a little bit: with hashing index, all the writers share a deterministic hashing algorithm for distributing the records by primary keys, +all the keys are evenly distributed over all the data buckets, and for a single data flush in one writer, almost all the data buckets are appended with new inputs, +so conflicts would very likely happen for multi-writer because almost all the data buckets are being written by multiple writers at the same time; +For bloom filter index, things are different, but remember that we have a small file load rebalance strategy to write into the **small** buckets with higher priority, +that means multiple writers are prone to write into the same **small** buckets at the same time, and that's how conflicts happen. + +In general, for multiple streaming writers, OCC is not very feasible in production; in this RFC, we propose a non-blocking solution for streaming ingestion. + +## Background + +Streaming jobs are naturally suitable for data ingestion: they have no pipeline orchestration complexity and a smoother write workload. +Most of the raw data sets we handle today are generated continuously in a streaming way. + +Based on that, many requests for multiple writers' ingestion are derived.
With multi-writer ingestion, several streaming events with the same schema can be drained into one Hudi table, +and the Hudi table kind of becomes a UNION table view over all the input data sets. This is a very common use case because in reality, the data sets are usually scattered all over the data sources. + +Another very useful use case we want to unlock is the real-time data set join. One of the biggest pain points in streaming computation is the dataset join: +engines like Flink have basic support for all kinds of SQL JOINs, but they store the input records within their inner state-backend, which is a huge cost for a pure data join with no additional computations. +In [HUDI-3304](https://issues.apache.org/jira/browse/HUDI-3304), we introduced a `PartialUpdateAvroPayload`; in combination with the lockless multi-writer, +we can implement N-way data source joins in real time! Hudi would take care of the payload join during the compaction service procedure. + +## Design + +### The Precondition + + MOR Table Type Is Required + +The table type must be `MERGE_ON_READ`, so that we can defer the conflict resolution to the compaction phase. The compaction service would resolve the conflicts of the same keys by respecting the event time sequence of the events. + + Deterministic Bucketing Strategy + +A deterministic bucketing strategy is required, because the same record keys from different writers need to be distributed into the same bucket, not only for UPSERTs, but also for all the new INSERTs. + + Lazy Cleaning Strategy + +Configure the cleaning strategy as lazy so that the pending instants are not rolled back by the other active writers.
+ +### Basic Work Flow + + Writing Log Files Separately In Sequence + +Basically, each writer flushes the log files in sequence, and the log file rolls over to a different versioning number; +a pivotal thing to note here is that we need to make the write_token unique for log files of the same version with the same base instant time, +so that the file names do not conflict between the writers. + +The log files generated by a single writer can still preserve the sequence by versioning number, which is important if the natural order is needed for single-writer events. + +![multi-writer](multi_writer.png) + +### The Compaction Procedure + +The compaction service is the component that actually resolves the conflicts. Within a file group, it sorts the files and then merges all the record payloads for a record key. +The event time sequence is respected by combining the payloads using the event time field provided by the payload (known as the `preCombine` field in Hudi). + +![compaction procedure](compaction.png) + + Non-Serial Compaction Plan Schedule +Currently, the compaction plan scheduling must be in serial order with the writers; that means, while scheduling the compaction plan, no ongoing writers should be writing to +the table. This restriction makes compaction almost impossible for multiple streaming writers because there is always an instant writing to the table for streaming ingestion. + +In order to unblock the compaction
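The deterministic bucketing precondition in the RFC text above can be illustrated with a tiny sketch (illustrative only, not Hudi's exact hashing): as long as every writer applies the same hash function and bucket count, a record key always maps to the same bucket, so concurrent writers land their versions of a key in the same file group, and the compaction service later merges them by event time.

```java
// Illustrative sketch of deterministic bucketing, not Hudi's exact implementation.
public class BucketIdSketch {
  // Any writer computing this with the same numBuckets gets the same bucket
  // for the same record key, with no coordination between writers.
  public static int bucketId(String recordKey, int numBuckets) {
    // mask the sign bit so the modulo result is always non-negative
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }
}
```

This is why the RFC can defer conflict resolution to compaction: the colliding versions of a key are guaranteed to be co-located in one file group rather than spread over arbitrary files.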
[jira] [Updated] (HUDI-6702) Extend merge API to support all merging operations
[ https://issues.apache.org/jira/browse/HUDI-6702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6702: Reviewers: Ethan Guo > Extend merge API to support all merging operations > -- > > Key: HUDI-6702 > URL: https://issues.apache.org/jira/browse/HUDI-6702 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Sagar Sumit >Assignee: Lin Liu >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > > See this issue for more details- [https://github.com/apache/hudi/issues/9430] > We may have to introduce a new API or figure out a way for the current merger > to skip empty records. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6784) Support custom logic for deletion
[ https://issues.apache.org/jira/browse/HUDI-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6784: Reviewers: Ethan Guo > Support custom logic for deletion > - > > Key: HUDI-6784 > URL: https://issues.apache.org/jira/browse/HUDI-6784 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > Add `Optional<>` for newer parameter in merger. If newer is empty, then it > means this is a deletion operation.
[GitHub] [hudi] hudi-bot commented on pull request #9585: [HUDI-6809] Optimizing the judgment of generating clustering plans
hudi-bot commented on PR #9585: URL: https://github.com/apache/hudi/pull/9585#issuecomment-1702018623 ## CI report: * 9a2675de94095d2baac571a6dd71ec368b8a9e8c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19582) * 67e18f40f585f17a96068ca4737a0dd7d800354e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19593)
[GitHub] [hudi] hudi-bot commented on pull request #9584: [HUDI-6808] SkipCompaction Config should not affect the stream read of the cow table
hudi-bot commented on PR #9584: URL: https://github.com/apache/hudi/pull/9584#issuecomment-1702018596 ## CI report: * cd3a969fbe188f1bcf77047d898d5d05e3566caa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19580) * cba1cba13bbd6ae0fcd237c1bedbc99a626909f3 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9571: Enabling comprehensive schema evolution in delta streamer code
hudi-bot commented on PR #9571: URL: https://github.com/apache/hudi/pull/9571#issuecomment-1702018490 ## CI report: * 871ff24da9c3800b8f19bdabda140621549aaf3b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19588)
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9589: [HUDI-6579] Fix streaming write when meta cols dropped
nsivabalan commented on code in PR #9589: URL: https://github.com/apache/hudi/pull/9589#discussion_r1312461061 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCreateRecordUtils.scala: ## @@ -98,7 +95,7 @@ object HoodieCreateRecordUtils { } } // we can skip key generator for prepped flow -val usePreppedInsteadOfKeyGen = preppedSparkSqlWrites && preppedWriteOperation +val usePreppedInsteadOfKeyGen = preppedSparkSqlWrites || preppedWriteOperation Review Comment: Yes, this looks good.
[jira] [Assigned] (HUDI-6785) Introduce an engine-agnostic FileGroupReader for snapshot read
[ https://issues.apache.org/jira/browse/HUDI-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6785: --- Assignee: Ethan Guo > Introduce an engine-agnostic FileGroupReader for snapshot read > -- > > Key: HUDI-6785 > URL: https://issues.apache.org/jira/browse/HUDI-6785 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 1.0.0 > >
[GitHub] [hudi] hudi-bot commented on pull request #9585: [HUDI-6809] Optimizing the judgment of generating clustering plans
hudi-bot commented on PR #9585: URL: https://github.com/apache/hudi/pull/9585#issuecomment-1702012580 ## CI report: * 9a2675de94095d2baac571a6dd71ec368b8a9e8c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19582) * 67e18f40f585f17a96068ca4737a0dd7d800354e UNKNOWN
[GitHub] [hudi] zhuanshenbsj1 commented on a diff in pull request #9584: [HUDI-6808] SkipCompaction Config should not affect the stream read of the cow table
zhuanshenbsj1 commented on code in PR #9584: URL: https://github.com/apache/hudi/pull/9584#discussion_r1312456055 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java: ## @@ -601,9 +602,9 @@ public List filterInstantsWithRange( * @return the filtered timeline */ @VisibleForTesting - public HoodieTimeline filterInstantsAsPerUserConfigs(HoodieTimeline timeline) { + public HoodieTimeline filterInstantsAsPerUserConfigs(HoodieTimeline timeline, HoodieTableType tableType) { final HoodieTimeline oriTimeline = timeline; -if (this.skipCompaction) { +if (OptionsResolver.isMorTable(this.conf) & this.skipCompaction) { Review Comment: Removed the parameter HoodieTableType.
[jira] [Updated] (HUDI-6784) Support custom logic for deletion
[ https://issues.apache.org/jira/browse/HUDI-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-6784: -- Status: Patch Available (was: In Progress) > Support custom logic for deletion > - > > Key: HUDI-6784 > URL: https://issues.apache.org/jira/browse/HUDI-6784 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > Add `Optional<>` for newer parameter in merger. If newer is empty, then it > means this is a deletion operation.
[jira] [Updated] (HUDI-6702) Extend merge API to support all merging operations
[ https://issues.apache.org/jira/browse/HUDI-6702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-6702: -- Status: Patch Available (was: In Progress) > Extend merge API to support all merging operations > -- > > Key: HUDI-6702 > URL: https://issues.apache.org/jira/browse/HUDI-6702 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Sagar Sumit >Assignee: Lin Liu >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > > See this issue for more details- [https://github.com/apache/hudi/issues/9430] > We may have to introduce a new API or figure out a way for the current merger > to skip empty records.
[jira] [Closed] (HUDI-6779) Audit current hoodie.properties
[ https://issues.apache.org/jira/browse/HUDI-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit closed HUDI-6779. - Resolution: Done > Audit current hoodie.properties > --- > > Key: HUDI-6779 > URL: https://issues.apache.org/jira/browse/HUDI-6779 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Fix For: 1.0.0 > > > Remove some configs from table to write configs
[jira] [Updated] (HUDI-6780) Replace classnames by modes/enums in table properties
[ https://issues.apache.org/jira/browse/HUDI-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6780: -- Reviewers: Danny Chen > Replace classnames by modes/enums in table properties > - > > Key: HUDI-6780 > URL: https://issues.apache.org/jira/browse/HUDI-6780 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > >
[jira] [Updated] (HUDI-6780) Replace classnames by modes/enums in table properties
[ https://issues.apache.org/jira/browse/HUDI-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6780: -- Status: Patch Available (was: In Progress) > Replace classnames by modes/enums in table properties > - > > Key: HUDI-6780 > URL: https://issues.apache.org/jira/browse/HUDI-6780 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > >
[jira] [Updated] (HUDI-6779) Audit current hoodie.properties
[ https://issues.apache.org/jira/browse/HUDI-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6779: -- Status: Patch Available (was: In Progress) > Audit current hoodie.properties > --- > > Key: HUDI-6779 > URL: https://issues.apache.org/jira/browse/HUDI-6779 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Fix For: 1.0.0 > > > Remove some configs from table to write configs
[GitHub] [hudi] linliu-code commented on pull request #9593: [HUDI-6784][RFC-46] Support deletion logic in merger
linliu-code commented on PR #9593: URL: https://github.com/apache/hudi/pull/9593#issuecomment-1702005828 @yihua @danny0405
[GitHub] [hudi] linliu-code opened a new pull request, #9593: [HUDI-6784][RFC-46] Support deletion logic in merger
linliu-code opened a new pull request, #9593: URL: https://github.com/apache/hudi/pull/9593 ### Change Logs The solution is to add an Option wrapper for the older and newer parameters in the merge API. In this way, the update, delete, and combine logic is merged into one API. TESTS: Unit tests are added for existing merger implementations. ### Impact Users can now implement the merge API to support their own deletion logic. Previously, deletion was not supported. ### Risk level (write none, low medium or high below) Low. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
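The Option-wrapped merge signature described in the change log above can be sketched in plain Java. This is an illustrative stand-in using `java.util.Optional` and a plain `String` payload; the class and method names are hypothetical and do not match Hudi's actual merger API:

```java
import java.util.Optional;

// Hypothetical sketch: a merger where an empty "newer" side signals deletion,
// so update, insert, and delete all flow through one method.
class MergeSketch {
    static Optional<String> merge(Optional<String> older, Optional<String> newer) {
        if (!newer.isPresent()) {
            return Optional.empty();  // delete: the merge result is "no record"
        }
        if (!older.isPresent()) {
            return newer;             // insert: nothing to combine with
        }
        // update: custom combine logic would go here; latest-write-wins for the sketch
        return newer;
    }

    public static void main(String[] args) {
        assert !merge(Optional.of("v1"), Optional.empty()).isPresent();        // delete
        assert merge(Optional.empty(), Optional.of("v1")).isPresent();         // insert
        assert merge(Optional.of("v1"), Optional.of("v2")).get().equals("v2"); // update
        System.out.println("ok");
    }
}
```

The design point is that "no result" (deletion) becomes an ordinary return value rather than a special payload, so implementors only have one code path to override.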
[GitHub] [hudi] Zouxxyy closed pull request #9573: [HUDI-6804] Fix hive read schema evolution MOR table
Zouxxyy closed pull request #9573: [HUDI-6804] Fix hive read schema evolution MOR table URL: https://github.com/apache/hudi/pull/9573
[jira] [Updated] (HUDI-6742) Remove the log file appending for multiple instants
[ https://issues.apache.org/jira/browse/HUDI-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6742: - Status: In Progress (was: Open) > Remove the log file appending for multiple instants > --- > > Key: HUDI-6742 > URL: https://issues.apache.org/jira/browse/HUDI-6742 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > >
[jira] [Updated] (HUDI-6725) Support efficient completion time queries on the timeline
[ https://issues.apache.org/jira/browse/HUDI-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6725: - Status: Patch Available (was: In Progress) > Support efficient completion time queries on the timeline > - > > Key: HUDI-6725 > URL: https://issues.apache.org/jira/browse/HUDI-6725 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > The basic idea is that we do an eager loading of the completion time on the archived > timeline, for example, the last 3 days, plus all the completed instants of the > active timeline. > If a query asks about a completion time earlier than that time range, > we just do a lazy lookup on the archived timeline. > > Probably we would write a completion time loader.
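The eager-cache-plus-lazy-fallback idea from the issue above can be sketched in plain Java. All names here are hypothetical illustrations (the real implementation would consult Hudi's active and archived timelines, not a `Map` and a function):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical "completion time loader": completion times for recent instants
// are eagerly cached; older instants fall back to a lazy (expensive) lookup
// against a simulated archived timeline, and are memoized afterwards.
class CompletionTimeLoader {
    private final Map<String, String> eagerCache = new HashMap<>();
    private final Function<String, Optional<String>> archivedLookup;
    int lazyLookups = 0; // exposed so the demo can observe laziness

    CompletionTimeLoader(Map<String, String> recent, Function<String, Optional<String>> archivedLookup) {
        this.eagerCache.putAll(recent);
        this.archivedLookup = archivedLookup;
    }

    Optional<String> completionTime(String instant) {
        String cached = eagerCache.get(instant);
        if (cached != null) {
            return Optional.of(cached);
        }
        lazyLookups++;
        Optional<String> result = archivedLookup.apply(instant);
        result.ifPresent(t -> eagerCache.put(instant, t)); // memoize for next time
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> recent = new HashMap<>();
        recent.put("20230901t1", "20230901t2");
        CompletionTimeLoader loader = new CompletionTimeLoader(recent,
            instant -> instant.equals("20230101t9") ? Optional.of("20230101t9c") : Optional.empty());
        assert loader.completionTime("20230901t1").isPresent(); // served from the eager cache
        assert loader.completionTime("20230101t9").isPresent(); // lazy archived lookup
        loader.completionTime("20230101t9");                    // now memoized
        assert loader.lazyLookups == 1;
        System.out.println("ok");
    }
}
```

The trade-off sketched here is the one the issue describes: bounded eager loading keeps the common case cheap, while rare queries beyond the window pay a one-time lazy cost.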
[GitHub] [hudi] danny0405 commented on a diff in pull request #9592: automatically create a database when using the flink catalog dfs mode
danny0405 commented on code in PR #9592: URL: https://github.com/apache/hudi/pull/9592#discussion_r1312444984 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java: ## @@ -125,6 +125,16 @@ public void open() throws CatalogException { } catch (IOException e) { throw new CatalogException(String.format("Checking catalog path %s exists exception.", catalogPathStr), e); } + +if (!databaseExists(getDefaultDatabase())) { + LOG.info("Creating database {} automatically because it does not exist.", getDefaultDatabase()); + Path dbPath = new Path(catalogPath, getDefaultDatabase()); Review Comment: Can we write a test case for it in `TestHoodieCatalog`?
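The behavior under review above — auto-creating the default database directory when the catalog opens — can be sketched with the JDK's `java.nio.file` as a stand-in for the DFS client. Class and method names here are hypothetical, not Hudi's `HoodieCatalog` API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: on catalog open, create the default database directory
// if it does not exist yet, so later table creation does not fail.
class CatalogOpenSketch {
    static Path openCatalog(Path catalogPath, String defaultDatabase) {
        Path dbPath = catalogPath.resolve(defaultDatabase);
        try {
            if (!Files.isDirectory(dbPath)) {          // databaseExists() guard
                Files.createDirectories(dbPath);       // auto-create instead of failing
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Checking catalog path failed", e);
        }
        return dbPath;
    }

    // Self-contained demo; returns true when both open calls behave as expected.
    static boolean demo() {
        try {
            Path catalog = Files.createTempDirectory("hudi-catalog");
            Path db = openCatalog(catalog, "default");
            boolean created = Files.isDirectory(db);
            boolean idempotent = openCatalog(catalog, "default").equals(db); // second open is a no-op
            return created && idempotent;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        assert demo();
        System.out.println("ok");
    }
}
```

A test along these lines (open twice, assert the database exists and the second open is a no-op) is essentially what the reviewer asks to add to `TestHoodieCatalog`.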
[GitHub] [hudi] danny0405 merged pull request #9583: [MINOR] Update operator name for compact&clustering test class
danny0405 merged PR #9583: URL: https://github.com/apache/hudi/pull/9583
[hudi] branch master updated: [MINOR] Update operator name for compact&clustering test class (#9583)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 6f2e19d933c [MINOR] Update operator name for compact&clustering test class (#9583) 6f2e19d933c is described below commit 6f2e19d933cdd086a1220824bffe6e28b7a50174 Author: hehuiyuan <471627...@qq.com> AuthorDate: Fri Sep 1 09:42:36 2023 +0800 [MINOR] Update operator name for compact&clustering test class (#9583) --- .../org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java | 4 ++-- .../org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java | 8 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java index 18a8aebb8fd..4c817a7927a 100644 --- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java +++ b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java @@ -410,8 +410,8 @@ public class ITTestHoodieFlinkClustering { // keep pending clustering, not committing clustering dataStream .addSink(new DiscardingSink<>()) -.name("clustering_commit") -.uid("uid_clustering_commit") +.name("discarding-sink") +.uid("uid_discarding-sink") .setParallelism(1); env.execute("flink_hudi_clustering"); diff --git a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java index b032ad46765..ac2d93a7305 100644 --- a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java +++ 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java @@ -175,8 +175,8 @@ public class ITTestHoodieFlinkCompactor { new CompactOperator(conf)) .setParallelism(FlinkMiniCluster.DEFAULT_PARALLELISM) .addSink(new CompactionCommitSink(conf)) -.name("clean_commits") -.uid("uid_clean_commits") +.name("compaction_commit") +.uid("uid_compaction_commit") .setParallelism(1); env.execute("flink_hudi_compaction"); @@ -256,8 +256,8 @@ public class ITTestHoodieFlinkCompactor { new CompactOperator(conf)) .setParallelism(FlinkMiniCluster.DEFAULT_PARALLELISM) .addSink(new CompactionCommitSink(conf)) -.name("clean_commits") -.uid("uid_clean_commits") +.name("compaction_commit") +.uid("uid_compaction_commit") .setParallelism(1); env.execute("flink_hudi_compaction");
[GitHub] [hudi] danny0405 commented on a diff in pull request #9577: [HUDI-6805] Print detailed error message in clustering
danny0405 commented on code in PR #9577: URL: https://github.com/apache/hudi/pull/9577#discussion_r1312443728 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java: ## @@ -241,6 +242,9 @@ public WriteStatus close() throws IOException { stat.setTotalWriteBytes(fileSizeInBytes); stat.setFileSizeInBytes(fileSizeInBytes); stat.setTotalWriteErrors(writeStatus.getTotalErrorRecords()); +for (Pair pair : writeStatus.getFailedRecords()) { + LOG.error("Failed to write {}", pair.getLeft(), pair.getRight()); +} Review Comment: Is there any possibility we have too many records to print, so that the logs are overwhelmed?
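One common answer to the reviewer's concern above is to cap the number of failures logged per close. A minimal plain-Java sketch (the cap and names are assumptions for illustration, not a Hudi config):

```java
import java.util.List;

// Hypothetical sketch: log at most MAX_LOGGED_FAILURES failed records, then
// emit a single summary line, so a large failure set cannot flood the logs.
class BoundedErrorLogging {
    static final int MAX_LOGGED_FAILURES = 100; // assumed cap

    static int logFailures(List<String> failedRecords) {
        int logged = 0;
        for (String record : failedRecords) {
            if (logged >= MAX_LOGGED_FAILURES) {
                System.err.println("... suppressed " + (failedRecords.size() - logged) + " further failures");
                break;
            }
            System.err.println("Failed to write " + record);
            logged++;
        }
        return logged;
    }

    public static void main(String[] args) {
        // 250 identical failure placeholders; only the first 100 are logged.
        List<String> failures = java.util.Collections.nCopies(250, "record");
        assert logFailures(failures) == MAX_LOGGED_FAILURES;
        System.out.println("ok");
    }
}
```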
[GitHub] [hudi] danny0405 commented on a diff in pull request #9584: [HUDI-6808] SkipCompaction Config should not affect the stream read of the cow table
danny0405 commented on code in PR #9584: URL: https://github.com/apache/hudi/pull/9584#discussion_r1312442629 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java: ## @@ -601,9 +602,9 @@ public List filterInstantsWithRange( * @return the filtered timeline */ @VisibleForTesting - public HoodieTimeline filterInstantsAsPerUserConfigs(HoodieTimeline timeline) { + public HoodieTimeline filterInstantsAsPerUserConfigs(HoodieTimeline timeline, HoodieTableType tableType) { final HoodieTimeline oriTimeline = timeline; -if (this.skipCompaction) { +if (OptionsResolver.isMorTable(this.conf) & this.skipCompaction) { Review Comment: There is no need to pass around the `HoodieTableType` now.
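A side note on the quoted diff: it uses the non-short-circuit `&` between two booleans. Both operators yield the same truth value, but `&&` skips evaluating its right-hand side when the left is false, which is usually what a guard like `isMorTable(conf) && skipCompaction` intends. A small demonstration:

```java
// Demonstrates why `&&` is usually preferred over `&` for boolean guards:
// same truth value, but `&&` short-circuits and never evaluates its
// right-hand side when the left side is false.
class ShortCircuitDemo {
    static int evaluations = 0;

    static boolean expensiveCheck() {
        evaluations++; // count how often the right-hand side actually runs
        return true;
    }

    public static void main(String[] args) {
        boolean isMor = false;

        evaluations = 0;
        boolean a = isMor & expensiveCheck();  // non-short-circuit: RHS always runs
        assert !a && evaluations == 1;

        evaluations = 0;
        boolean b = isMor && expensiveCheck(); // short-circuit: RHS skipped
        assert !b && evaluations == 0;

        System.out.println("ok");
    }
}
```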
[jira] [Updated] (HUDI-6066) HoodieTableSource supports parquet predicate push down
[ https://issues.apache.org/jira/browse/HUDI-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6066: - Fix Version/s: 1.0.0 > HoodieTableSource supports parquet predicate push down > -- > > Key: HUDI-6066 > URL: https://issues.apache.org/jira/browse/HUDI-6066 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: Nicholas Jiang >Assignee: Nicholas Jiang >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > HoodieTableSource supports the implementation of SupportsFilterPushDown > interface that push down filter into FileIndex. HoodieTableSource should > support parquet predicate push down for query performance.
[jira] [Closed] (HUDI-6066) HoodieTableSource supports parquet predicate push down
[ https://issues.apache.org/jira/browse/HUDI-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6066. Resolution: Fixed Fixed via master branch: 9fa00b7b1547ff46a1bea6d329e20dd702ff90b5 > HoodieTableSource supports parquet predicate push down > -- > > Key: HUDI-6066 > URL: https://issues.apache.org/jira/browse/HUDI-6066 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: Nicholas Jiang >Assignee: Nicholas Jiang >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > HoodieTableSource supports the implementation of SupportsFilterPushDown > interface that push down filter into FileIndex. HoodieTableSource should > support parquet predicate push down for query performance.
[hudi] branch master updated: [HUDI-6066] HoodieTableSource supports parquet predicate push down (#8437)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 9fa00b7b154 [HUDI-6066] HoodieTableSource supports parquet predicate push down (#8437) 9fa00b7b154 is described below commit 9fa00b7b1547ff46a1bea6d329e20dd702ff90b5 Author: Nicholas Jiang AuthorDate: Fri Sep 1 09:36:45 2023 +0800 [HUDI-6066] HoodieTableSource supports parquet predicate push down (#8437) --- .../apache/hudi/source/ExpressionPredicates.java | 654 + .../org/apache/hudi/table/HoodieTableSource.java | 18 +- .../apache/hudi/table/format/RecordIterators.java | 60 +- .../hudi/table/format/cdc/CdcInputFormat.java | 11 +- .../table/format/cow/CopyOnWriteInputFormat.java | 9 +- .../table/format/mor/MergeOnReadInputFormat.java | 17 +- .../hudi/source/TestExpressionPredicates.java | 167 ++ .../apache/hudi/table/ITTestHoodieDataSource.java | 14 + .../apache/hudi/table/TestHoodieTableSource.java | 23 + .../table/format/cow/ParquetSplitReaderUtil.java | 10 +- .../reader/ParquetColumnarRowSplitReader.java | 10 +- .../table/format/cow/ParquetSplitReaderUtil.java | 10 +- .../reader/ParquetColumnarRowSplitReader.java | 10 +- .../table/format/cow/ParquetSplitReaderUtil.java | 10 +- .../reader/ParquetColumnarRowSplitReader.java | 10 +- .../table/format/cow/ParquetSplitReaderUtil.java | 10 +- .../reader/ParquetColumnarRowSplitReader.java | 10 +- .../table/format/cow/ParquetSplitReaderUtil.java | 10 +- .../reader/ParquetColumnarRowSplitReader.java | 10 +- 19 files changed, 1037 insertions(+), 36 deletions(-) diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java new file mode 100644 index 000..046e4b739ad --- /dev/null +++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java @@ -0,0 +1,654 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.source; + +import org.apache.flink.table.expressions.CallExpression; +import org.apache.flink.table.expressions.Expression; +import org.apache.flink.table.expressions.FieldReferenceExpression; +import org.apache.flink.table.expressions.ResolvedExpression; +import org.apache.flink.table.expressions.ValueLiteralExpression; +import org.apache.flink.table.functions.BuiltInFunctionDefinitions; +import org.apache.flink.table.functions.FunctionDefinition; +import org.apache.flink.table.types.logical.LogicalType; +import org.apache.parquet.filter2.predicate.FilterPredicate; +import org.apache.parquet.filter2.predicate.Operators; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.Serializable; +import java.util.Arrays; +import java.util.List; +import java.util.Objects; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +import static org.apache.hudi.common.util.ValidationUtils.checkState; +import static org.apache.hudi.util.ExpressionUtils.getValueFromLiteral; +import static 
org.apache.parquet.filter2.predicate.FilterApi.and; +import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.booleanColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.doubleColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.eq; +import static org.apache.parquet.filter2.predicate.FilterApi.floatColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.gt; +import static org.apache.parquet.filter2.predicate.FilterApi.gtEq; +import static org.apache.parquet.filter2.predicate.FilterApi.intColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.longColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.lt; +import static org.apache.parquet.filter2.predicate.FilterApi.ltEq; +import static org.apache.parquet.filter2.predicate.FilterApi.not; +import static
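The commit above translates Flink `ResolvedExpression`s into Parquet `FilterPredicate`s so filters are evaluated during the scan rather than after rows are materialized. The translation idea can be sketched with only the JDK; this stand-in maps a (field, operator, literal) leaf to a row-level predicate, whereas the real patch emits Parquet `FilterApi` predicates:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch of predicate push-down: a small filter expression is
// translated into a predicate a reader applies while scanning.
class PushDownSketch {
    enum Op { EQ, GT, LT }

    // Translate one (field, op, literal) leaf into a row-level predicate.
    static Predicate<Map<String, Integer>> toPredicate(String field, Op op, int literal) {
        switch (op) {
            case EQ: return row -> row.get(field) == literal;
            case GT: return row -> row.get(field) > literal;
            default: return row -> row.get(field) < literal;
        }
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> rows = List.of(
            Map.of("age", 20), Map.of("age", 30), Map.of("age", 40));
        // Conjunctions compose with Predicate.and, mirroring FilterApi.and:
        Predicate<Map<String, Integer>> pushed =
            toPredicate("age", Op.GT, 25).and(toPredicate("age", Op.LT, 35));
        List<Map<String, Integer>> survived =
            rows.stream().filter(pushed).collect(Collectors.toList());
        assert survived.size() == 1 && survived.get(0).get("age") == 30;
        System.out.println("ok");
    }
}
```

With a columnar format like Parquet the payoff is larger than this row-level sketch suggests, because predicates can also prune whole row groups using column statistics before any decoding happens.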
[GitHub] [hudi] danny0405 merged pull request #8437: [HUDI-6066] HoodieTableSource supports parquet predicate push down
danny0405 merged PR #8437: URL: https://github.com/apache/hudi/pull/8437
[GitHub] [hudi] danny0405 commented on pull request #9475: [HUDI-6766] Fixing mysql debezium data loss
danny0405 commented on PR #9475: URL: https://github.com/apache/hudi/pull/9475#issuecomment-1701989413 There are test failures in Travis.
[GitHub] [hudi] stream2000 commented on a diff in pull request #9515: [HUDI-2141] Support flink compaction metrics
stream2000 commented on code in PR #9515: URL: https://github.com/apache/hudi/pull/9515#discussion_r1312437663 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/metrics/FlinkWriteMetrics.java: ## @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.metrics; + +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.flink.metrics.MetricGroup; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.text.ParseException; + +/** + * Common flink write commit metadata metrics + */ +public class FlinkWriteMetrics extends HoodieFlinkMetrics { + + private static final Logger LOG = LoggerFactory.getLogger(FlinkWriteMetrics.class); + + protected final String actionType; + + private long totalPartitionsWritten; + private long totalFilesInsert; + private long totalFilesUpdate; + private long totalRecordsWritten; + private long totalUpdateRecordsWritten; + private long totalInsertRecordsWritten; + private long totalBytesWritten; + private long totalScanTime; + private long totalCreateTime; + private long totalUpsertTime; + private long totalCompactedRecordsUpdated; + private long totalLogFilesCompacted; + private long totalLogFilesSize; + private long commitLatencyInMs; + private long commitFreshnessInMs; + private long commitEpochTimeInMs; + private long durationInMs; + + public FlinkWriteMetrics(MetricGroup metricGroup, String actionType) { +super(metricGroup); +this.actionType = actionType; + } + + @Override + public void registerMetrics() { +// register commit gauge +metricGroup.gauge(getMetricsName(actionType, "totalPartitionsWritten"), () -> totalPartitionsWritten); +metricGroup.gauge(getMetricsName(actionType, "totalFilesInsert"), () -> totalFilesInsert); +metricGroup.gauge(getMetricsName(actionType, "totalFilesUpdate"), () -> totalFilesUpdate); +metricGroup.gauge(getMetricsName(actionType, "totalRecordsWritten"), () -> totalRecordsWritten); +metricGroup.gauge(getMetricsName(actionType, "totalUpdateRecordsWritten"), () -> totalUpdateRecordsWritten); 
+metricGroup.gauge(getMetricsName(actionType, "totalInsertRecordsWritten"), () -> totalInsertRecordsWritten); +metricGroup.gauge(getMetricsName(actionType, "totalBytesWritten"), () -> totalBytesWritten); +metricGroup.gauge(getMetricsName(actionType, "totalScanTime"), () -> totalScanTime); +metricGroup.gauge(getMetricsName(actionType, "totalCreateTime"), () -> totalCreateTime); +metricGroup.gauge(getMetricsName(actionType, "totalUpsertTime"), () -> totalUpsertTime); +metricGroup.gauge(getMetricsName(actionType, "totalCompactedRecordsUpdated"), () -> totalCompactedRecordsUpdated); Review Comment: Yes of course. Will delete them later.
[GitHub] [hudi] danny0405 commented on a diff in pull request #9515: [HUDI-2141] Support flink compaction metrics
danny0405 commented on code in PR #9515: URL: https://github.com/apache/hudi/pull/9515#discussion_r1312435023 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/metrics/FlinkWriteMetrics.java: ## @@ -0,0 +1,130 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.metrics; + +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.collection.Pair; + +import org.apache.flink.metrics.MetricGroup; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.text.ParseException; + +/** + * Common flink write commit metadata metrics + */ +public class FlinkWriteMetrics extends HoodieFlinkMetrics { + + private static final Logger LOG = LoggerFactory.getLogger(FlinkWriteMetrics.class); + + protected final String actionType; + + private long totalPartitionsWritten; + private long totalFilesInsert; + private long totalFilesUpdate; + private long totalRecordsWritten; + private long totalUpdateRecordsWritten; + private long totalInsertRecordsWritten; + private long totalBytesWritten; + private long totalScanTime; + private long totalCreateTime; + private long totalUpsertTime; + private long totalCompactedRecordsUpdated; + private long totalLogFilesCompacted; + private long totalLogFilesSize; + private long commitLatencyInMs; + private long commitFreshnessInMs; + private long commitEpochTimeInMs; + private long durationInMs; + + public FlinkWriteMetrics(MetricGroup metricGroup, String actionType) { +super(metricGroup); +this.actionType = actionType; + } + + @Override + public void registerMetrics() { +// register commit gauge +metricGroup.gauge(getMetricsName(actionType, "totalPartitionsWritten"), () -> totalPartitionsWritten); +metricGroup.gauge(getMetricsName(actionType, "totalFilesInsert"), () -> totalFilesInsert); +metricGroup.gauge(getMetricsName(actionType, "totalFilesUpdate"), () -> totalFilesUpdate); +metricGroup.gauge(getMetricsName(actionType, "totalRecordsWritten"), () -> totalRecordsWritten); +metricGroup.gauge(getMetricsName(actionType, "totalUpdateRecordsWritten"), () -> totalUpdateRecordsWritten); 
+metricGroup.gauge(getMetricsName(actionType, "totalInsertRecordsWritten"), () -> totalInsertRecordsWritten); +metricGroup.gauge(getMetricsName(actionType, "totalBytesWritten"), () -> totalBytesWritten); +metricGroup.gauge(getMetricsName(actionType, "totalScanTime"), () -> totalScanTime); +metricGroup.gauge(getMetricsName(actionType, "totalCreateTime"), () -> totalCreateTime); +metricGroup.gauge(getMetricsName(actionType, "totalUpsertTime"), () -> totalUpsertTime); +metricGroup.gauge(getMetricsName(actionType, "totalCompactedRecordsUpdated"), () -> totalCompactedRecordsUpdated); Review Comment: Can we drop these write metrics first until we introduce the coordinator metrics?
[GitHub] [hudi] danny0405 commented on issue #9591: [SUPPORT] persist write status RDD in spark compaction job caused the resources could not be released in time
danny0405 commented on issue #9591: URL: https://github.com/apache/hudi/issues/9591#issuecomment-1701976717 cc @nsivabalan , guess the analysis is reasonable?
[GitHub] [hudi] danny0405 commented on issue #9587: [SUPPORT] hoodie.datasource.write.keygenerator.class config not work in bulk_insert mode
danny0405 commented on issue #9587: URL: https://github.com/apache/hudi/issues/9587#issuecomment-1701974951 You are right, because you only have one primary key field: `eid`, maybe you should set up the spark key generator as simple.
[hudi] branch master updated (59f7d2806bf -> c4c5f3e8667)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 59f7d2806bf [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled (#9519) add c4c5f3e8667 [MINOR] Fix failing schema evolution tests in Flink versions < 1.17 (#9586) No new revisions were added by this update. Summary of changes: .../apache/hudi/table/ITTestSchemaEvolution.java | 23 +++--- 1 file changed, 12 insertions(+), 11 deletions(-)
[GitHub] [hudi] danny0405 merged pull request #9586: [MINOR] Fix failing schema evolution tests in Flink versions < 1.17
danny0405 merged PR #9586: URL: https://github.com/apache/hudi/pull/9586
[GitHub] [hudi] leesf commented on a diff in pull request #9558: [HUDI-6481] Support run multi tables services in a single spark job
leesf commented on code in PR #9558: URL: https://github.com/apache/hudi/pull/9558#discussion_r1312423032

hudi-utilities/src/main/java/org/apache/hudi/utilities/multitable/HoodieMultiTableServicesMain.java:

@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.multitable;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.utilities.HoodieCompactor;
+import org.apache.hudi.utilities.IdentitySplitter;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.streamer.HoodieStreamer;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.StringJoiner;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+/**
+ * Main function for executing multi-table services.
+ */
+public class HoodieMultiTableServicesMain {
+  private static final Logger LOG = LoggerFactory.getLogger(HoodieStreamer.class);
+  final Config cfg;
+  final TypedProperties props;
+
+  private final JavaSparkContext jsc;
+
+  private ScheduledExecutorService executorService;
+
+  private void batchRunTableServices(List<String> tablePaths) throws InterruptedException, ExecutionException {
+    ExecutorService executorService = Executors.newFixedThreadPool(cfg.poolSize);
+    List<CompletableFuture<Void>> futures = tablePaths.stream()
+        .map(basePath -> CompletableFuture.runAsync(
+            () -> MultiTableServiceUtils.buildTableServicePipeline(jsc, basePath, cfg, props).execute(),

Review Comment: should early exit if no services is enabled?
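The pattern under review — fanning per-table service pipelines out over a bounded thread pool with `CompletableFuture`, plus the early exit leesf suggests — can be sketched in isolation. This is a minimal sketch, not the actual Hudi API: `Pipeline`, `hasEnabledServices`, and `buildPipeline` are hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class BatchServicesSketch {
    // Hypothetical stand-in for a per-table table-service pipeline.
    interface Pipeline { void execute(); }

    // Placeholder predicate; a real check would inspect the table's service config.
    static boolean hasEnabledServices(String basePath) {
        return !basePath.endsWith("_noop");
    }

    static Pipeline buildPipeline(String basePath) {
        return () -> System.out.println("ran services for " + basePath);
    }

    public static void main(String[] args) {
        List<String> tablePaths = Arrays.asList("/tables/a", "/tables/b_noop", "/tables/c");
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Early exit: filter out tables with nothing to do before scheduling any work.
        List<CompletableFuture<Void>> futures = tablePaths.stream()
            .filter(BatchServicesSketch::hasEnabledServices)
            .map(p -> CompletableFuture.runAsync(() -> buildPipeline(p).execute(), pool))
            .collect(Collectors.toList());
        // Block until every scheduled pipeline finishes.
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        pool.shutdown();
        System.out.println("completed " + futures.size() + " pipelines");
    }
}
```

Filtering before `runAsync` means a table with no enabled services never occupies a pool slot, which matters when the pool is much smaller than the table count.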
[GitHub] [hudi] hudi-bot commented on pull request #9572: [WIP][HUDI-6702] Remove unnecessary calls of `getInsertValue` api from HoodieRecordPayload
hudi-bot commented on PR #9572: URL: https://github.com/apache/hudi/pull/9572#issuecomment-1701933901

CI report:
* ad05887b523496f59ac8b6e976183d6c325ed94d UNKNOWN
* 93813ed1bd85993d5e0674f5ff4e01964338cd49 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19586)
[GitHub] [hudi] hudi-bot commented on pull request #9546: [HUDI-6397] [HUDI-6759] Fixing misc bugs w/ metadata table
hudi-bot commented on PR #9546: URL: https://github.com/apache/hudi/pull/9546#issuecomment-1701933763

CI report:
* 5472cd308f526d6679eba8682957b36d46679f62 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19585)
[GitHub] [hudi] hudi-bot commented on pull request #9581: [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1701927483

CI report:
* 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
* e286659cb1e1cb69126b8ec09d4e2a62969ce9d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19587)
[GitHub] [hudi] hudi-bot commented on pull request #9521: [HUDI-6736] Fixing rollback completion and commit timeline files removal
hudi-bot commented on PR #9521: URL: https://github.com/apache/hudi/pull/9521#issuecomment-1701927027

CI report:
* c22c23106d356cd295067d1330828384c8bdb902 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19584)
[GitHub] [hudi] yihua commented on pull request #8437: [HUDI-6066] HoodieTableSource supports parquet predicate push down
yihua commented on PR #8437: URL: https://github.com/apache/hudi/pull/8437#issuecomment-1701917436

@danny0405 could you help review this again?
[GitHub] [hudi] yihua merged pull request #9519: [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled
yihua merged PR #9519: URL: https://github.com/apache/hudi/pull/9519
[hudi] branch master updated: [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled (#9519)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 59f7d2806bf [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled (#9519)

59f7d2806bf is described below

commit 59f7d2806bfc2d402dc8f5694dcb9d345e3d5a55
Author: Aditya Goenka <63430370+ad1happy...@users.noreply.github.com>
AuthorDate: Fri Sep 1 04:47:48 2023 +0530

    [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled (#9519)

    Co-authored-by: Y Ethan Guo
---
 .../hudi/io/HoodieMergeHandleWithChangeLog.java |  2 +-
 .../functional/cdc/TestCDCDataFrameSuite.scala  | 56 +-
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
index d610891c2ca..f8669416f0c 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
@@ -103,7 +103,7 @@ public class HoodieMergeHandleWithChangeLog extends HoodieMergeHandl
     // TODO Remove these unnecessary newInstance invocations
     HoodieRecord savedRecord = newRecord.newInstance();
     super.writeInsertRecord(newRecord);
-    if (!HoodieOperation.isDelete(newRecord.getOperation())) {
+    if (!HoodieOperation.isDelete(newRecord.getOperation()) && !savedRecord.isDelete(schema, config.getPayloadConfig().getProps())) {
       cdcLogger.put(newRecord, null, savedRecord.toIndexedRecord(schema, config.getPayloadConfig().getProps()).map(HoodieAvroIndexedRecord::getData));
     }
   }

diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala
index 36629687106..aac836d8c3a 100644
--- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala
@@ -26,7 +26,8 @@ import org.apache.hudi.common.table.cdc.{HoodieCDCOperation, HoodieCDCSupplement
 import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient, TableSchemaResolver}
 import org.apache.hudi.common.testutils.HoodieTestDataGenerator
 import org.apache.hudi.common.testutils.RawTripTestPayload.{deleteRecordsToStrings, recordsToStrings}
-import org.apache.spark.sql.SaveMode
+import org.apache.spark.sql.{Row, SaveMode}
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
 import org.junit.jupiter.api.Assertions.{assertEquals, assertFalse, assertTrue}
 import org.junit.jupiter.params.ParameterizedTest
 import org.junit.jupiter.params.provider.{CsvSource, EnumSource}
@@ -634,4 +635,57 @@ class TestCDCDataFrameSuite extends HoodieCDCTestBase {
     val cdcDataOnly2 = cdcDataFrame((commitTime2.toLong - 1).toString)
     assertCDCOpCnt(cdcDataOnly2, insertedCnt2, updatedCnt2, 0)
   }
+
+  @ParameterizedTest
+  @EnumSource(classOf[HoodieCDCSupplementalLoggingMode])
+  def testCDCWithAWSDMSPayload(loggingMode: HoodieCDCSupplementalLoggingMode): Unit = {
+    val options = Map(
+      "hoodie.table.name" -> "test",
+      "hoodie.datasource.write.recordkey.field" -> "id",
+      "hoodie.datasource.write.precombine.field" -> "replicadmstimestamp",
+      "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
+      "hoodie.datasource.write.partitionpath.field" -> "",
+      "hoodie.datasource.write.payload.class" -> "org.apache.hudi.common.model.AWSDmsAvroPayload",
+      "hoodie.table.cdc.enabled" -> "true",
+      "hoodie.table.cdc.supplemental.logging.mode" -> "data_before_after"
+    )
+
+    val data: Seq[(String, String, String, String)] = Seq(
+      ("1", "I", "2023-06-14 15:46:06.953746", "A"),
+      ("2", "I", "2023-06-14 15:46:07.953746", "B"),
+      ("3", "I", "2023-06-14 15:46:08.953746", "C")
+    )
+
+    val schema: StructType = StructType(Seq(
+      StructField("id", StringType),
+      StructField("Op", StringType),
+      StructField("replicadmstimestamp", StringType),
+      StructField("code", StringType)
+    ))
+
+    val df = spark.createDataFrame(data.map(Row.fromTuple), schema)
+    df.write
+      .format("org.apache.hudi")
+      .option("hoodie.datasource.write.operation", "upsert")
+      .options(options)
+      .mode("append")
+      .save(basePath)
+
+    assertEquals(spark.read.format("org.apache.hudi").load(basePath).count(), 3)
+
+    val newData: Seq[(String, String, St
[GitHub] [hudi] yihua commented on a diff in pull request #9519: [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled
yihua commented on code in PR #9519: URL: https://github.com/apache/hudi/pull/9519#discussion_r1312375700

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java:

@@ -103,7 +103,7 @@ protected void writeInsertRecord(HoodieRecord newRecord) throws IOException {
     // TODO Remove these unnecessary newInstance invocations
     HoodieRecord savedRecord = newRecord.newInstance();
     super.writeInsertRecord(newRecord);
-    if (!HoodieOperation.isDelete(newRecord.getOperation())) {
+    if (!HoodieOperation.isDelete(newRecord.getOperation()) && !savedRecord.isDelete(schema, config.getPayloadConfig().getProps())) {

Review Comment: I think we should (i.e., adding an `else` block to handle the deletes). However, it's a different issue we need to tackle. I'll follow up.
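The fix and the follow-up discussed above come down to one guard: log an "after" image for a record only when neither the operation flag nor the payload itself (as with a DMS-style `Op == "D"` marker) says the record is a delete. A minimal, self-contained sketch — all names (`Record`, `cdcLog`, the boolean flags) are hypothetical stand-ins for the Hudi types, not the actual API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CdcGuardSketch {
    // Hypothetical record: an operation flag plus a payload-level delete marker.
    static class Record {
        final String key;
        final boolean deleteOperation;  // analogue of HoodieOperation.isDelete(...)
        final boolean payloadIsDelete;  // analogue of payload.isDelete(...)
        Record(String key, boolean deleteOperation, boolean payloadIsDelete) {
            this.key = key;
            this.deleteOperation = deleteOperation;
            this.payloadIsDelete = payloadIsDelete;
        }
    }

    // Emit an "after" image only for true inserts; anything marked deleted
    // goes down the delete branch (the follow-up "else" block from the review).
    static List<String> cdcLog(List<Record> records) {
        List<String> log = new ArrayList<>();
        for (Record r : records) {
            if (!r.deleteOperation && !r.payloadIsDelete) {
                log.add(r.key + ":after");
            } else {
                log.add(r.key + ":delete");
            }
        }
        return log;
    }

    public static void main(String[] args) {
        List<Record> batch = Arrays.asList(
            new Record("a", false, false),
            new Record("b", false, true),   // payload-level delete (DMS-style)
            new Record("c", true, false));  // operation-level delete
        System.out.println(cdcLog(batch));
    }
}
```

Before the fix, record "b" would have passed the operation-only check and been logged with an "after" image despite being a delete.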
[GitHub] [hudi] hudi-bot commented on pull request #9519: [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled
hudi-bot commented on PR #9519: URL: https://github.com/apache/hudi/pull/9519#issuecomment-1701899670

CI report:
* c727303e24756595101e6b8319a250a6476aa012 UNKNOWN
[GitHub] [hudi] yihua commented on pull request #9519: [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled
yihua commented on PR #9519: URL: https://github.com/apache/hudi/pull/9519#issuecomment-1701899052

CI is green: https://github.com/apache/hudi/assets/2497195/a1a6470d-f015-4687-a9b7-a2e01116b28e (screenshot)
[GitHub] [hudi] hudi-bot commented on pull request #9571: Enabling comprehensive schema evolution in delta streamer code
hudi-bot commented on PR #9571: URL: https://github.com/apache/hudi/pull/9571#issuecomment-1701887365

CI report:
* 3af6011d72b294b0995d52be40a6d91e6eff9a1b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19561)
* 871ff24da9c3800b8f19bdabda140621549aaf3b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19588)
[GitHub] [hudi] hudi-bot commented on pull request #9585: [HUDI-6809] Optimizing the judgment of generating clustering plans
hudi-bot commented on PR #9585: URL: https://github.com/apache/hudi/pull/9585#issuecomment-1701850344

CI report:
* 9a2675de94095d2baac571a6dd71ec368b8a9e8c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19582)
[GitHub] [hudi] hudi-bot commented on pull request #9571: Enabling comprehensive schema evolution in delta streamer code
hudi-bot commented on PR #9571: URL: https://github.com/apache/hudi/pull/9571#issuecomment-1701850224

CI report:
* 3af6011d72b294b0995d52be40a6d91e6eff9a1b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19561)
* 871ff24da9c3800b8f19bdabda140621549aaf3b UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9581: [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1701842618

CI report:
* 1208189ffb60441f9544933a2446ad194509c391 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19565)
* 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
* e286659cb1e1cb69126b8ec09d4e2a62969ce9d4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19587)
[GitHub] [hudi] hudi-bot commented on pull request #9581: [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1701802062

CI report:
* 1208189ffb60441f9544933a2446ad194509c391 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19565)
* 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
* e286659cb1e1cb69126b8ec09d4e2a62969ce9d4 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9581: [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes
hudi-bot commented on PR #9581: URL: https://github.com/apache/hudi/pull/9581#issuecomment-1701789789

CI report:
* 1208189ffb60441f9544933a2446ad194509c391 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19565)
* 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9572: [WIP][HUDI-6702] Remove unnecessary calls of `getInsertValue` api from HoodieRecordPayload
hudi-bot commented on PR #9572: URL: https://github.com/apache/hudi/pull/9572#issuecomment-1701789643

CI report:
* ad05887b523496f59ac8b6e976183d6c325ed94d UNKNOWN
* cf848446b9c837be3c1c2fdc7930b26f920a0754 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19563)
* 93813ed1bd85993d5e0674f5ff4e01964338cd49 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19586)
[GitHub] [hudi] hudi-bot commented on pull request #9592: automatically create a database when using the flink catalog dfs mode
hudi-bot commented on PR #9592: URL: https://github.com/apache/hudi/pull/9592#issuecomment-1701777353

CI report:
* c961be19038e5600f418ef660b7ede740cef76c6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19581)
[GitHub] [hudi] hudi-bot commented on pull request #9572: [WIP][HUDI-6702] Remove unnecessary calls of `getInsertValue` api from HoodieRecordPayload
hudi-bot commented on PR #9572: URL: https://github.com/apache/hudi/pull/9572#issuecomment-1701777190

CI report:
* ad05887b523496f59ac8b6e976183d6c325ed94d UNKNOWN
* cf848446b9c837be3c1c2fdc7930b26f920a0754 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19563)
* 93813ed1bd85993d5e0674f5ff4e01964338cd49 UNKNOWN
[jira] [Updated] (HUDI-6795) Implement generation of record_positions for updates and deletes on write path
[ https://issues.apache.org/jira/browse/HUDI-6795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6795:
    Status: Patch Available (was: In Progress)

> Implement generation of record_positions for updates and deletes on write path
> Key: HUDI-6795
> URL: https://issues.apache.org/jira/browse/HUDI-6795
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.0.0
[jira] [Updated] (HUDI-6785) Introduce an engine-agnostic FileGroupReader for snapshot read
[ https://issues.apache.org/jira/browse/HUDI-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6785:
    Status: In Progress (was: Open)

> Introduce an engine-agnostic FileGroupReader for snapshot read
> Key: HUDI-6785
> URL: https://issues.apache.org/jira/browse/HUDI-6785
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ethan Guo
> Priority: Blocker
> Fix For: 1.0.0
[jira] [Assigned] (HUDI-5463) Apply rollback commits from data table as rollbacks in MDT instead of Delta commit
[ https://issues.apache.org/jira/browse/HUDI-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo reassigned HUDI-5463:
    Assignee: (was: sivabalan narayanan)

> Apply rollback commits from data table as rollbacks in MDT instead of Delta commit
> Key: HUDI-5463
> URL: https://issues.apache.org/jira/browse/HUDI-5463
> Project: Apache Hudi
> Issue Type: Bug
> Components: metadata
> Reporter: sivabalan narayanan
> Priority: Critical
> Fix For: 0.14.0
>
> As of now, any rollback in the data table (DT) is applied as another delta commit (DC) in the metadata table (MDT). This may not scale for the record-level index in the MDT, since we would have to add thousands of delete records and then resolve all valid and invalid records. It is therefore better to roll back the commit in the MDT as well, instead of issuing a DC.
>
> Impact: the record-level index is unusable without this change. While fixing other rollback-related tickets, consider this as a possible option if it simplifies those fixes.
[jira] [Updated] (HUDI-6795) Implement generation of record_positions for updates and deletes on write path
[ https://issues.apache.org/jira/browse/HUDI-6795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6795:
    Reviewers: sivabalan narayanan

> Implement generation of record_positions for updates and deletes on write path
> Key: HUDI-6795
> URL: https://issues.apache.org/jira/browse/HUDI-6795
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.0.0
[GitHub] [hudi] linliu-code commented on pull request #9572: [WIP][HUDI-6702] Remove unnecessary calls of `getInsertValue` api from HoodieRecordPayload
linliu-code commented on PR #9572: URL: https://github.com/apache/hudi/pull/9572#issuecomment-1701748977

@yihua @danny0405 please comment! Thank you!