[jira] [Updated] (HUDI-1739) Standardize usage of replacecommit files across the code base

2023-08-31 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-1739:
--
Reviewers: Sagar Sumit

> Standardize usage of replacecommit files across the code base
> -
>
> Key: HUDI-1739
> URL: https://issues.apache.org/jira/browse/HUDI-1739
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Jagmeet Bali
>Assignee: Susu Dong
>Priority: Critical
>
> Fixes can be to:
>  # Ignore empty replacecommit.requested files.
>  # Standardize the replacecommit.requested format across all invocations, be
> it from clustering or this use case.
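A hedged sketch of fix #1, skipping zero-byte plan files when scanning the timeline folder (plain java.nio; the helper is illustrative, not Hudi's actual timeline API):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only: when scanning the .hoodie timeline folder, skip
// zero-byte replacecommit.requested files instead of failing on empty plans.
class ReplaceCommitScanner {
  static List<Path> nonEmptyRequestedReplaceCommits(Path metaFolder) throws IOException {
    try (Stream<Path> files = Files.list(metaFolder)) {
      return files
          .filter(p -> p.getFileName().toString().endsWith(".replacecommit.requested"))
          .filter(p -> {
            try {
              return Files.size(p) > 0; // fix #1: ignore empty plan files
            } catch (IOException e) {
              return false;
            }
          })
          .collect(Collectors.toList());
    }
  }
}
{code}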



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9558: [HUDI-6481] Support run multi tables services in a single spark job

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9558:
URL: https://github.com/apache/hudi/pull/9558#issuecomment-1702225241

   
   ## CI report:
   
   * 1640805e55e219b1c512bde9650849613c03e0b9 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19598)
 
   * ffc02724376dc67f1d5426fc1d95cbf1725d0261 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19603)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] empcl commented on a diff in pull request #9592: automatically create a database when using the flink catalog dfs mode

2023-08-31 Thread via GitHub


empcl commented on code in PR #9592:
URL: https://github.com/apache/hudi/pull/9592#discussion_r1312612204


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java:
##


Review Comment:
   Because there is already a check: `org.apache.hudi.table.catalog.TestHoodieCatalog#testDatabaseExists`.



##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java:
##


Review Comment:
   Because there is already a check: `org.apache.hudi.table.catalog.TestHoodieCatalog#testDatabaseExists`.






[GitHub] [hudi] hudi-bot commented on pull request #9558: [HUDI-6481] Support run multi tables services in a single spark job

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9558:
URL: https://github.com/apache/hudi/pull/9558#issuecomment-1702217716

   
   ## CI report:
   
   * d0a5621c43699e3cd636c99ef6cc048788f04459 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19573)
 
   * 1640805e55e219b1c512bde9650849613c03e0b9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19598)
 
   * ffc02724376dc67f1d5426fc1d95cbf1725d0261 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #9592: automatically create a database when using the flink catalog dfs mode

2023-08-31 Thread via GitHub


danny0405 commented on code in PR #9592:
URL: https://github.com/apache/hudi/pull/9592#discussion_r1312603674


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java:
##


Review Comment:
   Can you add a test case where the default database is created?






[jira] [Updated] (HUDI-6732) Handle wildcards for partition paths passed in via spark-sql

2023-08-31 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6732:
-
Fix Version/s: 1.0.0

> Handle wildcards for partition paths passed in via spark-sql
> 
>
> Key: HUDI-6732
> URL: https://issues.apache.org/jira/browse/HUDI-6732
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
> Attachments: image-2023-08-21-14-59-27-095.png
>
>
> The drop partition DDL does not handle wildcards properly, specifically for
> partitions with wildcards that are submitted via the Spark-SQL entry point.
>  
> {code:java}
> ALTER TABLE table_x DROP PARTITION(partition_col="*")  {code}
>  
> The Spark-SQL entry point will url-encode special characters, causing the *
> character to be url-encoded to {*}%2A{*}; as such, we will need to handle
> that too.
>  
> !image-2023-08-21-14-59-27-095.png!
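The matching logic the fix needs can be sketched in standalone Java: treat both a raw {{*}} and its URL-encoded form {{%2A}} as the wildcard token, and quote the rest of the spec so regex metacharacters stay literal (the class name is illustrative):

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of the wildcard handling: both "*" and its
// URL-encoded form "%2A" are treated as the wildcard token.
class WildcardPartitionMatcher {
  static List<String> resolve(String partitionSpec, List<String> allPartitions) {
    String wildcardToken = partitionSpec.contains("*") ? "*" : "%2A";
    // \Q...\E keeps everything literal; each wildcard becomes ".*".
    String regex = "^\\Q" + partitionSpec.replace(wildcardToken, "\\E.*\\Q") + "\\E$";
    return allPartitions.stream().filter(p -> p.matches(regex)).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> partitions = Arrays.asList(
        "partition_col=2023-08-01", "partition_col=2023-08-02", "partition_col=2023-09-01");
    // Spark-SQL may hand over the URL-encoded form of "partition_col=2023-08-*".
    System.out.println(resolve("partition_col=2023-08-%2A", partitions));
    // prints [partition_col=2023-08-01, partition_col=2023-08-02]
  }
}
{code}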





[jira] [Closed] (HUDI-6732) Handle wildcards for partition paths passed in via spark-sql

2023-08-31 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6732.

Resolution: Fixed

Fixed via master branch: 64a05bc0b874fd2f3ce01c669840bb619550f033

> Handle wildcards for partition paths passed in via spark-sql
> 
>
> Key: HUDI-6732
> URL: https://issues.apache.org/jira/browse/HUDI-6732
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
> Attachments: image-2023-08-21-14-59-27-095.png
>
>
> The drop partition DDL does not handle wildcards properly, specifically for
> partitions with wildcards that are submitted via the Spark-SQL entry point.
>  
> {code:java}
> ALTER TABLE table_x DROP PARTITION(partition_col="*")  {code}
>  
> The Spark-SQL entry point will url-encode special characters, causing the *
> character to be url-encoded to {*}%2A{*}; as such, we will need to handle
> that too.
>  
> !image-2023-08-21-14-59-27-095.png!





[hudi] branch master updated: [HUDI-6732] Allow wildcards from Spark-SQL entrypoints for drop partition DDL (#9491)

2023-08-31 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 64a05bc0b87 [HUDI-6732] Allow wildcards from Spark-SQL entrypoints for drop partition DDL (#9491)
64a05bc0b87 is described below

commit 64a05bc0b874fd2f3ce01c669840bb619550f033
Author: voonhous 
AuthorDate: Fri Sep 1 13:54:27 2023 +0800

    [HUDI-6732] Allow wildcards from Spark-SQL entrypoints for drop partition DDL (#9491)
---
 .../org/apache/hudi/HoodieSparkSqlWriter.scala |  6 ++--
 .../sql/hudi/TestAlterTableDropPartition.scala | 36 ++
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index cf78e514dda..6d0ce7d16bf 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -606,7 +606,8 @@ object HoodieSparkSqlWriter {
     */
   private def resolvePartitionWildcards(partitions: List[String], jsc: JavaSparkContext, cfg: HoodieConfig, basePath: String): List[String] = {
     //find out if any of the input partitions have wildcards
-    var (wildcardPartitions, fullPartitions) = partitions.partition(partition => partition.contains("*"))
+    //note: spark-sql may url-encode special characters (* -> %2A)
+    var (wildcardPartitions, fullPartitions) = partitions.partition(partition => partition.matches(".*(\\*|%2A).*"))
 
     if (wildcardPartitions.nonEmpty) {
       //get list of all partitions
@@ -621,7 +622,8 @@ object HoodieSparkSqlWriter {
         //prevent that from happening. Any text inbetween \\Q and \\E is considered literal
         //So we start the string with \\Q and end with \\E and then whenever we find a * we add \\E before
         //and \\Q after so all other characters besides .* will be enclosed between a set of \\Q \\E
-        val regexPartition = "^\\Q" + partition.replace("*", "\\E.*\\Q") + "\\E$"
+        val wildcardToken: String = if (partition.contains("*")) "*" else "%2A"
+        val regexPartition = "^\\Q" + partition.replace(wildcardToken, "\\E.*\\Q") + "\\E$"
 
         //filter all partitions with the regex and append the result to the list of full partitions
         fullPartitions = List.concat(fullPartitions,allPartitions.filter(_.matches(regexPartition)))
diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTableDropPartition.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTableDropPartition.scala
index 2261e83f7f9..b421732d270 100644
--- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTableDropPartition.scala
+++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestAlterTableDropPartition.scala
@@ -620,4 +620,40 @@ class TestAlterTableDropPartition extends HoodieSparkSqlTestBase {
       checkExceptionContain(s"ALTER TABLE $tableName DROP PARTITION($partition)")(errMsg)
     }
   }
+
+  test("Test drop partition with wildcards") {
+    withRecordType()(withTempDir { tmp =>
+      Seq("cow", "mor").foreach { tableType =>
+        val tableName = generateTableName
+        spark.sql(
+          s"""
+             |create table $tableName (
+             |  id int,
+             |  name string,
+             |  price double,
+             |  ts long,
+             |  partition_date_col string
+             |) using hudi
+             | location '${tmp.getCanonicalPath}/$tableName'
+             | tblproperties (
+             |  primaryKey ='id',
+             |  type = '$tableType',
+             |  preCombineField = 'ts'
+             | ) partitioned by (partition_date_col)
+       """.stripMargin)
+        spark.sql(s"insert into $tableName values " +
+          s"(1, 'a1', 10, 1000, '2023-08-01'), (2, 'a2', 10, 1000, '2023-08-02'), (3, 'a3', 10, 1000, '2023-09-01')")
+        checkAnswer(s"show partitions $tableName")(
+          Seq("partition_date_col=2023-08-01"),
+          Seq("partition_date_col=2023-08-02"),
+          Seq("partition_date_col=2023-09-01")
+        )
+        spark.sql(s"alter table $tableName drop partition(partition_date_col='2023-08-*')")
+        // show partitions will still return all partitions for tests, use select distinct as a stop-gap
+        checkAnswer(s"select distinct partition_date_col from $tableName")(
+          Seq("2023-09-01")
+        )
+      }
+    })
+  }
 }



[GitHub] [hudi] danny0405 merged pull request #9491: [HUDI-6732] Allow wildcards from Spark-SQL entrypoints for drop parti…

2023-08-31 Thread via GitHub


danny0405 merged PR #9491:
URL: https://github.com/apache/hudi/pull/9491





[GitHub] [hudi] hudi-bot commented on pull request #9595: [MINOR] Catch EntityNotFoundException correctly

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9595:
URL: https://github.com/apache/hudi/pull/9595#issuecomment-1702170861

   
   ## CI report:
   
   * 0cf80bdf054737a6f13bccc8250ce1b3686a0e8b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19601)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9592: automatically create a database when using the flink catalog dfs mode

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9592:
URL: https://github.com/apache/hudi/pull/9592#issuecomment-1702170818

   
   ## CI report:
   
   * c961be19038e5600f418ef660b7ede740cef76c6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19581)
 
   * 702653a08249790e738497e49ddc9970613e2343 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19600)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] empcl commented on a diff in pull request #9592: automatically create a database when using the flink catalog dfs mode

2023-08-31 Thread via GitHub


empcl commented on code in PR #9592:
URL: https://github.com/apache/hudi/pull/9592#discussion_r1312545914


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieCatalog.java:
##


Review Comment:
   @danny0405 Hello, currently in the test cases, we should not manually create the catalog+db path, but instead create the db directory by calling the open() method.






[GitHub] [hudi] aajisaka commented on a diff in pull request #9577: [HUDI-6805] Print detailed error message in clustering

2023-08-31 Thread via GitHub


aajisaka commented on code in PR #9577:
URL: https://github.com/apache/hudi/pull/9577#discussion_r1312545323


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java:
##
@@ -241,6 +242,9 @@ public WriteStatus close() throws IOException {
 stat.setTotalWriteBytes(fileSizeInBytes);
 stat.setFileSizeInBytes(fileSizeInBytes);
 stat.setTotalWriteErrors(writeStatus.getTotalErrorRecords());
+for (Pair pair : writeStatus.getFailedRecords()) {
+  LOG.error("Failed to write {}", pair.getLeft(), pair.getRight());
+}

Review Comment:
   There's a low possibility, as Hudi doesn't store all the exceptions in `writeStatus.getFailedRecords()`. By default, 10% of the errors are stored, and the percentage is configurable via `hoodie.memory.writestatus.failure.fraction`. Note that the first error is always stored.
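   
   A simplified, self-contained sketch of that sampling behavior (illustrative only, not Hudi's actual `WriteStatus` code):
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Random;
   
   // Simplified illustration of a failure fraction: the first failure is
   // always kept, later failures are sampled at the configured fraction
   // (default 0.1, cf. hoodie.memory.writestatus.failure.fraction).
   class SampledFailures<T> {
     private final List<T> kept = new ArrayList<>();
     private final double fraction;
     private final Random random = new Random();
   
     SampledFailures(double fraction) {
       this.fraction = fraction;
     }
   
     void onFailure(T failedRecord) {
       if (kept.isEmpty() || random.nextDouble() <= fraction) {
         kept.add(failedRecord); // first error always stored; the rest sampled
       }
     }
   
     List<T> getFailedRecords() {
       return kept;
     }
   }
   ```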
   
   






[GitHub] [hudi] hudi-bot commented on pull request #9595: [MINOR] Catch EntityNotFoundException correctly

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9595:
URL: https://github.com/apache/hudi/pull/9595#issuecomment-1702164954

   
   ## CI report:
   
   * 0cf80bdf054737a6f13bccc8250ce1b3686a0e8b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9592: automatically create a database when using the flink catalog dfs mode

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9592:
URL: https://github.com/apache/hudi/pull/9592#issuecomment-1702164906

   
   ## CI report:
   
   * c961be19038e5600f418ef660b7ede740cef76c6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19581)
 
   * 702653a08249790e738497e49ddc9970613e2343 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9585: [HUDI-6809] Optimizing the judgment of generating clustering plans

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9585:
URL: https://github.com/apache/hudi/pull/9585#issuecomment-1702164849

   
   ## CI report:
   
   * 67e18f40f585f17a96068ca4737a0dd7d800354e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19593)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9594: [HUDI-6742] Remove the log file appending for multiple instants

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9594:
URL: https://github.com/apache/hudi/pull/9594#issuecomment-1702158693

   
   ## CI report:
   
   * ac71c9982c1d47e3df2332671d1981d1bee51ab7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19599)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] CTTY opened a new pull request, #9595: [MINOR] Catch EntityNotFoundException correctly

2023-08-31 Thread via GitHub


CTTY opened a new pull request, #9595:
URL: https://github.com/apache/hudi/pull/9595

   
   ### Change Logs
   
   When a table/database is not found while syncing to Glue, Glue returns `EntityNotFoundException`.
   After upgrading to AWS SDK V2, Hudi uses `GlueAsyncClient` to get a `CompletableFuture`, which throws an `ExecutionException` with `EntityNotFoundException` nested inside when the table/database doesn't exist. However, the existing Hudi code doesn't handle `ExecutionException` and would fail the job.
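   
   A minimal sketch of the unwrapping pattern (AWS SDK v2 class names; the helper method itself is illustrative):
   
   ```java
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.ExecutionException;
   
   import software.amazon.awssdk.services.glue.GlueAsyncClient;
   import software.amazon.awssdk.services.glue.model.EntityNotFoundException;
   import software.amazon.awssdk.services.glue.model.GetTableRequest;
   import software.amazon.awssdk.services.glue.model.GetTableResponse;
   
   class GlueTableChecker {
     // Returns true iff the table exists; a nested EntityNotFoundException means "absent".
     static boolean tableExists(GlueAsyncClient glue, String db, String table) {
       CompletableFuture<GetTableResponse> future =
           glue.getTable(GetTableRequest.builder().databaseName(db).name(table).build());
       try {
         future.get();
         return true;
       } catch (ExecutionException e) {
         // The async client wraps service errors; unwrap the cause before matching.
         if (e.getCause() instanceof EntityNotFoundException) {
           return false;
         }
         throw new RuntimeException("Failed to check table " + db + "." + table, e);
       } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
         throw new RuntimeException(e);
       }
     }
   }
   ```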
   
   Sample exception:
   ```
   org.apache.hudi.exception.HoodieMetaSyncException: Could not sync using the 
meta sync class org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
 at 
org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:81)
 at 
org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:959)
 at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
 at 
org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:957)
 at 
org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1055)
 at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:409)
 at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
 at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
 at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104)
 at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
 at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
 at 
org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
 at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
 at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
 at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101)
 at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:554)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:107)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:554)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:530)
 at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:97)
 at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:84)
 at 
org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82)
 at 
org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142)
 at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856)
 at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387)
 at 
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360)
 at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.s

[GitHub] [hudi] voonhous commented on pull request #9491: [HUDI-6732] Allow wildcards from Spark-SQL entrypoints for drop parti…

2023-08-31 Thread via GitHub


voonhous commented on PR #9491:
URL: https://github.com/apache/hudi/pull/9491#issuecomment-1702138452

   @danny0405 Gentle reminder, CI is green.





[GitHub] [hudi] hudi-bot commented on pull request #9558: [HUDI-6481] Support run multi tables services in a single spark job

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9558:
URL: https://github.com/apache/hudi/pull/9558#issuecomment-1702134207

   
   ## CI report:
   
   * d0a5621c43699e3cd636c99ef6cc048788f04459 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19573)
 
   * 1640805e55e219b1c512bde9650849613c03e0b9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19598)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #9590: [HUDI-6780] Introduce enums instead of classnames in table properties

2023-08-31 Thread via GitHub


danny0405 commented on code in PR #9590:
URL: https://github.com/apache/hudi/pull/9590#discussion_r1312518092


##
hudi-common/src/main/java/org/apache/hudi/common/model/RecordPayloadType.java:
##
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.hudi.common.config.EnumDescription;
+import org.apache.hudi.common.config.EnumFieldDescription;
+import org.apache.hudi.common.model.debezium.MySqlDebeziumAvroPayload;
+import org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload;
+
+/**
+ * Payload to use for record.
+ */
+@EnumDescription("Payload to use for merging records")
+public enum RecordPayloadType {
+  @EnumFieldDescription("Provides support for seamlessly applying changes 
captured via Amazon Database Migration Service onto S3.")
+  AWS_DMS_AVRO(AWSDmsAvroPayload.class.getName()),
+
+  @EnumFieldDescription("Honors ordering field in both preCombine and 
combineAndGetUpdateValue.")
+  HOODIE_AVRO_DEFAULT(DefaultHoodieRecordPayload.class.getName()),

Review Comment:
   Are these options expected to be used by users? If so, there might be an inconsistency between the table config and the write config; for the write config, do we still prefer the class name?






[GitHub] [hudi] hudi-bot commented on pull request #9594: [HUDI-6742] Remove the log file appending for multiple instants

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9594:
URL: https://github.com/apache/hudi/pull/9594#issuecomment-1702129341

   
   ## CI report:
   
   * ac71c9982c1d47e3df2332671d1981d1bee51ab7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9558: [HUDI-6481] Support run multi tables services in a single spark job

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9558:
URL: https://github.com/apache/hudi/pull/9558#issuecomment-1702129218

   
   ## CI report:
   
   * d0a5621c43699e3cd636c99ef6cc048788f04459 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19573)
 
   * 1640805e55e219b1c512bde9650849613c03e0b9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on pull request #9515: [HUDI-2141] Support flink compaction metrics

2023-08-31 Thread via GitHub


danny0405 commented on PR #9515:
URL: https://github.com/apache/hudi/pull/9515#issuecomment-1702127608

   > Tested locally; only these four metrics are useless. Let's remove them until we support coordinator metrics. @danny0405 What do you think?
   
   +1





[GitHub] [hudi] danny0405 closed pull request #9475: [HUDI-6766] Fixing mysql debezium data loss

2023-08-31 Thread via GitHub


danny0405 closed pull request #9475: [HUDI-6766] Fixing mysql debezium data 
loss 
URL: https://github.com/apache/hudi/pull/9475





[GitHub] [hudi] danny0405 commented on issue #9587: [SUPPORT] hoodie.datasource.write.keygenerator.class config not work in bulk_insert mode

2023-08-31 Thread via GitHub


danny0405 commented on issue #9587:
URL: https://github.com/apache/hudi/issues/9587#issuecomment-1702126617

   > Maybe I can fix it by making the simple key generator support multiple partition keys
   
   Makes sense to me.





[GitHub] [hudi] punish-yh commented on issue #9587: [SUPPORT] hoodie.datasource.write.keygenerator.class config not work in bulk_insert mode

2023-08-31 Thread via GitHub


punish-yh commented on issue #9587:
URL: https://github.com/apache/hudi/issues/9587#issuecomment-1702123005

   > You are right: because you only have one primary key field, `eid`, maybe you should set up the Spark key generator as simple.
   
   Thank you for your reply. I used `hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleAvroKeyGenerator` and ran this job again.
   
   The bulk_insert job finished successfully, but in upsert mode the records were written to the `__HIVE_DEFAULT_PARTITION__` partition, because I configured the `_db` and `_table` fields as partition fields. The simple key generator does not split the partition field, so the fields mismatch in the getPartitionPath function and it returns `__HIVE_DEFAULT_PARTITION__`.
   
   
![image](https://github.com/apache/hudi/assets/59658062/af0dfaff-3cc6-4758-b315-c3aaedfe0b14)
   
![image](https://github.com/apache/hudi/assets/59658062/fdc6590d-6c56-4d08-9a44-6725e3b48742)
   
![image](https://github.com/apache/hudi/assets/59658062/998a196e-6f81-4b82-aef8-0c440b7af297)
   
   For now, I can use a custom key generator to fix my problem.
   
   
   But I would like to ask: does this align with the simple key generator's initial design? Maybe I can fix it by making the simple key generator support multiple partition keys.
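   
   For reference, a hedged sketch of writer options for multiple partition fields (values are illustrative; `ComplexKeyGenerator` is the stock generator that splits a comma-separated partition field list):
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   // Illustrative Hudi writer options: one record key plus two partition
   // fields. SimpleKeyGenerator handles a single partition field only;
   // ComplexKeyGenerator splits the comma-separated list.
   class KeyGenOptions {
     static Map<String, String> multiPartitionOptions() {
       Map<String, String> opts = new HashMap<>();
       opts.put("hoodie.datasource.write.recordkey.field", "eid");
       opts.put("hoodie.datasource.write.partitionpath.field", "_db,_table");
       opts.put("hoodie.datasource.write.keygenerator.class",
           "org.apache.hudi.keygen.ComplexKeyGenerator");
       return opts;
     }
   }
   ```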





[GitHub] [hudi] danny0405 opened a new pull request, #9594: [HUDI-6742] Remove the log file appending for multiple instants

2023-08-31 Thread via GitHub


danny0405 opened a new pull request, #9594:
URL: https://github.com/apache/hudi/pull/9594

   ### Change Logs
   
   Remove log file appending entirely, to simplify log file rollback and exception handling for the reader.
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] twlo-sandeep commented on pull request #9475: [HUDI-6766] Fixing mysql debezium data loss

2023-08-31 Thread via GitHub


twlo-sandeep commented on PR #9475:
URL: https://github.com/apache/hudi/pull/9475#issuecomment-1702114672

   > There are test failures in Travis.
   
   @danny0405 I don't see any failed tests in either of the failed suites. It looks like a timeout after running for 5hr+. Can you trigger a rerun of the tests?





[GitHub] [hudi] stream2000 commented on a diff in pull request #9558: [HUDI-6481] Support run multi tables services in a single spark job

2023-08-31 Thread via GitHub


stream2000 commented on code in PR #9558:
URL: https://github.com/apache/hudi/pull/9558#discussion_r1312487436


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/multitable/MultiTableServiceUtils.java:
##
@@ -0,0 +1,167 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.multitable;
+
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.concurrent.CopyOnWriteArrayList;
+import java.util.stream.Collectors;
+
+import static 
org.apache.hudi.common.table.HoodieTableMetaClient.METAFOLDER_NAME;
+
+/**
+ * Utils for executing multi-table services
+ */
+public class MultiTableServiceUtils {
+
+  public static class Constants {
+public static final String TABLES_TO_BE_SERVED_PROP = 
"hoodie.tableservice.tablesToServe";
+
+public static final String COMMA_SEPARATOR = ",";
+
+private static final int DEFAULT_LISTING_PARALLELISM = 1500;
+  }
+
+  public static List<String> getTablesToBeServedFromProps(TypedProperties properties) {
+    String combinedTablesString = properties.getString(Constants.TABLES_TO_BE_SERVED_PROP);
+    if (combinedTablesString == null) {
+      return new ArrayList<>();
+    }
+    String[] tablesArray = combinedTablesString.split(Constants.COMMA_SEPARATOR);
+    return Arrays.asList(tablesArray);
+  }
+
+  public static List<String> findHoodieTablesUnderPath(JavaSparkContext jsc, String pathStr) {
+    Path rootPath = new Path(pathStr);
+    SerializableConfiguration conf = new SerializableConfiguration(jsc.hadoopConfiguration());
+    if (isHoodieTable(rootPath, conf.get())) {
+      return Collections.singletonList(pathStr);
+    }
+
+    HoodieSparkEngineContext engineContext = new HoodieSparkEngineContext(jsc);
+    List<String> hoodieTablePaths = new CopyOnWriteArrayList<>();
+    List<Path> pathsToList = new CopyOnWriteArrayList<>();
+    pathsToList.add(rootPath);
+    int listingParallelism = Math.min(Constants.DEFAULT_LISTING_PARALLELISM, pathsToList.size());
+
+    while (!pathsToList.isEmpty()) {
+      // List all directories in parallel
+      List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> {
+        FileSystem fileSystem = path.getFileSystem(conf.get());
+        return fileSystem.listStatus(path);
+      }, listingParallelism);
+      pathsToList.clear();
+
+      // if the current directory contains the meta folder (.hoodie), add it to the result. Otherwise, add it to the queue
+      List<FileStatus> dirs = dirToFileListing.stream().flatMap(Arrays::stream)
+          .filter(FileStatus::isDirectory)
+          .collect(Collectors.toList());
+
+      if (!dirs.isEmpty()) {
+        List<Pair<Integer, Path>> dirResults = engineContext.map(dirs, fileStatus -> {
+          if (isHoodieTable(fileStatus.getPath(), conf.get())) {

Review Comment:
   Nice catch~ Using hard-coded magic numbers is not a good design; I've updated them to meaningful enum constants.
   
   Updated to: 
   ```java
  /**
   * Type of directories when searching hoodie tables under path
   */
  enum DirType {
    HOODIE_TABLE, // previously 0
    NORMAL_DIR,   // previously 1
    META_FOLDER   // previously 2
  }
   ```
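   
   A sketch of how the enum could read at the use site (the method and the helpers around it are assumptions, not the final code):
   
   ```java
   // Hypothetical classification step inside the parallel listing; assumes the
   // surrounding class's isHoodieTable(...) helper and METAFOLDER_NAME constant.
   static Pair<DirType, Path> classifyDir(FileStatus dir, Configuration conf) throws IOException {
     Path path = dir.getPath();
     if (path.getName().equals(METAFOLDER_NAME)) {
       return Pair.of(DirType.META_FOLDER, path.getParent()); // the parent is a Hudi table
     }
     if (isHoodieTable(path, conf)) {
       return Pair.of(DirType.HOODIE_TABLE, path);
     }
     return Pair.of(DirType.NORMAL_DIR, path); // keep scanning below this dir
   }
   ```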
   



##
hudi-utilities/src/main/java/org/apache/hudi/utilities/multitable/HoodieMultiTableServicesMain.java:
##
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except 

[GitHub] [hudi] hudi-bot commented on pull request #9553: [HUDI-1517][HUDI-6758][HUDI-6761] Adding support for per-logfile marker to track all log files added by a commit and to assist with rollbacks

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9553:
URL: https://github.com/apache/hudi/pull/9553#issuecomment-1702097156

   
   ## CI report:
   
   * aeac327c3cad812fea5e2bc01c07c1314bbf1838 UNKNOWN
   * 2554ca28ddffba3e8ffb64db090daf85ffae187b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19555)
 
   * 835ac846b8de9a27eac4a1e2e3eb27fbdf55c9dd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19596)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9515: [HUDI-2141] Support flink compaction metrics

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9515:
URL: https://github.com/apache/hudi/pull/9515#issuecomment-1702097038

   
   ## CI report:
   
   * 33ea8bad45355a5cfb69955f372f0e3a87540aae Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19466)
 
   * a11cc23103021a2916d2759bead59b61a80e50f7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19597)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9553: [HUDI-1517][HUDI-6758][HUDI-6761] Adding support for per-logfile marker to track all log files added by a commit and to assist with rollbacks

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9553:
URL: https://github.com/apache/hudi/pull/9553#issuecomment-1702091723

   
   ## CI report:
   
   * aeac327c3cad812fea5e2bc01c07c1314bbf1838 UNKNOWN
   * 2554ca28ddffba3e8ffb64db090daf85ffae187b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19555)
 
   * 835ac846b8de9a27eac4a1e2e3eb27fbdf55c9dd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9515: [HUDI-2141] Support flink compaction metrics

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9515:
URL: https://github.com/apache/hudi/pull/9515#issuecomment-1702091623

   
   ## CI report:
   
   * 33ea8bad45355a5cfb69955f372f0e3a87540aae Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19466)
 
   * a11cc23103021a2916d2759bead59b61a80e50f7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9584: [HUDI-6808] SkipCompaction Config should not affect the stream read of the cow table

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9584:
URL: https://github.com/apache/hudi/pull/9584#issuecomment-1702078216

   
   ## CI report:
   
   * cd3a969fbe188f1bcf77047d898d5d05e3566caa Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19580)
 
   * cba1cba13bbd6ae0fcd237c1bedbc99a626909f3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19594)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[hudi] branch master updated: [HUDI-6579] Fix streaming write when meta cols dropped (#9589)

2023-08-31 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 1450b1b04f7 [HUDI-6579] Fix streaming write when meta cols dropped (#9589)
1450b1b04f7 is described below

commit 1450b1b04f7feef4e49dabdac3fb062e04a90c58
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Thu Aug 31 21:57:11 2023 -0500

[HUDI-6579] Fix streaming write when meta cols dropped (#9589)
---
 .../main/scala/org/apache/hudi/DefaultSource.scala | 36 +++---
 .../org/apache/hudi/HoodieCreateRecordUtils.scala  | 11 +++
 .../org/apache/hudi/HoodieSparkSqlWriter.scala | 14 -
 3 files changed, 29 insertions(+), 32 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
index 5a0b0a53d33..f982fb1e1c3 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
@@ -19,17 +19,17 @@ package org.apache.hudi
 
 import org.apache.hadoop.fs.Path
 import org.apache.hudi.DataSourceReadOptions._
-import org.apache.hudi.DataSourceWriteOptions.{BOOTSTRAP_OPERATION_OPT_VAL, OPERATION, RECORDKEY_FIELD, SPARK_SQL_WRITES_PREPPED_KEY, STREAMING_CHECKPOINT_IDENTIFIER}
+import org.apache.hudi.DataSourceWriteOptions.{BOOTSTRAP_OPERATION_OPT_VAL, OPERATION, STREAMING_CHECKPOINT_IDENTIFIER}
 import org.apache.hudi.cdc.CDCRelation
 import org.apache.hudi.common.fs.FSUtils
 import org.apache.hudi.common.model.HoodieTableType.{COPY_ON_WRITE, MERGE_ON_READ}
-import org.apache.hudi.common.model.{HoodieRecord, WriteConcurrencyMode}
+import org.apache.hudi.common.model.WriteConcurrencyMode
 import org.apache.hudi.common.table.timeline.HoodieInstant
 import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
 import org.apache.hudi.common.util.ConfigUtils
 import org.apache.hudi.common.util.ValidationUtils.checkState
 import org.apache.hudi.config.HoodieBootstrapConfig.DATA_QUERIES_ONLY
-import org.apache.hudi.config.HoodieWriteConfig.{SPARK_SQL_MERGE_INTO_PREPPED_KEY, WRITE_CONCURRENCY_MODE}
+import org.apache.hudi.config.HoodieWriteConfig.WRITE_CONCURRENCY_MODE
 import org.apache.hudi.exception.HoodieException
 import org.apache.hudi.util.PathUtils
 import org.apache.spark.sql.execution.streaming.{Sink, Source}
@@ -124,21 +124,21 @@ class DefaultSource extends RelationProvider
   }
 
   /**
-* This DataSource API is used for writing the DataFrame at the destination. For now, we are returning a dummy
-* relation here because Spark does not really make use of the relation returned, and just returns an empty
-* dataset at [[org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run()]]. This saves us the cost
-* of creating and returning a parquet relation here.
-*
-* TODO: Revisit to return a concrete relation here when we support CREATE TABLE AS for Hudi with DataSource API.
-*   That is the only case where Spark seems to actually need a relation to be returned here
-*   [[org.apache.spark.sql.execution.datasources.DataSource.writeAndRead()]]
-*
-* @param sqlContext Spark SQL Context
-* @param mode Mode for saving the DataFrame at the destination
-* @param optParams Parameters passed as part of the DataFrame write operation
-* @param rawDf Spark DataFrame to be written
-* @return Spark Relation
-*/
+   * This DataSource API is used for writing the DataFrame at the destination. For now, we are returning a dummy
+   * relation here because Spark does not really make use of the relation returned, and just returns an empty
+   * dataset at [[org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run()]]. This saves us the cost
+   * of creating and returning a parquet relation here.
+   *
+   * TODO: Revisit to return a concrete relation here when we support CREATE TABLE AS for Hudi with DataSource API.
+   * That is the only case where Spark seems to actually need a relation to be returned here
+   * [[org.apache.spark.sql.execution.datasources.DataSource.writeAndRead()]]
+   *
+   * @param sqlContext Spark SQL Context
+   * @param mode       Mode for saving the DataFrame at the destination
+   * @param optParams  Parameters passed as part of the DataFrame write operation
+   * @param df         Spark DataFrame to be written
+   * @return Spark Relation
+   */
   override def createRelation(sqlContext: SQLContext,
                               mode: SaveMode,
                               optParams: Map[String, String],
diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apac

[GitHub] [hudi] xushiyan merged pull request #9589: [HUDI-6579] Fix streaming write when meta cols dropped

2023-08-31 Thread via GitHub


xushiyan merged PR #9589:
URL: https://github.com/apache/hudi/pull/9589





[GitHub] [hudi] stream2000 commented on pull request #9515: [HUDI-2141] Support flink compaction metrics

2023-08-31 Thread via GitHub


stream2000 commented on PR #9515:
URL: https://github.com/apache/hudi/pull/9515#issuecomment-1702051289

   
![image](https://github.com/apache/hudi/assets/39240496/56dcc6ee-4045-4f52-acb2-1a5883a9f772)
   Tested locally; only these four metrics are useless. Let's remove them until we support coordinator metrics. @danny0405 What do you think?





[GitHub] [hudi] beyond1920 commented on a diff in pull request #7907: [HUDI-6495][RFC-66] Non-blocking multi writer support

2023-08-31 Thread via GitHub


beyond1920 commented on code in PR #7907:
URL: https://github.com/apache/hudi/pull/7907#discussion_r1312463342


##
rfc/rfc-66/rfc-66.md:
##
@@ -0,0 +1,124 @@
+# RFC-66: Lockless Multi Writer
+
+## Proposers
+- @danny0405
+- @ForwardXu
+- @SteNicholas
+
+## Approvers
+-
+
+## Status
+
+JIRA: [Lockless multi writer support](https://issues.apache.org/jira/browse/HUDI-5672)
+
+## Abstract
+As you know, Hudi already supports basic OCC with abundant lock providers.
+But for multiple streaming ingestion writers, OCC does not work well because conflicts happen at a very high frequency.
+To expand on that a little: with the hashing index, all the writers use a deterministic hashing algorithm to distribute records by primary key,
+so the keys are spread evenly across all the data buckets. For a single data flush in one writer, almost all the data buckets receive new inputs,
+so conflicts are very likely to happen for multi-writer, because almost all the data buckets are being written by multiple writers at the same time.
+For the bloom filter index, things are different, but remember that we have a small-file load-rebalance strategy that writes into the **small** buckets with higher priority;
+that means multiple writers are prone to write into the same **small** buckets at the same time, and that's how conflicts happen.
+
+In general, for multiple streaming ingestion writers, OCC is not very feasible in production. In this RFC, we propose a non-blocking solution for streaming ingestion.
+
+## Background
+
+Streaming jobs are naturally suitable for data ingestion: there is no complexity of pipeline orchestration, and the write workload is smoother.
+Most of the raw data sets we are handling today are generated continuously, in a streaming way.
+
+Based on that, many requests for multi-writer ingestion are derived. With multi-writer ingestion, several streams of events with the same schema can be drained into one Hudi table;
+the Hudi table kind of becomes a UNION table view over all the input data sets. This is a very common use case because, in reality, the data sets are usually scattered all over the data sources.
+
+Another very useful use case we want to unlock is the real-time data set join. One of the biggest pain points in streaming computation is the data set join:
+an engine like Flink has basic support for all kinds of SQL JOINs, but it stores the input records within its internal state backend, which is a huge cost for a pure data join with no additional computation.
+In [HUDI-3304](https://issues.apache.org/jira/browse/HUDI-3304), we introduced a `PartialUpdateAvroPayload`; in combination with the lockless multi-writer,
+we can implement an N-way data source join in real time! Hudi takes care of the payload join during the compaction service procedure.
+
+## Design
+
+### The Precondition
+
+#### MOR Table Type Is Required
+
+The table type must be `MERGE_ON_READ`, so that we can defer the conflict 
resolution to the compaction phase. The compaction service would resolve the 
conflicts of the same keys by respecting the event time sequence of the events.
+
+#### Deterministic Bucketing Strategy
+
+A deterministic bucketing strategy is required, because the same record keys 
from different writers must be distributed into the same bucket, not 
only for UPSERTs, but also for all new INSERTs.
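
For illustration only (the hash function and fixed bucket count here are assumptions, not Hudi's exact bucket index implementation), a deterministic assignment looks like:

```java
// Minimal sketch: every writer derives the same bucket for the same record key,
// so records with identical keys always land in the same file group.
public final class BucketAssigner {
  private final int numBuckets;

  public BucketAssigner(int numBuckets) {
    this.numBuckets = numBuckets;
  }

  public int bucketFor(String recordKey) {
    // Mask the sign bit so the modulo result is always non-negative.
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
  }
}
```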
+
+#### Lazy Cleaning Strategy
+
+Configure the cleaning strategy as lazy so that pending instants are not 
rolled back by the other active writers.
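
A minimal sketch, assuming current Hudi write-config builder names (verify against your Hudi version):

```java
// Lazy cleaning: failed writes are cleaned lazily instead of eagerly rolling
// back pending instants, so concurrent writers are not disturbed.
// basePath is assumed to be the table's base path, already in scope.
HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
    .withPath(basePath)
    .withCleanConfig(HoodieCleanConfig.newBuilder()
        .withFailedWritesCleaningPolicy(HoodieFailedWritesCleaningPolicy.LAZY)
        .build())
    .build();
```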
+
+### Basic Work Flow
+
+#### Writing Log Files Separately In Sequence
+
+Basically, each writer flushes its log files in sequence, and a log file rolls 
over to a new version number;
+a pivotal thing to note here is that the write_token must be unique across 
writers for same-version log files with the same base instant time,
+so that the file names of different writers do not conflict.
+
+The log files generated by a single writer still preserve their sequence via the 
version number, which is important if the natural order is needed for single-
writer events.
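
As an illustration of the naming scheme (the layout mirrors Hudi's `.<fileId>_<baseInstant>.log.<version>_<writeToken>` pattern; the writer-id component is an assumption for this sketch):

```java
// Embedding a writer-unique identifier in the write token guarantees that two
// writers rolling over the same log version never produce the same file name.
static String logFileName(String fileId, String baseInstant, int version,
                          String writerId, int attempt) {
  String writeToken = attempt + "-" + writerId;  // unique per writer
  return "." + fileId + "_" + baseInstant + ".log." + version + "_" + writeToken;
}
```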
+
+![multi-writer](multi_writer.png)
+
+### The Compaction Procedure
+
+The compaction service is the component that actually resolves the conflicts. 
Within a file group, it sorts the files and then merges all the record payloads for 
a record key.
+The event-time sequence is respected by combining the payloads using the event-time 
field provided by the payload (known as the `preCombine` field in Hudi).
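
A minimal sketch of that resolution rule (the names are illustrative, not Hudi's merger API):

```java
import java.util.function.ToLongFunction;

// Keep the payload with the larger ordering (preCombine/event-time) value;
// ties go to the newer payload.
final class EventTimeResolver {
  static <T> T resolve(T older, T newer, ToLongFunction<T> orderingValue) {
    return orderingValue.applyAsLong(newer) >= orderingValue.applyAsLong(older)
        ? newer : older;
  }
}
```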
+
+![compaction procedure](compaction.png)
+
+#### Non-Serial Compaction Plan Schedule
+Currently, compaction plan scheduling must be serialized with the writers; that 
means no ongoing writer should be writing to
+the table while the compaction plan is being scheduled. This restriction makes 
compaction almost impossible for multiple streaming writers, because with 
streaming ingestion there is always an instant being written to the table.
+
+In order to unblock the compaction 

[jira] [Updated] (HUDI-6702) Extend merge API to support all merging operations

2023-08-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6702:

Reviewers: Ethan Guo

> Extend merge API to support all merging operations
> --
>
> Key: HUDI-6702
> URL: https://issues.apache.org/jira/browse/HUDI-6702
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Sagar Sumit
>Assignee: Lin Liu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> See this issue for more details- [https://github.com/apache/hudi/issues/9430]
> We may have to introduce a new API or figure out a way for the current merger 
> to skip empty records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6784) Support custom logic for deletion

2023-08-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6784:

Reviewers: Ethan Guo

> Support custom logic for deletion
> -
>
> Key: HUDI-6784
> URL: https://issues.apache.org/jira/browse/HUDI-6784
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> Add an `Optional<>` wrapper for the newer parameter in the merger. If newer is empty, it 
> means this is a deletion operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9585: [HUDI-6809] Optimizing the judgment of generating clustering plans

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9585:
URL: https://github.com/apache/hudi/pull/9585#issuecomment-1702018623

   
   ## CI report:
   
   * 9a2675de94095d2baac571a6dd71ec368b8a9e8c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19582)
 
   * 67e18f40f585f17a96068ca4737a0dd7d800354e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19593)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9584: [HUDI-6808] SkipCompaction Config should not affect the stream read of the cow table

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9584:
URL: https://github.com/apache/hudi/pull/9584#issuecomment-1702018596

   
   ## CI report:
   
   * cd3a969fbe188f1bcf77047d898d5d05e3566caa Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19580)
 
   * cba1cba13bbd6ae0fcd237c1bedbc99a626909f3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9571: Enabling comprehensive schema evolution in delta streamer code

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9571:
URL: https://github.com/apache/hudi/pull/9571#issuecomment-1702018490

   
   ## CI report:
   
   * 871ff24da9c3800b8f19bdabda140621549aaf3b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19588)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #9589: [HUDI-6579] Fix streaming write when meta cols dropped

2023-08-31 Thread via GitHub


nsivabalan commented on code in PR #9589:
URL: https://github.com/apache/hudi/pull/9589#discussion_r1312461061


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCreateRecordUtils.scala:
##
@@ -98,7 +95,7 @@ object HoodieCreateRecordUtils {
   }
 }
 // we can skip key generator for prepped flow
-val usePreppedInsteadOfKeyGen = preppedSparkSqlWrites && 
preppedWriteOperation
+val usePreppedInsteadOfKeyGen = preppedSparkSqlWrites || 
preppedWriteOperation

Review Comment:
   yes. this looks good



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-6785) Introduce an engine-agnostic FileGroupReader for snapshot read

2023-08-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-6785:
---

Assignee: Ethan Guo

> Introduce an engine-agnostic FileGroupReader for snapshot read
> --
>
> Key: HUDI-6785
> URL: https://issues.apache.org/jira/browse/HUDI-6785
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9585: [HUDI-6809] Optimizing the judgment of generating clustering plans

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9585:
URL: https://github.com/apache/hudi/pull/9585#issuecomment-1702012580

   
   ## CI report:
   
   * 9a2675de94095d2baac571a6dd71ec368b8a9e8c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19582)
 
   * 67e18f40f585f17a96068ca4737a0dd7d800354e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhuanshenbsj1 commented on a diff in pull request #9584: [HUDI-6808] SkipCompaction Config should not affect the stream read of the cow table

2023-08-31 Thread via GitHub


zhuanshenbsj1 commented on code in PR #9584:
URL: https://github.com/apache/hudi/pull/9584#discussion_r1312456055


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:
##
@@ -601,9 +602,9 @@ public List<HoodieInstant> filterInstantsWithRange(
* @return the filtered timeline
*/
   @VisibleForTesting
-  public HoodieTimeline filterInstantsAsPerUserConfigs(HoodieTimeline 
timeline) {
+  public HoodieTimeline filterInstantsAsPerUserConfigs(HoodieTimeline 
timeline, HoodieTableType tableType) {
 final HoodieTimeline oriTimeline = timeline;
-if (this.skipCompaction) {
+if (OptionsResolver.isMorTable(this.conf) && this.skipCompaction) {

Review Comment:
   Removed the para HoodieTableType.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6784) Support custom logic for deletion

2023-08-31 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu updated HUDI-6784:
--
Status: Patch Available  (was: In Progress)

> Support custom logic for deletion
> -
>
> Key: HUDI-6784
> URL: https://issues.apache.org/jira/browse/HUDI-6784
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Lin Liu
>Assignee: Lin Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> Add an `Optional<>` wrapper for the newer parameter in the merger. If newer is empty, it 
> means this is a deletion operation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6702) Extend merge API to support all merging operations

2023-08-31 Thread Lin Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu updated HUDI-6702:
--
Status: Patch Available  (was: In Progress)

> Extend merge API to support all merging operations
> --
>
> Key: HUDI-6702
> URL: https://issues.apache.org/jira/browse/HUDI-6702
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Sagar Sumit
>Assignee: Lin Liu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> See this issue for more details- [https://github.com/apache/hudi/issues/9430]
> We may have to introduce a new API or figure out a way for the current merger 
> to skip empty records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6779) Audit current hoodie.properties

2023-08-31 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-6779.
-
Resolution: Done

> Audit current hoodie.properties
> ---
>
> Key: HUDI-6779
> URL: https://issues.apache.org/jira/browse/HUDI-6779
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
>
> Remove some configs from table to write configs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6780) Replace classnames by modes/enums in table properties

2023-08-31 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-6780:
--
Reviewers: Danny Chen

> Replace classnames by modes/enums in table properties
> -
>
> Key: HUDI-6780
> URL: https://issues.apache.org/jira/browse/HUDI-6780
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6780) Replace classnames by modes/enums in table properties

2023-08-31 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-6780:
--
Status: Patch Available  (was: In Progress)

> Replace classnames by modes/enums in table properties
> -
>
> Key: HUDI-6780
> URL: https://issues.apache.org/jira/browse/HUDI-6780
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6779) Audit current hoodie.properties

2023-08-31 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-6779:
--
Status: Patch Available  (was: In Progress)

> Audit current hoodie.properties
> ---
>
> Key: HUDI-6779
> URL: https://issues.apache.org/jira/browse/HUDI-6779
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
> Fix For: 1.0.0
>
>
> Remove some configs from table to write configs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] linliu-code commented on pull request #9593: [HUDI-6784][RFC-46] Support deletion logic in merger

2023-08-31 Thread via GitHub


linliu-code commented on PR #9593:
URL: https://github.com/apache/hudi/pull/9593#issuecomment-1702005828

   @yihua @danny0405 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] linliu-code opened a new pull request, #9593: [HUDI-6784][RFC-46] Support deletion logic in merger

2023-08-31 Thread via GitHub


linliu-code opened a new pull request, #9593:
URL: https://github.com/apache/hudi/pull/9593

   ### Change Logs
   
   The solution is to add an Option wrapper for the older and newer parameters in the 
merge API. In this way, the update, delete, and combine logics are merged into 
one API, as sketched below.
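
   A hedged sketch of the resulting shape (the signature below is illustrative; see the PR diff for the actual interface):

```java
// Wrapping both sides in Option lets one API express update (both present),
// insert (older empty), and delete (newer empty).
Option<Pair<HoodieRecord, Schema>> merge(
    Option<HoodieRecord> older, Schema oldSchema,
    Option<HoodieRecord> newer, Schema newSchema,
    TypedProperties props) throws IOException;
```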
   
   TESTS:
   Unit tests are added for existing merger implementations.
   
   ### Impact
   
   Users can now implement the merge API to support their own deletion logic. 
Previously, deletion was not supported.
   
   ### Risk level (write none, low medium or high below)
   
   Low.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Zouxxyy closed pull request #9573: [HUDI-6804] Fix hive read schema evolution MOR table

2023-08-31 Thread via GitHub


Zouxxyy closed pull request #9573: [HUDI-6804] Fix hive read schema evolution 
MOR table
URL: https://github.com/apache/hudi/pull/9573


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6742) Remove the log file appending for multiple instants

2023-08-31 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6742:
-
Status: In Progress  (was: Open)

> Remove the log file appending for multiple instants
> ---
>
> Key: HUDI-6742
> URL: https://issues.apache.org/jira/browse/HUDI-6742
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6725) Support efficient completion time queries on the timeline

2023-08-31 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6725:
-
Status: Patch Available  (was: In Progress)

> Support efficient completion time queries on the timeline
> -
>
> Key: HUDI-6725
> URL: https://issues.apache.org/jira/browse/HUDI-6725
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> The basic idea is that we do an eager loading of the completion time on the archived 
> timeline, for example, the last 3 days, and all the completed instants of the 
> active timeline.
> If a query asks about a completion time earlier than that time range, 
> just do a lazy lookup on the archived timeline.
>  
> Probably we would write a completion time loader.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on a diff in pull request #9592: automatically create a database when using the flink catalog dfs mode

2023-08-31 Thread via GitHub


danny0405 commented on code in PR #9592:
URL: https://github.com/apache/hudi/pull/9592#discussion_r1312444984


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieCatalog.java:
##
@@ -125,6 +125,16 @@ public void open() throws CatalogException {
 } catch (IOException e) {
   throw new CatalogException(String.format("Checking catalog path %s 
exists exception.", catalogPathStr), e);
 }
+
+if (!databaseExists(getDefaultDatabase())) {
+  LOG.info("Creating database {} automatically because it does not 
exist.", getDefaultDatabase());
+  Path dbPath = new Path(catalogPath, getDefaultDatabase());

Review Comment:
   Can we write a test case for it in `TestHoodieCatalog`.
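
   A hedged sketch of what such a test could assert (the setup helper and assertion shape are assumptions, not the final test):

```java
@Test
void testOpenCreatesDefaultDatabase() throws Exception {
  HoodieCatalog catalog = createCatalog();  // hypothetical setup helper
  catalog.open();
  // The default database directory should now exist under the catalog path.
  assertTrue(catalog.databaseExists(catalog.getDefaultDatabase()));
}
```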



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 merged pull request #9583: [MINOR] Update operator name for compact&clustering test class

2023-08-31 Thread via GitHub


danny0405 merged PR #9583:
URL: https://github.com/apache/hudi/pull/9583


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [MINOR] Update operator name for compact&clustering test class (#9583)

2023-08-31 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 6f2e19d933c [MINOR] Update operator name for compact&clustering test 
class (#9583)
6f2e19d933c is described below

commit 6f2e19d933cdd086a1220824bffe6e28b7a50174
Author: hehuiyuan <471627...@qq.com>
AuthorDate: Fri Sep 1 09:42:36 2023 +0800

[MINOR] Update operator name for compact&clustering test class (#9583)
---
 .../org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java | 4 ++--
 .../org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java  | 8 
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java
 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java
index 18a8aebb8fd..4c817a7927a 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/cluster/ITTestHoodieFlinkClustering.java
@@ -410,8 +410,8 @@ public class ITTestHoodieFlinkClustering {
 // keep pending clustering, not committing clustering
 dataStream
 .addSink(new DiscardingSink<>())
-.name("clustering_commit")
-.uid("uid_clustering_commit")
+.name("discarding-sink")
+.uid("uid_discarding-sink")
 .setParallelism(1);
 
 env.execute("flink_hudi_clustering");
diff --git 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java
 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java
index b032ad46765..ac2d93a7305 100644
--- 
a/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java
+++ 
b/hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/compact/ITTestHoodieFlinkCompactor.java
@@ -175,8 +175,8 @@ public class ITTestHoodieFlinkCompactor {
 new CompactOperator(conf))
 .setParallelism(FlinkMiniCluster.DEFAULT_PARALLELISM)
 .addSink(new CompactionCommitSink(conf))
-.name("clean_commits")
-.uid("uid_clean_commits")
+.name("compaction_commit")
+.uid("uid_compaction_commit")
 .setParallelism(1);
 
 env.execute("flink_hudi_compaction");
@@ -256,8 +256,8 @@ public class ITTestHoodieFlinkCompactor {
 new CompactOperator(conf))
 .setParallelism(FlinkMiniCluster.DEFAULT_PARALLELISM)
 .addSink(new CompactionCommitSink(conf))
-.name("clean_commits")
-.uid("uid_clean_commits")
+.name("compaction_commit")
+.uid("uid_compaction_commit")
 .setParallelism(1);
 
 env.execute("flink_hudi_compaction");



[GitHub] [hudi] danny0405 commented on a diff in pull request #9577: [HUDI-6805] Print detailed error message in clustering

2023-08-31 Thread via GitHub


danny0405 commented on code in PR #9577:
URL: https://github.com/apache/hudi/pull/9577#discussion_r1312443728


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java:
##
@@ -241,6 +242,9 @@ public WriteStatus close() throws IOException {
 stat.setTotalWriteBytes(fileSizeInBytes);
 stat.setFileSizeInBytes(fileSizeInBytes);
 stat.setTotalWriteErrors(writeStatus.getTotalErrorRecords());
+for (Pair pair : 
writeStatus.getFailedRecords()) {
+  LOG.error("Failed to write {}", pair.getLeft(), pair.getRight());
+}

Review Comment:
   Is there any possibility that we have too many failed records to print, so that the logs get 
overwhelmed?
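
   One possible mitigation, as a sketch only (the cap is an assumption, not the PR's code):

```java
int maxToLog = 10;  // assumed cap on fully-logged failures
writeStatus.getFailedRecords().stream()
    .limit(maxToLog)
    .forEach(pair -> LOG.error("Failed to write {}", pair.getLeft(), pair.getRight()));
int total = writeStatus.getFailedRecords().size();
if (total > maxToLog) {
  LOG.error("Suppressed {} additional failed records", total - maxToLog);
}
```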



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9584: [HUDI-6808] SkipCompaction Config should not affect the stream read of the cow table

2023-08-31 Thread via GitHub


danny0405 commented on code in PR #9584:
URL: https://github.com/apache/hudi/pull/9584#discussion_r1312442629


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:
##
@@ -601,9 +602,9 @@ public List<HoodieInstant> filterInstantsWithRange(
* @return the filtered timeline
*/
   @VisibleForTesting
-  public HoodieTimeline filterInstantsAsPerUserConfigs(HoodieTimeline 
timeline) {
+  public HoodieTimeline filterInstantsAsPerUserConfigs(HoodieTimeline 
timeline, HoodieTableType tableType) {
 final HoodieTimeline oriTimeline = timeline;
-if (this.skipCompaction) {
+if (OptionsResolver.isMorTable(this.conf) && this.skipCompaction) {

Review Comment:
   There is no need to pass around the `HoodieTableType` now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6066) HoodieTableSource supports parquet predicate push down

2023-08-31 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-6066:
-
Fix Version/s: 1.0.0

> HoodieTableSource supports parquet predicate push down
> --
>
> Key: HUDI-6066
> URL: https://issues.apache.org/jira/browse/HUDI-6066
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> HoodieTableSource supports the implementation of the SupportsFilterPushDown 
> interface, which pushes filters down into FileIndex. HoodieTableSource should 
> support parquet predicate push down for query performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6066) HoodieTableSource supports parquet predicate push down

2023-08-31 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen closed HUDI-6066.

Resolution: Fixed

Fixed via master branch: 9fa00b7b1547ff46a1bea6d329e20dd702ff90b5

> HoodieTableSource supports parquet predicate push down
> --
>
> Key: HUDI-6066
> URL: https://issues.apache.org/jira/browse/HUDI-6066
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Nicholas Jiang
>Assignee: Nicholas Jiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> HoodieTableSource supports the implementation of the SupportsFilterPushDown 
> interface, which pushes filters down into FileIndex. HoodieTableSource should 
> support parquet predicate push down for query performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated: [HUDI-6066] HoodieTableSource supports parquet predicate push down (#8437)

2023-08-31 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 9fa00b7b154 [HUDI-6066] HoodieTableSource supports parquet predicate 
push down (#8437)
9fa00b7b154 is described below

commit 9fa00b7b1547ff46a1bea6d329e20dd702ff90b5
Author: Nicholas Jiang 
AuthorDate: Fri Sep 1 09:36:45 2023 +0800

[HUDI-6066] HoodieTableSource supports parquet predicate push down (#8437)
---
 .../apache/hudi/source/ExpressionPredicates.java   | 654 +
 .../org/apache/hudi/table/HoodieTableSource.java   |  18 +-
 .../apache/hudi/table/format/RecordIterators.java  |  60 +-
 .../hudi/table/format/cdc/CdcInputFormat.java  |  11 +-
 .../table/format/cow/CopyOnWriteInputFormat.java   |   9 +-
 .../table/format/mor/MergeOnReadInputFormat.java   |  17 +-
 .../hudi/source/TestExpressionPredicates.java  | 167 ++
 .../apache/hudi/table/ITTestHoodieDataSource.java  |  14 +
 .../apache/hudi/table/TestHoodieTableSource.java   |  23 +
 .../table/format/cow/ParquetSplitReaderUtil.java   |  10 +-
 .../reader/ParquetColumnarRowSplitReader.java  |  10 +-
 .../table/format/cow/ParquetSplitReaderUtil.java   |  10 +-
 .../reader/ParquetColumnarRowSplitReader.java  |  10 +-
 .../table/format/cow/ParquetSplitReaderUtil.java   |  10 +-
 .../reader/ParquetColumnarRowSplitReader.java  |  10 +-
 .../table/format/cow/ParquetSplitReaderUtil.java   |  10 +-
 .../reader/ParquetColumnarRowSplitReader.java  |  10 +-
 .../table/format/cow/ParquetSplitReaderUtil.java   |  10 +-
 .../reader/ParquetColumnarRowSplitReader.java  |  10 +-
 19 files changed, 1037 insertions(+), 36 deletions(-)

diff --git 
a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java
 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java
new file mode 100644
index 000..046e4b739ad
--- /dev/null
+++ 
b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java
@@ -0,0 +1,654 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.source;
+
+import org.apache.flink.table.expressions.CallExpression;
+import org.apache.flink.table.expressions.Expression;
+import org.apache.flink.table.expressions.FieldReferenceExpression;
+import org.apache.flink.table.expressions.ResolvedExpression;
+import org.apache.flink.table.expressions.ValueLiteralExpression;
+import org.apache.flink.table.functions.BuiltInFunctionDefinitions;
+import org.apache.flink.table.functions.FunctionDefinition;
+import org.apache.flink.table.types.logical.LogicalType;
+import org.apache.parquet.filter2.predicate.FilterPredicate;
+import org.apache.parquet.filter2.predicate.Operators;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Objects;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+import static org.apache.hudi.common.util.ValidationUtils.checkState;
+import static org.apache.hudi.util.ExpressionUtils.getValueFromLiteral;
+import static org.apache.parquet.filter2.predicate.FilterApi.and;
+import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn;
+import static org.apache.parquet.filter2.predicate.FilterApi.booleanColumn;
+import static org.apache.parquet.filter2.predicate.FilterApi.doubleColumn;
+import static org.apache.parquet.filter2.predicate.FilterApi.eq;
+import static org.apache.parquet.filter2.predicate.FilterApi.floatColumn;
+import static org.apache.parquet.filter2.predicate.FilterApi.gt;
+import static org.apache.parquet.filter2.predicate.FilterApi.gtEq;
+import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;
+import static org.apache.parquet.filter2.predicate.FilterApi.longColumn;
+import static org.apache.parquet.filter2.predicate.FilterApi.lt;
+import static org.apache.parquet.filter2.predicate.FilterApi.ltEq;
+import static org.apache.parquet.filter2.predicate.FilterApi.not;
+import static 
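
The static imports above map straight onto parquet-mr's `FilterApi`; for illustration (an editor example, not code from this commit), a pushed-down range predicate composes as:

```java
// age >= 18 AND age < 65, using the FilterApi methods imported above
FilterPredicate agePredicate = and(
    gtEq(intColumn("age"), 18),
    lt(intColumn("age"), 65));
```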

[GitHub] [hudi] danny0405 merged pull request #8437: [HUDI-6066] HoodieTableSource supports parquet predicate push down

2023-08-31 Thread via GitHub


danny0405 merged PR #8437:
URL: https://github.com/apache/hudi/pull/8437


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #9475: [HUDI-6766] Fixing mysql debezium data loss

2023-08-31 Thread via GitHub


danny0405 commented on PR #9475:
URL: https://github.com/apache/hudi/pull/9475#issuecomment-1701989413

   There are test failures in Travis.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] stream2000 commented on a diff in pull request #9515: [HUDI-2141] Support flink compaction metrics

2023-08-31 Thread via GitHub


stream2000 commented on code in PR #9515:
URL: https://github.com/apache/hudi/pull/9515#discussion_r1312437663


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/metrics/FlinkWriteMetrics.java:
##
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics;
+
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+
+import org.apache.flink.metrics.MetricGroup;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.text.ParseException;
+
+/**
+ * Common flink write commit metadata metrics
+ */
+public class FlinkWriteMetrics extends HoodieFlinkMetrics {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(FlinkWriteMetrics.class);
+
+  protected final String actionType;
+
+  private long totalPartitionsWritten;
+  private long totalFilesInsert;
+  private long totalFilesUpdate;
+  private long totalRecordsWritten;
+  private long totalUpdateRecordsWritten;
+  private long totalInsertRecordsWritten;
+  private long totalBytesWritten;
+  private long totalScanTime;
+  private long totalCreateTime;
+  private long totalUpsertTime;
+  private long totalCompactedRecordsUpdated;
+  private long totalLogFilesCompacted;
+  private long totalLogFilesSize;
+  private long commitLatencyInMs;
+  private long commitFreshnessInMs;
+  private long commitEpochTimeInMs;
+  private long durationInMs;
+
+  public FlinkWriteMetrics(MetricGroup metricGroup, String actionType) {
+super(metricGroup);
+this.actionType = actionType;
+  }
+
+  @Override
+  public void registerMetrics() {
+// register commit gauge
+metricGroup.gauge(getMetricsName(actionType, "totalPartitionsWritten"), () 
-> totalPartitionsWritten);
+metricGroup.gauge(getMetricsName(actionType, "totalFilesInsert"), () -> 
totalFilesInsert);
+metricGroup.gauge(getMetricsName(actionType, "totalFilesUpdate"), () -> 
totalFilesUpdate);
+metricGroup.gauge(getMetricsName(actionType, "totalRecordsWritten"), () -> 
totalRecordsWritten);
+metricGroup.gauge(getMetricsName(actionType, "totalUpdateRecordsWritten"), 
() -> totalUpdateRecordsWritten);
+metricGroup.gauge(getMetricsName(actionType, "totalInsertRecordsWritten"), 
() -> totalInsertRecordsWritten);
+metricGroup.gauge(getMetricsName(actionType, "totalBytesWritten"), () -> 
totalBytesWritten);
+metricGroup.gauge(getMetricsName(actionType, "totalScanTime"), () -> 
totalScanTime);
+metricGroup.gauge(getMetricsName(actionType, "totalCreateTime"), () -> 
totalCreateTime);
+metricGroup.gauge(getMetricsName(actionType, "totalUpsertTime"), () -> 
totalUpsertTime);
+metricGroup.gauge(getMetricsName(actionType, 
"totalCompactedRecordsUpdated"), () -> totalCompactedRecordsUpdated);

Review Comment:
   Yes of course. Will delete them later



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #9515: [HUDI-2141] Support flink compaction metrics

2023-08-31 Thread via GitHub


danny0405 commented on code in PR #9515:
URL: https://github.com/apache/hudi/pull/9515#discussion_r1312435023


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/metrics/FlinkWriteMetrics.java:
##
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics;
+
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+
+import org.apache.flink.metrics.MetricGroup;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.text.ParseException;
+
+/**
+ * Common flink write commit metadata metrics
+ */
+public class FlinkWriteMetrics extends HoodieFlinkMetrics {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(FlinkWriteMetrics.class);
+
+  protected final String actionType;
+
+  private long totalPartitionsWritten;
+  private long totalFilesInsert;
+  private long totalFilesUpdate;
+  private long totalRecordsWritten;
+  private long totalUpdateRecordsWritten;
+  private long totalInsertRecordsWritten;
+  private long totalBytesWritten;
+  private long totalScanTime;
+  private long totalCreateTime;
+  private long totalUpsertTime;
+  private long totalCompactedRecordsUpdated;
+  private long totalLogFilesCompacted;
+  private long totalLogFilesSize;
+  private long commitLatencyInMs;
+  private long commitFreshnessInMs;
+  private long commitEpochTimeInMs;
+  private long durationInMs;
+
+  public FlinkWriteMetrics(MetricGroup metricGroup, String actionType) {
+super(metricGroup);
+this.actionType = actionType;
+  }
+
+  @Override
+  public void registerMetrics() {
+// register commit gauge
+metricGroup.gauge(getMetricsName(actionType, "totalPartitionsWritten"), () 
-> totalPartitionsWritten);
+metricGroup.gauge(getMetricsName(actionType, "totalFilesInsert"), () -> 
totalFilesInsert);
+metricGroup.gauge(getMetricsName(actionType, "totalFilesUpdate"), () -> 
totalFilesUpdate);
+metricGroup.gauge(getMetricsName(actionType, "totalRecordsWritten"), () -> 
totalRecordsWritten);
+metricGroup.gauge(getMetricsName(actionType, "totalUpdateRecordsWritten"), 
() -> totalUpdateRecordsWritten);
+metricGroup.gauge(getMetricsName(actionType, "totalInsertRecordsWritten"), 
() -> totalInsertRecordsWritten);
+metricGroup.gauge(getMetricsName(actionType, "totalBytesWritten"), () -> 
totalBytesWritten);
+metricGroup.gauge(getMetricsName(actionType, "totalScanTime"), () -> 
totalScanTime);
+metricGroup.gauge(getMetricsName(actionType, "totalCreateTime"), () -> 
totalCreateTime);
+metricGroup.gauge(getMetricsName(actionType, "totalUpsertTime"), () -> 
totalUpsertTime);
+metricGroup.gauge(getMetricsName(actionType, 
"totalCompactedRecordsUpdated"), () -> totalCompactedRecordsUpdated);

Review Comment:
   Can we drop these write metrics first until we introduce the coordinator 
metrics?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #9591: [SUPPORT] persist write status RDD in spark compaction job caused the resources could not be released in time

2023-08-31 Thread via GitHub


danny0405 commented on issue #9591:
URL: https://github.com/apache/hudi/issues/9591#issuecomment-1701976717

   cc @nsivabalan , guess the analysis is reasonable?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #9587: [SUPPORT] hoodie.datasource.write.keygenerator.class config not work in bulk_insert mode

2023-08-31 Thread via GitHub


danny0405 commented on issue #9587:
URL: https://github.com/apache/hudi/issues/9587#issuecomment-1701974951

   You are right: because you only have one primary key field (`eid`), you should probably 
set up the Spark key generator as `SIMPLE`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated (59f7d2806bf -> c4c5f3e8667)

2023-08-31 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 59f7d2806bf [HUDI-6562] Fixed issue for delete events for 
AWSDmsAvroPayload when CDC enabled (#9519)
 add c4c5f3e8667 [MINOR] Fix failing schema evolution tests in Flink 
versions < 1.17 (#9586)

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/table/ITTestSchemaEvolution.java   | 23 +++---
 1 file changed, 12 insertions(+), 11 deletions(-)



[GitHub] [hudi] danny0405 merged pull request #9586: [MINOR] Fix failing schema evolution tests in Flink versions < 1.17

2023-08-31 Thread via GitHub


danny0405 merged PR #9586:
URL: https://github.com/apache/hudi/pull/9586


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] leesf commented on a diff in pull request #9558: [HUDI-6481] Support run multi tables services in a single spark job

2023-08-31 Thread via GitHub


leesf commented on code in PR #9558:
URL: https://github.com/apache/hudi/pull/9558#discussion_r1312423032


##
hudi-utilities/src/main/java/org/apache/hudi/utilities/multitable/HoodieMultiTableServicesMain.java:
##
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.multitable;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.utilities.HoodieCompactor;
+import org.apache.hudi.utilities.IdentitySplitter;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.streamer.HoodieStreamer;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.StringJoiner;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+
+/**
+ * Main function for executing multi-table services
+ */
+public class HoodieMultiTableServicesMain {
+  private static final Logger LOG = 
LoggerFactory.getLogger(HoodieStreamer.class);
+  final Config cfg;
+  final TypedProperties props;
+
+  private final JavaSparkContext jsc;
+
+  private ScheduledExecutorService executorService;
+
+  private void batchRunTableServices(List<String> tablePaths) throws 
InterruptedException, ExecutionException {
+ExecutorService executorService = 
Executors.newFixedThreadPool(cfg.poolSize);
+List<CompletableFuture<Void>> futures = tablePaths.stream()
+.map(basePath -> CompletableFuture.runAsync(
+() -> MultiTableServiceUtils.buildTableServicePipeline(jsc, 
basePath, cfg, props).execute(),

Review Comment:
   Should we exit early if no services are enabled?
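
   A hedged sketch of that guard (the pipeline accessor name is an assumption):

```java
List<TableServicePipeline> pipelines = tablePaths.stream()
    .map(p -> MultiTableServiceUtils.buildTableServicePipeline(jsc, p, cfg, props))
    .collect(Collectors.toList());
if (pipelines.stream().allMatch(pipeline -> pipeline.getTableServices().isEmpty())) {
  LOG.warn("No table service enabled for any table, exiting early");
  return;
}
```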



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9572: [WIP][HUDI-6702] Remove unnecessary calls of `getInsertValue` api from HoodieRecordPayload

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9572:
URL: https://github.com/apache/hudi/pull/9572#issuecomment-1701933901

   
   ## CI report:
   
   * ad05887b523496f59ac8b6e976183d6c325ed94d UNKNOWN
   * 93813ed1bd85993d5e0674f5ff4e01964338cd49 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19586)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9546: [HUDI-6397] [HUDI-6759] Fixing misc bugs w/ metadata table

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9546:
URL: https://github.com/apache/hudi/pull/9546#issuecomment-1701933763

   
   ## CI report:
   
   * 5472cd308f526d6679eba8682957b36d46679f62 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19585)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9581: [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1701927483

   
   ## CI report:
   
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * e286659cb1e1cb69126b8ec09d4e2a62969ce9d4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19587)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9521: [HUDI-6736] Fixing rollback completion and commit timeline files removal

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9521:
URL: https://github.com/apache/hudi/pull/9521#issuecomment-1701927027

   
   ## CI report:
   
   * c22c23106d356cd295067d1330828384c8bdb902 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19584)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on pull request #8437: [HUDI-6066] HoodieTableSource supports parquet predicate push down

2023-08-31 Thread via GitHub


yihua commented on PR #8437:
URL: https://github.com/apache/hudi/pull/8437#issuecomment-1701917436

   @danny0405 could you help review this again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua merged pull request #9519: [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled

2023-08-31 Thread via GitHub


yihua merged PR #9519:
URL: https://github.com/apache/hudi/pull/9519


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled (#9519)

2023-08-31 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 59f7d2806bf [HUDI-6562] Fixed issue for delete events for 
AWSDmsAvroPayload when CDC enabled (#9519)
59f7d2806bf is described below

commit 59f7d2806bfc2d402dc8f5694dcb9d345e3d5a55
Author: Aditya Goenka <63430370+ad1happy...@users.noreply.github.com>
AuthorDate: Fri Sep 1 04:47:48 2023 +0530

[HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC 
enabled (#9519)

Co-authored-by: Y Ethan Guo 
---
 .../hudi/io/HoodieMergeHandleWithChangeLog.java|  2 +-
 .../functional/cdc/TestCDCDataFrameSuite.scala | 56 +-
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
index d610891c2ca..f8669416f0c 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java
@@ -103,7 +103,7 @@ public class HoodieMergeHandleWithChangeLog<T, I, K, O> 
extends HoodieMergeHandle<T, I, K, O>
 // TODO Remove these unnecessary newInstance invocations
 HoodieRecord savedRecord = newRecord.newInstance();
 super.writeInsertRecord(newRecord);
-if (!HoodieOperation.isDelete(newRecord.getOperation())) {
+if (!HoodieOperation.isDelete(newRecord.getOperation()) && 
!savedRecord.isDelete(schema, config.getPayloadConfig().getProps())) {
   cdcLogger.put(newRecord, null, savedRecord.toIndexedRecord(schema, 
config.getPayloadConfig().getProps()).map(HoodieAvroIndexedRecord::getData));
 }
   }
diff --git 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala
 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala
index 36629687106..aac836d8c3a 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala
@@ -26,7 +26,8 @@ import org.apache.hudi.common.table.cdc.{HoodieCDCOperation, 
HoodieCDCSupplement
 import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient, 
TableSchemaResolver}
 import org.apache.hudi.common.testutils.HoodieTestDataGenerator
 import 
org.apache.hudi.common.testutils.RawTripTestPayload.{deleteRecordsToStrings, 
recordsToStrings}
-import org.apache.spark.sql.SaveMode
+import org.apache.spark.sql.{Row, SaveMode}
+import org.apache.spark.sql.types.{StringType, StructField, StructType}
 import org.junit.jupiter.api.Assertions.{assertEquals, assertFalse, assertTrue}
 import org.junit.jupiter.params.ParameterizedTest
 import org.junit.jupiter.params.provider.{CsvSource, EnumSource}
@@ -634,4 +635,57 @@ class TestCDCDataFrameSuite extends HoodieCDCTestBase {
 val cdcDataOnly2 = cdcDataFrame((commitTime2.toLong - 1).toString)
 assertCDCOpCnt(cdcDataOnly2, insertedCnt2, updatedCnt2, 0)
   }
+
+  @ParameterizedTest
+  @EnumSource(classOf[HoodieCDCSupplementalLoggingMode])
+  def testCDCWithAWSDMSPayload(loggingMode: HoodieCDCSupplementalLoggingMode): 
Unit = {
+val options = Map(
+  "hoodie.table.name" -> "test",
+  "hoodie.datasource.write.recordkey.field" -> "id",
+  "hoodie.datasource.write.precombine.field" -> "replicadmstimestamp",
+  "hoodie.datasource.write.keygenerator.class" -> 
"org.apache.hudi.keygen.NonpartitionedKeyGenerator",
+  "hoodie.datasource.write.partitionpath.field" -> "",
+  "hoodie.datasource.write.payload.class" -> 
"org.apache.hudi.common.model.AWSDmsAvroPayload",
+  "hoodie.table.cdc.enabled" -> "true",
+  "hoodie.table.cdc.supplemental.logging.mode" -> "data_before_after"
+)
+
+val data: Seq[(String, String, String, String)] = Seq(
+  ("1", "I", "2023-06-14 15:46:06.953746", "A"),
+  ("2", "I", "2023-06-14 15:46:07.953746", "B"),
+  ("3", "I", "2023-06-14 15:46:08.953746", "C")
+)
+
+val schema: StructType = StructType(Seq(
+  StructField("id", StringType),
+  StructField("Op", StringType),
+  StructField("replicadmstimestamp", StringType),
+  StructField("code", StringType)
+))
+
+val df = spark.createDataFrame(data.map(Row.fromTuple), schema)
+df.write
+  .format("org.apache.hudi")
+  .option("hoodie.datasource.write.operation", "upsert")
+  .options(options)
+  .mode("append")
+  .save(basePath)
+
+assertEquals(spark.read.format("org.apache.hudi").load(basePath).count(), 
3)
+
+val newData: Seq[(String, String, St
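
For context on the fix itself: AWSDmsAvroPayload signals deletes through the DMS "Op" column ("I"/"U"/"D") rather than through Hudi's internal HoodieOperation, which is why the CDC logger now also asks the payload's isDelete(...) before treating an incoming record as an insert. Below is a minimal sketch of reading the resulting change feed with Spark; it is a hedged illustration, not part of the patch, and `basePath` and the begin instant are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Minimal sketch, assuming Spark with the Hudi bundle on the classpath.
// Reads the CDC change feed of a table like the one created in the test above.
public class ReadCdcFeed {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("cdc-read").getOrCreate();
    String basePath = "/tmp/hudi/test"; // placeholder table path

    Dataset<Row> cdcDf = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.query.incremental.format", "cdc")
        .option("hoodie.datasource.read.begin.instanttime", "0") // from the first commit
        .load(basePath);

    // Each CDC row carries an op column plus before/after record images;
    // with the guard added above, DMS delete events are no longer emitted
    // into the CDC log as spurious inserts.
    cdcDf.show(false);
  }
}
```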

[GitHub] [hudi] yihua commented on a diff in pull request #9519: [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled

2023-08-31 Thread via GitHub


yihua commented on code in PR #9519:
URL: https://github.com/apache/hudi/pull/9519#discussion_r1312375700


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleWithChangeLog.java:
##
@@ -103,7 +103,7 @@ protected void writeInsertRecord(HoodieRecord newRecord) throws IOException {
     // TODO Remove these unnecessary newInstance invocations
     HoodieRecord savedRecord = newRecord.newInstance();
     super.writeInsertRecord(newRecord);
-    if (!HoodieOperation.isDelete(newRecord.getOperation())) {
+    if (!HoodieOperation.isDelete(newRecord.getOperation()) && !savedRecord.isDelete(schema, config.getPayloadConfig().getProps())) {

Review Comment:
   I think we should (i.e., add an `else` block to handle the deletes). However, it's a different issue we need to tackle. I'll follow up.
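
A rough sketch of what that follow-up could look like, assuming the CDC logger accepts an empty after-image for deletes; the `else` branch below is hypothetical and is not part of the merged change:

```java
if (!HoodieOperation.isDelete(newRecord.getOperation())
    && !savedRecord.isDelete(schema, config.getPayloadConfig().getProps())) {
  cdcLogger.put(newRecord, null,
      savedRecord.toIndexedRecord(schema, config.getPayloadConfig().getProps())
          .map(HoodieAvroIndexedRecord::getData));
} else {
  // Hypothetical follow-up: emit the event as a CDC delete with an empty
  // after-image so downstream readers observe the removal instead of nothing.
  cdcLogger.put(newRecord, null, Option.empty());
}
```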



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9519: [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9519:
URL: https://github.com/apache/hudi/pull/9519#issuecomment-1701899670

   
   ## CI report:
   
   * c727303e24756595101e6b8319a250a6476aa012 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on pull request #9519: [HUDI-6562] Fixed issue for delete events for AWSDmsAvroPayload when CDC enabled

2023-08-31 Thread via GitHub


yihua commented on PR #9519:
URL: https://github.com/apache/hudi/pull/9519#issuecomment-1701899052

   CI is green
   (screenshot: https://github.com/apache/hudi/assets/2497195/a1a6470d-f015-4687-a9b7-a2e01116b28e)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9571: Enabling comprehensive schema evolution in delta streamer code

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9571:
URL: https://github.com/apache/hudi/pull/9571#issuecomment-1701887365

   
   ## CI report:
   
   * 3af6011d72b294b0995d52be40a6d91e6eff9a1b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19561)
 
   * 871ff24da9c3800b8f19bdabda140621549aaf3b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19588)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9585: [HUDI-6809] Optimizing the judgment of generating clustering plans

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9585:
URL: https://github.com/apache/hudi/pull/9585#issuecomment-1701850344

   
   ## CI report:
   
   * 9a2675de94095d2baac571a6dd71ec368b8a9e8c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19582)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9571: Enabling comprehensive schema evolution in delta streamer code

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9571:
URL: https://github.com/apache/hudi/pull/9571#issuecomment-1701850224

   
   ## CI report:
   
   * 3af6011d72b294b0995d52be40a6d91e6eff9a1b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19561)
 
   * 871ff24da9c3800b8f19bdabda140621549aaf3b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9581: [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1701842618

   
   ## CI report:
   
   * 1208189ffb60441f9544933a2446ad194509c391 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19565)
 
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * e286659cb1e1cb69126b8ec09d4e2a62969ce9d4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19587)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9581: [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1701802062

   
   ## CI report:
   
   * 1208189ffb60441f9544933a2446ad194509c391 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19565)
 
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   * e286659cb1e1cb69126b8ec09d4e2a62969ce9d4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9581: [HUDI-6795] Implement writing record_positions to log blocks for updates and deletes

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9581:
URL: https://github.com/apache/hudi/pull/9581#issuecomment-1701789789

   
   ## CI report:
   
   * 1208189ffb60441f9544933a2446ad194509c391 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19565)
 
   * 50e495ed1223eaf19ec6f0fd1f00ed13bb3c487f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9572: [WIP][HUDI-6702] Remove unnecessary calls of `getInsertValue` api from HoodieRecordPayload

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9572:
URL: https://github.com/apache/hudi/pull/9572#issuecomment-1701789643

   
   ## CI report:
   
   * ad05887b523496f59ac8b6e976183d6c325ed94d UNKNOWN
   * cf848446b9c837be3c1c2fdc7930b26f920a0754 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19563)
 
   * 93813ed1bd85993d5e0674f5ff4e01964338cd49 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19586)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9592: automatically create a database when using the flink catalog dfs mode

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9592:
URL: https://github.com/apache/hudi/pull/9592#issuecomment-1701777353

   
   ## CI report:
   
   * c961be19038e5600f418ef660b7ede740cef76c6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19581)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9572: [WIP][HUDI-6702] Remove unnecessary calls of `getInsertValue` api from HoodieRecordPayload

2023-08-31 Thread via GitHub


hudi-bot commented on PR #9572:
URL: https://github.com/apache/hudi/pull/9572#issuecomment-1701777190

   
   ## CI report:
   
   * ad05887b523496f59ac8b6e976183d6c325ed94d UNKNOWN
   * cf848446b9c837be3c1c2fdc7930b26f920a0754 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19563)
 
   * 93813ed1bd85993d5e0674f5ff4e01964338cd49 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6795) Implement generation of record_positions for updates and deletes on write path

2023-08-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6795:

Status: Patch Available  (was: In Progress)

> Implement generation of record_positions for updates and deletes on write path
> --
>
> Key: HUDI-6795
> URL: https://issues.apache.org/jira/browse/HUDI-6795
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6785) Introduce an engine-agnostic FileGroupReader for snapshot read

2023-08-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6785:

Status: In Progress  (was: Open)

> Introduce an engine-agnostic FileGroupReader for snapshot read
> --
>
> Key: HUDI-6785
> URL: https://issues.apache.org/jira/browse/HUDI-6785
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Priority: Blocker
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5463) Apply rollback commits from data table as rollbacks in MDT instead of Delta commit

2023-08-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-5463:
---

Assignee: (was: sivabalan narayanan)

> Apply rollback commits from data table as rollbacks in MDT instead of Delta 
> commit
> --
>
> Key: HUDI-5463
> URL: https://issues.apache.org/jira/browse/HUDI-5463
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.14.0
>
>
> As of now, any rollback in DT is another DC in MDT. This may not scale for 
> the record level index in MDT, since we would have to add 1000s of delete 
> records and finally resolve all valid and invalid records. So, it's better 
> to roll back the commit in MDT as well instead of doing a DC. 
>  
> Impact: 
> The record level index is unusable w/o this change. While fixing other 
> rollback-related tickets, do consider this as a possible option if it 
> simplifies other fixes. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6795) Implement generation of record_positions for updates and deletes on write path

2023-08-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6795:

Reviewers: sivabalan narayanan

> Implement generation of record_positions for updates and deletes on write path
> --
>
> Key: HUDI-6795
> URL: https://issues.apache.org/jira/browse/HUDI-6795
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] linliu-code commented on pull request #9572: [WIP][HUDI-6702] Remove unnecessary calls of `getInsertValue` api from HoodieRecordPayload

2023-08-31 Thread via GitHub


linliu-code commented on PR #9572:
URL: https://github.com/apache/hudi/pull/9572#issuecomment-1701748977

   @yihua @danny0405 please comment! Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


