[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.

2023-07-18 Thread via GitHub


hudi-bot commented on PR #8837:
URL: https://github.com/apache/hudi/pull/8837#issuecomment-1641508668

   
   ## CI report:

   * fa5c3f22ad50c6bdf4cf8fa04f51ecfba1cd8905 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18677)

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] amrishlal commented on a diff in pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


amrishlal commented on code in PR #9203:
URL: https://github.com/apache/hudi/pull/9203#discussion_r1267608249


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTableWithNonRecordKeyField.scala:
##
@@ -22,122 +22,128 @@ import org.apache.hudi.{HoodieSparkUtils, ScalaAssertionSupport}
 class TestMergeIntoTableWithNonRecordKeyField extends HoodieSparkSqlTestBase with ScalaAssertionSupport {
 
   test("Test Merge into extra cond") {
-withTempDir { tmp =>
-  val tableName = generateTableName
-  spark.sql(
-s"""
-   |create table $tableName (
-   |  id int,
-   |  name string,
-   |  price double,
-   |  ts long
-   |) using hudi
-   | location '${tmp.getCanonicalPath}/$tableName'
-   | tblproperties (
-   |  primaryKey ='id',
-   |  preCombineField = 'ts'
-   | )
+Seq(true, false).foreach { optimizedSqlEnabled =>
+  withTempDir { tmp =>
+val tableName = generateTableName
+spark.sql(
+  s"""
+ |create table $tableName (
+ |  id int,
+ |  name string,
+ |  price double,
+ |  ts long
+ |) using hudi
+ | location '${tmp.getCanonicalPath}/$tableName'
+ | tblproperties (
+ |  primaryKey ='id',
+ |  preCombineField = 'ts'
+ | )
""".stripMargin)
-  val tableName2 = generateTableName
-  spark.sql(
-s"""
-   |create table $tableName2 (
-   |  id int,
-   |  name string,
-   |  price double,
-   |  ts long
-   |) using hudi
-   | location '${tmp.getCanonicalPath}/$tableName2'
-   | tblproperties (
-   |  primaryKey ='id',
-   |  preCombineField = 'ts'
-   | )
+val tableName2 = generateTableName
+spark.sql(
+  s"""
+ |create table $tableName2 (
+ |  id int,
+ |  name string,
+ |  price double,
+ |  ts long
+ |) using hudi
+ | location '${tmp.getCanonicalPath}/$tableName2'
+ | tblproperties (
+ |  primaryKey ='id',
+ |  preCombineField = 'ts'
+ | )
""".stripMargin)
 
-  spark.sql(
-s"""
-   |insert into $tableName values
-   |(1, 'a1', 10, 100),
-   |(2, 'a2', 20, 200),
-   |(3, 'a3', 20, 100)
-   |""".stripMargin)
-  spark.sql(
-s"""
-   |insert into $tableName2 values
-   |(1, 'u1', 10, 999),
-   |(3, 'u3', 30, ),
-   |(4, 'u4', 40, 9)
-   |""".stripMargin)
+spark.sql(
+  s"""
+ |insert into $tableName values
+ |(1, 'a1', 10, 100),
+ |(2, 'a2', 20, 200),
+ |(3, 'a3', 20, 100)
+ |""".stripMargin)
+spark.sql(
+  s"""
+ |insert into $tableName2 values
+ |(1, 'u1', 10, 999),
+ |(3, 'u3', 30, ),
+ |(4, 'u4', 40, 9)
+ |""".stripMargin)
 
-  spark.sql(
-s"""
-   |merge into $tableName as oldData
-   |using $tableName2
-   |on oldData.id = $tableName2.id
-   |when matched and oldData.price = $tableName2.price then update set oldData.name = $tableName2.name
-   |
-   |""".stripMargin)
+// test with optimized sql merge enabled / disabled.
+spark.sql(s"set 
hoodie.spark.sql.optimized.merge.enable=$optimizedSqlEnabled")
 
-  checkAnswer(s"select id, name, price, ts from $tableName")(
-Seq(1, "u1", 10.0, 100),
-Seq(3, "a3", 20.0, 100),
-Seq(2, "a2", 20.0, 200)
-  )
+spark.sql(
+  s"""
+ |merge into $tableName as oldData
+ |using $tableName2
+ |on oldData.id = $tableName2.id
+ |when matched and oldData.price = $tableName2.price then update set oldData.name = $tableName2.name
+ |
+ |""".stripMargin)
 
-  val errorMessage = if (HoodieSparkUtils.gteqSpark3_1) {
-"Only simple conditions of the form `t.id = s.id` using primary key or partition path " +
-  "columns are allowed on tables with primary key. (illegal column(s) used: `price`"
-  } else {
-"Only simple conditions of the form `t.id = s.id` using primary key or partition path " +
-  "columns are allowed on tables with primary key. (illegal column(s) used: `price`;"
-  }
+checkAnswer(s"select id, name, price, ts from $tableName")(
+  Seq(1, "u1", 10.0, 100),
+  Seq(3, "a3", 20.0, 100),
+  Seq(2, "a2", 20.0, 200)
+)
 
-  checkException(
-s"""
-   |merge into $tableName as 

[GitHub] [hudi] amrishlal commented on a diff in pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


amrishlal commented on code in PR #9203:
URL: https://github.com/apache/hudi/pull/9203#discussion_r1267607979


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -642,11 +642,11 @@ object DataSourceWriteOptions {
   val DROP_PARTITION_COLUMNS: ConfigProperty[java.lang.Boolean] = HoodieTableConfig.DROP_PARTITION_COLUMNS
 
   val ENABLE_OPTIMIZED_SQL_WRITES: ConfigProperty[String] = ConfigProperty
-.key("hoodie.spark.sql.writes.optimized.enable")
+.key("hoodie.spark.sql.optimized.writes.enable")

Review Comment:
   Fixed.



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##
@@ -146,9 +144,7 @@ class DefaultSource extends RelationProvider
   mode: SaveMode,
   optParams: Map[String, String],
   rawDf: DataFrame): BaseRelation = {
-val df = if (optParams.getOrDefault(DATASOURCE_WRITE_PREPPED_KEY,
-  optParams.getOrDefault(SQL_MERGE_INTO_WRITES.key(), SQL_MERGE_INTO_WRITES.defaultValue().toString))
-  .equalsIgnoreCase("true")) {
+val df = if (optParams.getOrDefault(DATASOURCE_WRITE_PREPPED_KEY, "false").toBoolean || optParams.getOrDefault(WRITE_PREPPED_MERGE_KEY, "false").toBoolean) {

Review Comment:
   Fixed.






[GitHub] [hudi] amrishlal commented on a diff in pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


amrishlal commented on code in PR #9203:
URL: https://github.com/apache/hudi/pull/9203#discussion_r1267607758


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -721,6 +721,8 @@ public class HoodieWriteConfig extends HoodieConfig {
   + "The class must be a subclass of 
`org.apache.hudi.callback.HoodieClientInitCallback`."
   + "By default, no Hudi client init callback is executed.");
 
+  public static final String WRITE_PREPPED_MERGE_KEY = 
"_hoodie.datasource.merge.into.prepped";

Review Comment:
   Fixed.






[GitHub] [hudi] hudi-bot commented on pull request #8452: [HUDI-6077] Add more partition push down filters

2023-07-18 Thread via GitHub


hudi-bot commented on PR #8452:
URL: https://github.com/apache/hudi/pull/8452#issuecomment-1641498996

   
   ## CI report:

   * 8082df232089396b2a9f9be2b915e51b3645f172 UNKNOWN
   * 66d853918fe311dbc1d889297aab5277833b5c3b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18651) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18654) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18658) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18680)

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9203:
URL: https://github.com/apache/hudi/pull/9203#issuecomment-1641493001

   
   ## CI report:

   * 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18669)
   * 539cad2f3b8edde2211bd8ddeeb3feec15cd6e94 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18679)

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] boneanxs commented on pull request #8452: [HUDI-6077] Add more partition push down filters

2023-07-18 Thread via GitHub


boneanxs commented on PR #8452:
URL: https://github.com/apache/hudi/pull/8452#issuecomment-1641477916

   @hudi-bot run azure





[jira] [Created] (HUDI-6561) Ensure there is no data duplication with spark streaming writes

2023-07-18 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-6561:
-

 Summary: Ensure there is no data duplication with spark streaming writes
 Key: HUDI-6561
 URL: https://issues.apache.org/jira/browse/HUDI-6561
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark
Reporter: sivabalan narayanan


With Spark Streaming writes, we can use the batch ID to tell a genuinely new first batch apart from an existing batch that resumed after a long pause.

We should guarantee idempotency by deducing the batch ID; a hedged sketch follows.
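A minimal sketch of that idempotency check. The metadata key and the helper are illustrative assumptions, not existing Hudi APIs:

```java
import java.util.Map;

// Hypothetical sketch: compare the incoming micro-batch id with the batch id
// recorded in the last successful commit's extra metadata. The key name
// "spark.streaming.batchId" is an assumption for illustration.
class StreamingBatchDedup {
  static boolean isDuplicateBatch(long incomingBatchId, Map<String, String> lastCommitExtraMetadata) {
    long lastBatchId = Long.parseLong(lastCommitExtraMetadata.getOrDefault("spark.streaming.batchId", "-1"));
    return incomingBatchId <= lastBatchId; // replayed batch: skip the write to stay idempotent
  }
}
```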





[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9203:
URL: https://github.com/apache/hudi/pull/9203#issuecomment-1641450952

   
   ## CI report:

   * 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18669)
   * 539cad2f3b8edde2211bd8ddeeb3feec15cd6e94 UNKNOWN

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9226: [HUDI-6352] take actual commit time (StateTransitionTime) into consid…

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9226:
URL: https://github.com/apache/hudi/pull/9226#issuecomment-1641451034

   
   ## CI report:

   * c74087e82eb4bec52b33a07679d2ecbc3aba43c9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18676)

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] codope commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-07-18 Thread via GitHub


codope commented on code in PR #9223:
URL: https://github.com/apache/hudi/pull/9223#discussion_r1267505437


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##
@@ -405,6 +405,9 @@ private boolean initializeFromFilesystem(String initializationTime, List
convertFilesToBloomFilterRecords(HoodieEn
                                 Map<String, Map<String, Long>> partitionToAppendedFiles,
                                 MetadataRecordsGenerationParams recordsGenerationParams,
                                 String instantTime) {
-    HoodieData<HoodieRecord> allRecordsRDD = engineContext.emptyHoodieData();
-
-    List<Pair<String, List<String>>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet()
-        .stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList());
-    int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1);
-    HoodieData<Pair<String, List<String>>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism);
-
-    HoodieData<HoodieRecord> deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> {
-      final String partitionName = partitionToDeletedFilesPair.getLeft();
-      final List<String> deletedFileList = partitionToDeletedFilesPair.getRight();
-      return deletedFileList.stream().flatMap(deletedFile -> {
-        if (!FSUtils.isBaseFile(new Path(deletedFile))) {
-          return Stream.empty();
-        }
-
-        final String partition = getPartitionIdentifier(partitionName);
-        return Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord(
-            partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true));
-      }).iterator();
-    });
-    allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
+    // Total number of files which are added or deleted
+    final int totalFiles = partitionToDeletedFiles.values().stream().mapToInt(List::size).sum()
+        + partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum();
+
+    // Create the tuple (partition, filename, isDeleted) to handle both deletes and appends
+    final List<Tuple3<String, String, Boolean>> partitionFileFlagTupleList = new ArrayList<>(totalFiles);
+    partitionToDeletedFiles.entrySet().stream()
+        .flatMap(entry -> entry.getValue().stream().map(deletedFile -> new Tuple3<>(entry.getKey(), deletedFile, true)))
+        .collect(Collectors.toCollection(() -> partitionFileFlagTupleList));
+    partitionToAppendedFiles.entrySet().stream()
+        .flatMap(entry -> entry.getValue().keySet().stream().map(addedFile -> new Tuple3<>(entry.getKey(), addedFile, false)))
+        .collect(Collectors.toCollection(() -> partitionFileFlagTupleList));

Review Comment:
   We can probably extract this tuple creation code into a separate method; it looks repetitive across both the bloom filter and column stats paths. A hypothetical extraction is sketched below.
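A hypothetical sketch of that extraction (method and variable names are illustrative, not from the PR; Tuple3 is the triple type the PR's diff already uses):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Builds the (partition, fileName, isDeleted) tuples once so both the bloom
// filter and the column stats record generation can share them.
class MetadataFileTuples {
  static List<Tuple3<String, String, Boolean>> toPartitionFileFlagTuples(
      Map<String, List<String>> partitionToDeletedFiles,
      Map<String, Map<String, Long>> partitionToAppendedFiles) {
    int totalFiles = partitionToDeletedFiles.values().stream().mapToInt(List::size).sum()
        + partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum();
    List<Tuple3<String, String, Boolean>> tuples = new ArrayList<>(totalFiles);
    partitionToDeletedFiles.forEach((partition, deleted) ->
        deleted.forEach(file -> tuples.add(new Tuple3<>(partition, file, true))));
    partitionToAppendedFiles.forEach((partition, appended) ->
        appended.keySet().forEach(file -> tuples.add(new Tuple3<>(partition, file, false))));
    return tuples;
  }
}
```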



##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -915,65 +903,60 @@ public static HoodieData<HoodieRecord> convertFilesToColumnStatsRecords(HoodieEn
                                 Map<String, List<String>> partitionToDeletedFiles,
                                 Map<String, Map<String, Long>> partitionToAppendedFiles,
                                 MetadataRecordsGenerationParams recordsGenerationParams) {
-    HoodieData<HoodieRecord> allRecordsRDD = engineContext.emptyHoodieData();
+    // Find the columns to index
     HoodieTableMetaClient dataTableMetaClient = recordsGenerationParams.getDataMetaClient();
-
     final List<String> columnsToIndex =
         getColumnsToIndex(recordsGenerationParams,
             Lazy.lazily(() -> tryResolveSchemaForTable(dataTableMetaClient)));
-
     if (columnsToIndex.isEmpty()) {
       // In case there are no columns to index, bail
       return engineContext.emptyHoodieData();
     }
 
-    final List<Pair<String, List<String>>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet().stream()
-        .map(e -> Pair.of(e.getKey(), e.getValue()))
-        .collect(Collectors.toList());
-
-    int deletedFilesTargetParallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getColumnStatsIndexParallelism()), 1);
-    final HoodieData<Pair<String, List<String>>> partitionToDeletedFilesRDD =
-        engineContext.parallelize(partitionToDeletedFilesList, deletedFilesTargetParallelism);
-
-    HoodieData<HoodieRecord> deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> {
-      final String partitionPath = partitionToDeletedFilesPair.getLeft();
-      final String partitionId = getPartitionIdentifier(partitionPath);
-      final List<String> deletedFileList = partitionToDeletedFilesPair.getRight();
-
-      return deletedFileList.stream().flatMa

[GitHub] [hudi] nsivabalan commented on a diff in pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


nsivabalan commented on code in PR #9203:
URL: https://github.com/apache/hudi/pull/9203#discussion_r1267546687


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -642,11 +642,11 @@ object DataSourceWriteOptions {
   val DROP_PARTITION_COLUMNS: ConfigProperty[java.lang.Boolean] = HoodieTableConfig.DROP_PARTITION_COLUMNS
 
   val ENABLE_OPTIMIZED_SQL_WRITES: ConfigProperty[String] = ConfigProperty
-.key("hoodie.spark.sql.writes.optimized.enable")
+.key("hoodie.spark.sql.optimized.writes.enable")

Review Comment:
   Generally we try to align the variable naming to the key. Let's name the variable SPARK_SQL_OPTIMIZED_WRITES; a hypothetical sketch of the rename follows.
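A sketch of that rename, assuming only the variable name changes (the real definition is a Scala val in DataSourceOptions.scala; the builder calls simply mirror the diff above):

```java
import org.apache.hudi.common.config.ConfigProperty;

// Sketch only: variable renamed to match the key, as the review suggests.
class DataSourceWriteOptionsSketch {
  public static final ConfigProperty<String> SPARK_SQL_OPTIMIZED_WRITES = ConfigProperty
      .key("hoodie.spark.sql.optimized.writes.enable")
      .defaultValue("true")
      .markAdvanced()
      .sinceVersion("0.14.0")
      .withDocumentation("Controls whether spark sql prepped update and delete is enabled.");
}
```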
   



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -721,6 +721,8 @@ public class HoodieWriteConfig extends HoodieConfig {
   + "The class must be a subclass of 
`org.apache.hudi.callback.HoodieClientInitCallback`."
   + "By default, no Hudi client init callback is executed.");
 
+  public static final String WRITE_PREPPED_MERGE_KEY = 
"_hoodie.datasource.merge.into.prepped";

Review Comment:
   Can we add Javadoc calling out the purpose of this? A hypothetical sketch is below.
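A hypothetical sketch of such a Javadoc; the wording is an assumption about the flag's purpose, not text from the PR:

```java
// Hypothetical Javadoc for the internal flag.
class HoodieWriteConfigSketch {
  /**
   * Internal-only config (leading underscore, so it is not surfaced to users):
   * set by the Spark SQL MERGE INTO path to signal that incoming records are
   * already prepped, so the writer can take the prepped write path.
   */
  public static final String WRITE_PREPPED_MERGE_KEY = "_hoodie.datasource.merge.into.prepped";
}
```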



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTableWithNonRecordKeyField.scala:
##
@@ -22,122 +22,128 @@ import org.apache.hudi.{HoodieSparkUtils, ScalaAssertionSupport}
 class TestMergeIntoTableWithNonRecordKeyField extends HoodieSparkSqlTestBase with ScalaAssertionSupport {
 
   test("Test Merge into extra cond") {
-withTempDir { tmp =>
-  val tableName = generateTableName
-  spark.sql(
-s"""
-   |create table $tableName (
-   |  id int,
-   |  name string,
-   |  price double,
-   |  ts long
-   |) using hudi
-   | location '${tmp.getCanonicalPath}/$tableName'
-   | tblproperties (
-   |  primaryKey ='id',
-   |  preCombineField = 'ts'
-   | )
+Seq(true, false).foreach { optimizedSqlEnabled =>
+  withTempDir { tmp =>
+val tableName = generateTableName
+spark.sql(
+  s"""
+ |create table $tableName (
+ |  id int,
+ |  name string,
+ |  price double,
+ |  ts long
+ |) using hudi
+ | location '${tmp.getCanonicalPath}/$tableName'
+ | tblproperties (
+ |  primaryKey ='id',
+ |  preCombineField = 'ts'
+ | )
""".stripMargin)
-  val tableName2 = generateTableName
-  spark.sql(
-s"""
-   |create table $tableName2 (
-   |  id int,
-   |  name string,
-   |  price double,
-   |  ts long
-   |) using hudi
-   | location '${tmp.getCanonicalPath}/$tableName2'
-   | tblproperties (
-   |  primaryKey ='id',
-   |  preCombineField = 'ts'
-   | )
+val tableName2 = generateTableName
+spark.sql(
+  s"""
+ |create table $tableName2 (
+ |  id int,
+ |  name string,
+ |  price double,
+ |  ts long
+ |) using hudi
+ | location '${tmp.getCanonicalPath}/$tableName2'
+ | tblproperties (
+ |  primaryKey ='id',
+ |  preCombineField = 'ts'
+ | )
""".stripMargin)
 
-  spark.sql(
-s"""
-   |insert into $tableName values
-   |(1, 'a1', 10, 100),
-   |(2, 'a2', 20, 200),
-   |(3, 'a3', 20, 100)
-   |""".stripMargin)
-  spark.sql(
-s"""
-   |insert into $tableName2 values
-   |(1, 'u1', 10, 999),
-   |(3, 'u3', 30, ),
-   |(4, 'u4', 40, 9)
-   |""".stripMargin)
+spark.sql(
+  s"""
+ |insert into $tableName values
+ |(1, 'a1', 10, 100),
+ |(2, 'a2', 20, 200),
+ |(3, 'a3', 20, 100)
+ |""".stripMargin)
+spark.sql(
+  s"""
+ |insert into $tableName2 values
+ |(1, 'u1', 10, 999),
+ |(3, 'u3', 30, ),
+ |(4, 'u4', 40, 9)
+ |""".stripMargin)
 
-  spark.sql(
-s"""
-   |merge into $tableName as oldData
-   |using $tableName2
-   |on oldData.id = $tableName2.id
-   |when matched and oldData.price = $tableName2.price then update set oldData.name = $tableName2.name
-   |
-   |""".stripMargin)
+// test with optimized sql merge enabled / disabled.
+spark.sql(s"set 
hoodie.spark.sql.optimized.merge.enable=$optimizedSqlEnabled")
 
-  checkAnswer(s"select id, name, price, ts from $tableName")(
-Seq(1, "u1", 10.0, 100),
-

[GitHub] [hudi] amrishlal commented on a diff in pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


amrishlal commented on code in PR #9203:
URL: https://github.com/apache/hudi/pull/9203#discussion_r1267531685


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -721,6 +721,8 @@ public class HoodieWriteConfig extends HoodieConfig {
   + "The class must be a subclass of 
`org.apache.hudi.callback.HoodieClientInitCallback`."
   + "By default, no Hudi client init callback is executed.");
 
+  public static final String WRITE_PREPPED_MERGE_KEY = 
"_hoodie.datasource.merge.prepped";
+

Review Comment:
   Fixed.



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -642,11 +642,18 @@ object DataSourceWriteOptions {
   val DROP_PARTITION_COLUMNS: ConfigProperty[java.lang.Boolean] = HoodieTableConfig.DROP_PARTITION_COLUMNS
 
   val ENABLE_OPTIMIZED_SQL_WRITES: ConfigProperty[String] = ConfigProperty
-.key("hoodie.spark.sql.writes.optimized.enable")
+.key("hoodie.spark.sql.optimized.writes.enable")
 .defaultValue("true")
 .markAdvanced()
 .sinceVersion("0.14.0")
-.withDocumentation("Controls whether spark sql optimized update is enabled.")
+.withDocumentation("Controls whether spark sql prepped update and delete is enabled.")
+
+  val ENABLE_OPTIMIZED_SQL_MERGE_WRITES: ConfigProperty[String] = ConfigProperty

Review Comment:
   Fixed.






[GitHub] [hudi] hudi-bot commented on pull request #9227: [HUDI-6560] Avoid to read instant details 2 times for archiving

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9227:
URL: https://github.com/apache/hudi/pull/9227#issuecomment-1641371506

   
   ## CI report:

   * 1c756c1c634bb1db2bdcbd7eca3a045f4ea99a5b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18678)

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.

2023-07-18 Thread via GitHub


hudi-bot commented on PR #8837:
URL: https://github.com/apache/hudi/pull/8837#issuecomment-1641371072

   
   ## CI report:

   * 2f9aa542076faa188839bc55b43dd7f22ec32b62 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18580)
   * fa5c3f22ad50c6bdf4cf8fa04f51ecfba1cd8905 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18677)

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9227: [HUDI-6560] Avoid to read instant details 2 times for archiving

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9227:
URL: https://github.com/apache/hudi/pull/9227#issuecomment-1641367288

   
   ## CI report:

   * 1c756c1c634bb1db2bdcbd7eca3a045f4ea99a5b UNKNOWN

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.

2023-07-18 Thread via GitHub


hudi-bot commented on PR #8837:
URL: https://github.com/apache/hudi/pull/8837#issuecomment-1641366845

   
   ## CI report:

   * 2f9aa542076faa188839bc55b43dd7f22ec32b62 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18580)
   * fa5c3f22ad50c6bdf4cf8fa04f51ecfba1cd8905 UNKNOWN

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-6560) Avoid to read instant details 2 times for archiving

2023-07-18 Thread ASF GitHub Bot (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-6560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6560:
-
Labels: pull-request-available  (was: )

> Avoid to read instant details 2 times for archiving
> ---
>
> Key: HUDI-6560
> URL: https://issues.apache.org/jira/browse/HUDI-6560
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>






[GitHub] [hudi] danny0405 opened a new pull request, #9227: [HUDI-6560] Avoid to read instant details 2 times for archiving

2023-07-18 Thread via GitHub


danny0405 opened a new pull request, #9227:
URL: https://github.com/apache/hudi/pull/9227

   ### Change Logs
   
   1. Only load the instant details once for each instant (see the sketch below).
   2. Do not store the plan for inflight instants, such as compaction, log_compaction, clustering, etc.
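A minimal sketch of item 1 (illustrative, not the PR's code): read the details once and hand the same bytes to every consumer during archiving.

```java
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.util.Option;

// One timeline read per instant; callers share the returned bytes instead of
// calling getInstantDetails a second time.
class InstantDetailsOnce {
  static Option<byte[]> load(HoodieTableMetaClient metaClient, HoodieInstant instant) {
    return metaClient.getActiveTimeline().getInstantDetails(instant);
  }
}
```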
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] nsivabalan commented on a diff in pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


nsivabalan commented on code in PR #9203:
URL: https://github.com/apache/hudi/pull/9203#discussion_r1267481558


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -721,6 +721,8 @@ public class HoodieWriteConfig extends HoodieConfig {
   + "The class must be a subclass of 
`org.apache.hudi.callback.HoodieClientInitCallback`."
   + "By default, no Hudi client init callback is executed.");
 
+  public static final String WRITE_PREPPED_MERGE_KEY = 
"_hoodie.datasource.merge.prepped";
+

Review Comment:
   I am also thinking, from a user standpoint we should have just one config to enable or disable the optimized flow (irrespective of whether it's MERGE INTO (MIT) or updates or deletes), but internally we can use different configs if we wish to differentiate MIT from the rest. A hypothetical sketch is below.
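A hypothetical sketch of that split; only the two key strings come from this PR, the wiring is an illustrative assumption:

```java
import java.util.Map;

// One public flag drives the whole optimized flow; an internal-only key still
// lets the code distinguish MERGE INTO from UPDATE/DELETE.
class OptimizedWriteFlags {
  static void apply(Map<String, String> opts, boolean isMergeInto) {
    boolean optimized = Boolean.parseBoolean(opts.getOrDefault("hoodie.spark.sql.optimized.writes.enable", "true"));
    if (optimized && isMergeInto) {
      opts.put("_hoodie.datasource.merge.into.prepped", "true"); // internal key, not user-facing
    }
  }
}
```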
   



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##
@@ -642,11 +642,18 @@ object DataSourceWriteOptions {
   val DROP_PARTITION_COLUMNS: ConfigProperty[java.lang.Boolean] = HoodieTableConfig.DROP_PARTITION_COLUMNS
 
   val ENABLE_OPTIMIZED_SQL_WRITES: ConfigProperty[String] = ConfigProperty
-.key("hoodie.spark.sql.writes.optimized.enable")
+.key("hoodie.spark.sql.optimized.writes.enable")
 .defaultValue("true")
 .markAdvanced()
 .sinceVersion("0.14.0")
-.withDocumentation("Controls whether spark sql optimized update is enabled.")
+.withDocumentation("Controls whether spark sql prepped update and delete is enabled.")
+
+  val ENABLE_OPTIMIZED_SQL_MERGE_WRITES: ConfigProperty[String] = ConfigProperty

Review Comment:
   I am also thinking, from a user standpoint we should have just one config to enable or disable the optimized flow (irrespective of whether it's MERGE INTO (MIT) or updates or deletes), but internally we can use different configs if we wish to differentiate MIT from the rest.






[GitHub] [hudi] hudi-bot commented on pull request #9226: [HUDI-6352] take actual commit time (StateTransitionTime) into consid…

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9226:
URL: https://github.com/apache/hudi/pull/9226#issuecomment-1641303819

   
   ## CI report:

   * c74087e82eb4bec52b33a07679d2ecbc3aba43c9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18676)

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] KnightChess commented on pull request #8856: [HUDI-6300] Fix file size parallelism not work when init metadata table

2023-07-18 Thread via GitHub


KnightChess commented on PR #8856:
URL: https://github.com/apache/hudi/pull/8856#issuecomment-1641296291

   @yihua thanks for the review





[GitHub] [hudi] hudi-bot commented on pull request #9226: [HUDI-6352] take actual commit time (StateTransitionTime) into consid…

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9226:
URL: https://github.com/apache/hudi/pull/9226#issuecomment-1641295713

   
   ## CI report:

   * c74087e82eb4bec52b33a07679d2ecbc3aba43c9 UNKNOWN

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] KnightChess commented on pull request #8856: [HUDI-6300] Fix file size parallelism not work when init metadata table

2023-07-18 Thread via GitHub


KnightChess commented on PR #8856:
URL: https://github.com/apache/hudi/pull/8856#issuecomment-1641295662

   > @KnightChess do you have any number on the performance improvement on updating MDT from this PR?
   
   The parallelism is computed as:
   ```java
   parallelism = Math.max(Math.min(partitionToAppendedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1);
   ```
   @yihua in this picture, the total file count is more than 5000 within a single partition. Before the fix, the bloom filter and column stats parallelism was 1; now the bloom filter parallelism is 200 and the column stats parallelism is 10, which come from the default values. A worked example follows the screenshot.
   
![image](https://github.com/apache/hudi/assets/20125927/ff4cb9e4-d595-4294-83e7-cb42c73c40ff)
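A worked instance of that formula, as a sketch; the 200 and 10 defaults are the values quoted in the comment above:

```java
// Plain arithmetic, illustrative only.
class ParallelismExample {
  public static void main(String[] args) {
    // Before the fix the list was keyed by partition: one partition -> parallelism 1.
    int before = Math.max(Math.min(1, 200), 1);          // = 1
    // After the fix the list is keyed by file: >5000 files hit the defaults.
    int bloomAfter = Math.max(Math.min(5000, 200), 1);   // = 200 (bloom filter default)
    int colStatsAfter = Math.max(Math.min(5000, 10), 1); // = 10 (column stats default)
    System.out.println(before + " " + bloomAfter + " " + colStatsAfter);
  }
}
```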
   





[jira] [Created] (HUDI-6560) Avoid to read instant details 2 times for archiving

2023-07-18 Thread Danny Chen (Jira)
Danny Chen created HUDI-6560:


 Summary: Avoid to read instant details 2 times for archiving
 Key: HUDI-6560
 URL: https://issues.apache.org/jira/browse/HUDI-6560
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: Danny Chen
 Fix For: 0.14.0








[jira] [Updated] (HUDI-6352) KEEP_LATEST_BY_HOURS should consider modified time instead of commit time while setting earliestCommitToRetain value

2023-07-18 Thread ASF GitHub Bot (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6352:
-
Labels: pull-request-available  (was: )

> KEEP_LATEST_BY_HOURS should consider modified time instead of commit time 
> while setting earliestCommitToRetain value
> 
>
> Key: HUDI-6352
> URL: https://issues.apache.org/jira/browse/HUDI-6352
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Surya Prasanna Yalla
>Priority: Major
>  Labels: pull-request-available
>
> In CleanPlanner, KEEP_LATEST_BY_HOURS sets the earliestCommitToRetain value
> by considering the instant timestamp directly. This introduces a bug if there
> are out-of-order commits, where a commit with a lower timestamp completes much
> later than commits with higher timestamps.
> This policy's implementation needs to be revisited.
> It should basically store the timestamp up to which it has cleaned; let this be
> t1. The next cleaner instant should consider all the partitions and files that
> were modified from t1 until (current time - x hours). Whichever files are no
> longer valid should be removed (a sketch of this window follows).
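A hypothetical sketch of the window the ticket describes (names are illustrative):

```java
import java.util.concurrent.TimeUnit;

// The cleaner remembers t1 (the point it already cleaned up to) and, on the
// next run, considers files modified in the window (t1, now - x hours].
class KeepLatestByHoursWindow {
  static boolean inCleanWindow(long fileModifiedMs, long t1Ms, long retainHours) {
    long cutoffMs = System.currentTimeMillis() - TimeUnit.HOURS.toMillis(retainHours);
    return fileModifiedMs > t1Ms && fileModifiedMs <= cutoffMs;
  }
}
```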





[hudi] branch master updated: [HUDI-6300] Fix file size parallelism not work when init metadata table (#8856)

2023-07-18 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new bce55f0c165 [HUDI-6300] Fix file size parallelism not work when init metadata table (#8856)
bce55f0c165 is described below

commit bce55f0c1651949a1dfddaaf343d62cf76574063
Author: KnightChess <981159...@qq.com>
AuthorDate: Wed Jul 19 10:26:13 2023 +0800

    [HUDI-6300] Fix file size parallelism not work when init metadata table (#8856)

Co-authored-by: Y Ethan Guo 
---
 .../hudi/metadata/HoodieTableMetadataUtil.java | 140 ++---
 1 file changed, 66 insertions(+), 74 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
index cd87f6ff59c..56f478e781c 100644
--- a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
+++ b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
@@ -86,6 +86,7 @@ import java.util.HashSet;
 import java.util.LinkedList;
 import java.util.List;
 import java.util.Map;
+import java.util.Objects;
 import java.util.Set;
 import java.util.function.BiFunction;
 import java.util.function.Function;
@@ -850,59 +851,56 @@ public class HoodieTableMetadataUtil {
                                                                    String instantTime) {
     HoodieData<HoodieRecord> allRecordsRDD = engineContext.emptyHoodieData();
 
-    List<Pair<String, List<String>>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet()
-        .stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList());
-    int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1);
-    HoodieData<Pair<String, List<String>>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism);
+    List<Pair<String, String>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet().stream().flatMap(entry -> {
+      return entry.getValue().stream().map(file -> Pair.of(entry.getKey(), file));
+    }).collect(Collectors.toList());
 
-    HoodieData<HoodieRecord> deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> {
-      final String partitionName = partitionToDeletedFilesPair.getLeft();
-      final List<String> deletedFileList = partitionToDeletedFilesPair.getRight();
-      return deletedFileList.stream().flatMap(deletedFile -> {
-        if (!FSUtils.isBaseFile(new Path(deletedFile))) {
-          return Stream.empty();
-        }
+    int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1);
+    HoodieData<Pair<String, String>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism);
 
-        final String partition = getPartitionIdentifier(partitionName);
-        return Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord(
-            partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true));
-      }).iterator();
-    });
+    HoodieData<HoodieRecord> deletedFilesRecordsRDD = partitionToDeletedFilesRDD.map(partitionToDeletedFilePair -> {
+      String partitionName = partitionToDeletedFilePair.getLeft();
+      String deletedFile = partitionToDeletedFilePair.getRight();
+      if (!FSUtils.isBaseFile(new Path(deletedFile))) {
+        return null;
+      }
+      final String partition = getPartitionIdentifier(partitionName);
+      return (HoodieRecord) (HoodieMetadataPayload.createBloomFilterMetadataRecord(
+          partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true));
+    }).filter(Objects::nonNull);
     allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
 
-    List<Pair<String, Map<String, Long>>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet()
-        .stream().map(entry -> Pair.of(entry.getKey(), entry.getValue())).collect(Collectors.toList());
+    List<Pair<String, String>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet().stream().flatMap(entry -> {
+      return entry.getValue().keySet().stream().map(file -> Pair.of(entry.getKey(), file));
+    }).collect(Collectors.toList());
+
     parallelism = Math.max(Math.min(partitionToAppendedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1);
-    HoodieData<Pair<String, Map<String, Long>>> partitionToAppendedFilesRDD = engineContext.parallelize(partitionToAppendedFilesList, parallelism);
+    HoodieData<Pair<String, String>> partitionToAppendedFilesRDD = engineContext.parallelize(partitionToAppendedFilesList, parallelism);
 
-    HoodieData<HoodieRecord> appendedFilesRecordsRDD = partitionToAppendedFilesRDD.flatMap(partitionToAppendedFilesPair -> {
-      final String partitionName = partitionToAppendedFilesPair.getLeft();
-      final Map<String, Long> appendedFileMap = partitionToAppendedFiles

[GitHub] [hudi] hbgstc123 opened a new pull request, #9226: [HUDI-6352] take actual commit time (StateTransitionTime) into consid…

2023-07-18 Thread via GitHub


hbgstc123 opened a new pull request, #9226:
URL: https://github.com/apache/hudi/pull/9226

   …eration when getting the oldest instant to retain for clustering from archival.
   
   According to the current logic of `ClusteringUtils#getOldestInstantToRetainForClustering`, if the timeline of a Hudi table is `replace1 commit2 clean3` and the earliestInstantToRetain of clean3 is commit2, then replace1 is considered ready for archival no matter when it completed. But if replace1 completed after clean3, the files replaced by replace1 have not been cleaned, so it should not be archived. This PR fixes that case.
   
   ### Change Logs
   
   Add logic to `ClusteringUtils#getOldestInstantToRetainForClustering` to make sure a replace commit is not archived if its actual completion time (StateTransitionTime) is later than that of the latest completed clean instant; a sketch of the guard follows.
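A hypothetical sketch of that guard (not the PR's exact code): it compares completion times via StateTransitionTime rather than the instant timestamps.

```java
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;

// Keep the replace commit out of archival while it actually completed after
// the latest completed clean.
class ClusteringArchivalGuard {
  static boolean completedAfterClean(HoodieInstant replaceInstant, HoodieInstant latestClean) {
    return HoodieTimeline.compareTimestamps(
        replaceInstant.getStateTransitionTime(), HoodieTimeline.GREATER_THAN,
        latestClean.getStateTransitionTime());
  }
}
```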
   
   ### Impact
   
   none
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] yihua merged pull request #8856: [HUDI-6300] Fix file size parallelism not work when init metadata table

2023-07-18 Thread via GitHub


yihua merged PR #8856:
URL: https://github.com/apache/hudi/pull/8856





[jira] [Created] (HUDI-6559) Add Sharing Group for Compaction

2023-07-18 Thread Bo Cui (Jira)
Bo Cui created HUDI-6559:


 Summary: Add Sharing Group for Compaction
 Key: HUDI-6559
 URL: https://issues.apache.org/jira/browse/HUDI-6559
 Project: Apache Hudi
  Issue Type: Improvement
  Components: flink
Reporter: Bo Cui


If compaction is enabled, compaction shares resources with the write operator. When compaction is under heavy pressure, the performance of the write operator is affected.





[GitHub] [hudi] hudi-bot commented on pull request #9224: seems to be working

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9224:
URL: https://github.com/apache/hudi/pull/9224#issuecomment-1641197813

   
   ## CI report:

   * 558ee6903fe1985b41ad70205bf648a2b464fc38 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18674)

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] vijayasarathib opened a new pull request, #9225: Documentation change to Increase readability for basic_configurations

2023-07-18 Thread via GitHub


vijayasarathib opened a new pull request, #9225:
URL: https://github.com/apache/hudi/pull/9225

   Update basic_configurations.md
   
   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] psendyk commented on issue #8890: [SUPPORT] Spark structured streaming ingestion into Hudi fails after an upgrade to 0.12.2

2023-07-18 Thread via GitHub


psendyk commented on issue #8890:
URL: https://github.com/apache/hudi/issues/8890#issuecomment-1641148436

   I tested it again using the options @zyclove posted above and the job still fails with the same error. Also, this time I tested it on a fresh table to make sure there were no issues with our production table. I ingested ~1B records from Kafka to a new S3 location, written to ~18k partitions. So it should be reproducible; let me know if you need any additional details.





[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table

2023-07-18 Thread via GitHub


hudi-bot commented on PR #8847:
URL: https://github.com/apache/hudi/pull/8847#issuecomment-1641122278

   
   ## CI report:

   * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN
   * 1f8c2e4cb0da6d322b9f03657463b406f189350a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18673)

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17

2023-07-18 Thread via GitHub


CTTY commented on code in PR #9136:
URL: https://github.com/apache/hudi/pull/9136#discussion_r1267380077


##
pom.xml:
##
@@ -2614,6 +2614,18 @@
   
 
 
+
+  java17
+  
+    -Xmx2g --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djol.magicFieldOffset=true

Review Comment:
   That's a good question. I think these would be needed at runtime until we can confirm Hudi doesn't use some public Java 8 APIs that were later made private. But I'm not sure how we can confirm that without compiling Hudi with Java 17. Maybe we can try removing some of them to see if tests fail? (A minimal probe of what these flags unlock is sketched below.)
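A minimal probe showing why the `--add-opens` flags matter on Java 17 (an illustrative assumption, not Hudi code):

```java
import java.lang.reflect.Field;

// On Java 17 this throws java.lang.reflect.InaccessibleObjectException unless
// the JVM is started with --add-opens=java.base/java.lang=ALL-UNNAMED.
public class AddOpensProbe {
  public static void main(String[] args) throws Exception {
    Field value = String.class.getDeclaredField("value");
    value.setAccessible(true); // the reflective access the flag permits
    System.out.println("java.lang is opened to the unnamed module");
  }
}
```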






[GitHub] [hudi] hudi-bot commented on pull request #9224: seems to be working

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9224:
URL: https://github.com/apache/hudi/pull/9224#issuecomment-1641084419

   
   ## CI report:

   * 74d2ddcf295168b82be4a26e383c8e7495487107 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18671)
   * 558ee6903fe1985b41ad70205bf648a2b464fc38 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18674)

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9136: [HUDI-6509] Add GitHub CI for Java 17

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9136:
URL: https://github.com/apache/hudi/pull/9136#issuecomment-1641084226

   
   ## CI report:

   * a0e7207fb19738237d56fa0060c91cb7865ae9c0 UNKNOWN
   * cda1e7724e6267ec471d8c318cd22703a2ecb69f UNKNOWN
   * 0909e9991595a5f6c48181ff8db82a6dbebc49b8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18672)

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #9224: seems to be working

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9224:
URL: https://github.com/apache/hudi/pull/9224#issuecomment-1641078598

   
   ## CI report:

   * 74d2ddcf295168b82be4a26e383c8e7495487107 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18671)
   * 558ee6903fe1985b41ad70205bf648a2b464fc38 UNKNOWN

   Bot commands: @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17

2023-07-18 Thread via GitHub


CTTY commented on code in PR #9136:
URL: https://github.com/apache/hudi/pull/9136#discussion_r1267318336


##
pom.xml:
##
@@ -156,7 +156,7 @@
 flink-clients
 
flink-connector-kafka
 
flink-hadoop-compatibility_2.12
-5.17.2
+7.5.3

Review Comment:
   Right, RocksDB `5.17.2` would throw `NoClassDefFoundError` when running `TestHoodieLogFormat`:
   
   ```
   [ERROR] testBasicAppendAndScanMultipleFiles{DiskMapType, boolean, boolean, boolean}[10]  Time elapsed: 0.118 s  <<< ERROR!
   2023-07-13T23:41:36.1420947Z java.lang.NoClassDefFoundError: Could not initialize class org.rocksdb.DBOptions
   ```






[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17

2023-07-18 Thread via GitHub


CTTY commented on code in PR #9136:
URL: https://github.com/apache/hudi/pull/9136#discussion_r1267359024


##
packaging/bundle-validation/ci_run.sh:
##
@@ -110,95 +112,116 @@ fi
 TMP_JARS_DIR=/tmp/jars/$(date +%s)
 mkdir -p $TMP_JARS_DIR
 
-if [[ "$HUDI_VERSION" == *"SNAPSHOT" ]]; then
-  cp ${GITHUB_WORKSPACE}/packaging/hudi-flink-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
-  cp ${GITHUB_WORKSPACE}/packaging/hudi-hadoop-mr-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
-  cp ${GITHUB_WORKSPACE}/packaging/hudi-kafka-connect-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
-  cp ${GITHUB_WORKSPACE}/packaging/hudi-spark-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
-  cp ${GITHUB_WORKSPACE}/packaging/hudi-utilities-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
-  cp ${GITHUB_WORKSPACE}/packaging/hudi-utilities-slim-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
-  cp ${GITHUB_WORKSPACE}/packaging/hudi-metaserver-server-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
-  echo 'Validating jars below:'
-else
-  echo 'Adding environment variables for bundles in the release candidate'
-
-  HUDI_HADOOP_MR_BUNDLE_NAME=hudi-hadoop-mr-bundle
-  HUDI_KAFKA_CONNECT_BUNDLE_NAME=hudi-kafka-connect-bundle
-  HUDI_METASERVER_SERVER_BUNDLE_NAME=hudi-metaserver-server-bundle
-
-  if [[ ${SPARK_PROFILE} == 'spark' ]]; then
-    HUDI_SPARK_BUNDLE_NAME=hudi-spark-bundle_2.11
-    HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.11
-    HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.11
-  elif [[ ${SPARK_PROFILE} == 'spark2.4' ]]; then
-    HUDI_SPARK_BUNDLE_NAME=hudi-spark2.4-bundle_2.11
-    HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.11
-    HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.11
-  elif [[ ${SPARK_PROFILE} == 'spark3.1' ]]; then
-    HUDI_SPARK_BUNDLE_NAME=hudi-spark3.1-bundle_2.12
-    HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.12
-    HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.12
-  elif [[ ${SPARK_PROFILE} == 'spark3.2' ]]; then
-    HUDI_SPARK_BUNDLE_NAME=hudi-spark3.2-bundle_2.12
-    HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.12
-    HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.12
-  elif [[ ${SPARK_PROFILE} == 'spark3.3' ]]; then
-    HUDI_SPARK_BUNDLE_NAME=hudi-spark3.3-bundle_2.12
-    HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.12
-    HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.12
-  elif [[ ${SPARK_PROFILE} == 'spark3' ]]; then
-    HUDI_SPARK_BUNDLE_NAME=hudi-spark3-bundle_2.12
-    HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.12
-    HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.12
-  fi
+if [[ -z "$MODE" ]] || [[ "$MODE" != "java17" ]]; then
+  if [[ "$HUDI_VERSION" == *"SNAPSHOT" ]]; then
+    cp ${GITHUB_WORKSPACE}/packaging/hudi-flink-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
+    cp ${GITHUB_WORKSPACE}/packaging/hudi-hadoop-mr-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
+    cp ${GITHUB_WORKSPACE}/packaging/hudi-kafka-connect-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
+    cp ${GITHUB_WORKSPACE}/packaging/hudi-spark-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
+    cp ${GITHUB_WORKSPACE}/packaging/hudi-utilities-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
+    cp ${GITHUB_WORKSPACE}/packaging/hudi-utilities-slim-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
+    cp ${GITHUB_WORKSPACE}/packaging/hudi-metaserver-server-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/
+    echo 'Validating jars below:'
+  else
+    echo 'Adding environment variables for bundles in the release candidate'
+
+    HUDI_HADOOP_MR_BUNDLE_NAME=hudi-hadoop-mr-bundle
+    HUDI_KAFKA_CONNECT_BUNDLE_NAME=hudi-kafka-connect-bundle
+    HUDI_METASERVER_SERVER_BUNDLE_NAME=hudi-metaserver-server-bundle
+
+    if [[ ${SPARK_PROFILE} == 'spark' ]]; then
+      HUDI_SPARK_BUNDLE_NAME=hudi-spark-bundle_2.11
+      HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.11
+      HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.11
+    elif [[ ${SPARK_PROFILE} == 'spark2.4' ]]; then
+      HUDI_SPARK_BUNDLE_NAME=hudi-spark2.4-bundle_2.11
+      HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.11
+      HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.11
+    elif [[ ${SPARK_PROFILE} == 'spark3.1' ]]; then
+      HUDI_SPARK_BUNDLE_NAME=hudi-spark3.1-bundle_2.12
+      HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.12
+  HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.12
+elif [[ ${SPARK_PROFILE} == 'spark3.2' ]]; then
+  HUDI_SPARK_BUNDLE_NAME=hudi-spark3.2-bundle_2.12
+  HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.12
+  HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.12
+elif [[ ${SPARK_PROFILE} == 'spark3.3' ]]; then
+  HUDI_SPARK_BUNDLE_NAME=hudi-spark3.3-bundle_2.12

[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17

2023-07-18 Thread via GitHub


CTTY commented on code in PR #9136:
URL: https://github.com/apache/hudi/pull/9136#discussion_r1267358058


##
.github/workflows/bot.yml:
##
@@ -112,6 +112,91 @@ jobs:
 run:
   mvn test -Pfunctional-tests -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
 
+  test-spark-java17:
+runs-on: ubuntu-latest
+strategy:
+  matrix:
+include:
+  - scalaProfile: "scala-2.12"
+sparkProfile: "spark3.3"
+sparkModules: "hudi-spark-datasource/hudi-spark3.3.x"
+  - scalaProfile: "scala-2.12"
+sparkProfile: "spark3.4"
+sparkModules: "hudi-spark-datasource/hudi-spark3.4.x"
+
+steps:
+  - uses: actions/checkout@v3
+  - name: Set up JDK 8
+uses: actions/setup-java@v3
+with:
+  java-version: '8'
+  distribution: 'adopt'
+  architecture: x64
+  - name: Build Project
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+run:
+  mvn clean install -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-DskipTests=true $MVN_ARGS
+  - name: Set up JDK 17
+uses: actions/setup-java@v3
+with:
+  java-version: '17'
+  distribution: 'adopt'
+  architecture: x64
+  - name: Quickstart Test
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+run:
+  mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-pl hudi-examples/hudi-examples-spark $MVN_ARGS
+  - name: UT - Common & Spark
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+  SPARK_MODULES: ${{ matrix.sparkModules }}
+if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 
as it's covered by Azure CI
+run:
+  mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-pl "hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
+  - name: FT - Spark
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+  SPARK_MODULES: ${{ matrix.sparkModules }}
+if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 
as it's covered by Azure CI
+run:
+  mvn test -Pfunctional-tests -Pjava17 -D"$SCALA_PROFILE" 
-D"$SPARK_PROFILE" -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
+
+  docker-test-java17:

Review Comment:
   Those tests are generally the same as bundle validation, but there is one difference: they require building Hudi within Docker because they run tests with the `mvn test` command, while bundle validation builds Hudi outside of Docker and only copies jars/bundles into Docker. If we consolidated them into one job, that job would need to build twice, which would make it much slower.
   
   If we keep them as two separate jobs, the Docker test only has to build the `hudi-common` modules in Docker, which is relatively fast, and bundle validation can keep its current behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17

2023-07-18 Thread via GitHub


CTTY commented on code in PR #9136:
URL: https://github.com/apache/hudi/pull/9136#discussion_r1267351602


##
hudi-common/pom.xml:
##
@@ -248,6 +248,13 @@
   
 
 
+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-streaming-kafka-0-10_${scala.binary.version}</artifactId>
+      <scope>test</scope>
+      <version>${spark.version}</version>
+    </dependency>

Review Comment:
   I can't remember exactly, but I think there were some issues when this is 
removed. Will need to double check



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17

2023-07-18 Thread via GitHub


CTTY commented on code in PR #9136:
URL: https://github.com/apache/hudi/pull/9136#discussion_r1267352893


##
.github/workflows/bot.yml:
##
@@ -112,6 +112,91 @@ jobs:
 run:
   mvn test -Pfunctional-tests -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
 
+  test-spark-java17:
+runs-on: ubuntu-latest
+strategy:
+  matrix:
+include:
+  - scalaProfile: "scala-2.12"
+sparkProfile: "spark3.3"
+sparkModules: "hudi-spark-datasource/hudi-spark3.3.x"
+  - scalaProfile: "scala-2.12"
+sparkProfile: "spark3.4"
+sparkModules: "hudi-spark-datasource/hudi-spark3.4.x"
+
+steps:
+  - uses: actions/checkout@v3
+  - name: Set up JDK 8
+uses: actions/setup-java@v3
+with:
+  java-version: '8'
+  distribution: 'adopt'
+  architecture: x64
+  - name: Build Project
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+run:
+  mvn clean install -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-DskipTests=true $MVN_ARGS
+  - name: Set up JDK 17
+uses: actions/setup-java@v3
+with:
+  java-version: '17'
+  distribution: 'adopt'
+  architecture: x64
+  - name: Quickstart Test
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+run:
+  mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-pl hudi-examples/hudi-examples-spark $MVN_ARGS
+  - name: UT - Common & Spark
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+  SPARK_MODULES: ${{ matrix.sparkModules }}
+if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 
as it's covered by Azure CI
+run:
+  mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-pl "hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
+  - name: FT - Spark
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+  SPARK_MODULES: ${{ matrix.sparkModules }}
+if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 
as it's covered by Azure CI
+run:
+  mvn test -Pfunctional-tests -Pjava17 -D"$SCALA_PROFILE" 
-D"$SPARK_PROFILE" -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
+
+  docker-test-java17:
+runs-on: ubuntu-latest
+strategy:
+  matrix:
+include:
+  - flinkProfile: 'flink1.17'
+sparkProfile: 'spark3.4'
+sparkRuntime: 'spark3.4.0'

Review Comment:
   Existing bundle validation still uses Spark 3.4.0. I guess we can bump it, but should we do that in a separate PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17

2023-07-18 Thread via GitHub


CTTY commented on code in PR #9136:
URL: https://github.com/apache/hudi/pull/9136#discussion_r1267350100


##
hudi-common/src/test/java/org/apache/hudi/avro/TestHoodieAvroUtils.java:
##
@@ -450,10 +450,8 @@ public void testGenerateProjectionSchema() {
 assertTrue(fieldNames1.contains("_row_key"));
 assertTrue(fieldNames1.contains("timestamp"));
 
-assertEquals("Field fake_field not found in log schema. Query cannot 
proceed! Derived Schema Fields: "
-+ "[non_pii_col, _hoodie_commit_time, _row_key, 
_hoodie_partition_path, _hoodie_record_key, pii_col,"
-+ " _hoodie_commit_seqno, _hoodie_file_name, timestamp]",
-assertThrows(HoodieException.class, () ->
-HoodieAvroUtils.generateProjectionSchema(originalSchema, 
Arrays.asList("_row_key", "timestamp", "fake_field"))).getMessage());
+assertTrue(assertThrows(HoodieException.class, () ->
+HoodieAvroUtils.generateProjectionSchema(originalSchema, 
Arrays.asList("_row_key", "timestamp", "fake_field")))
+.getMessage().contains("Field fake_field not found in log schema. 
Query cannot proceed!"));

Review Comment:
   The order of the fields in the message seems to change on Java 17, but the content is the same. Ref: https://github.com/apache/hudi/pull/8955#issuecomment-1624527608
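   
   For assertions that must survive such ordering changes, a small sketch (not code from this PR) of comparing the derived fields as a set instead of matching the concatenated message:
   
   ```java
   import static org.junit.jupiter.api.Assertions.assertEquals;
   
   import java.util.Arrays;
   import java.util.HashSet;
   import java.util.Set;
   import java.util.stream.Collectors;
   import org.apache.avro.Schema;
   
   class OrderInsensitiveAssertionSketch {
     // Comparing field names as a Set is immune to the iteration-order
     // differences between Java 8 and Java 17.
     static void assertSameFields(Schema schema, String... expectedNames) {
       Set<String> expected = new HashSet<>(Arrays.asList(expectedNames));
       Set<String> actual = schema.getFields().stream()
           .map(Schema.Field::name)
           .collect(Collectors.toSet());
       assertEquals(expected, actual);
     }
   }
   ```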



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] soumilshah1995 closed issue #9210: [SUPPORT] Apache Hudi Partition Compaction

2023-07-18 Thread via GitHub


soumilshah1995 closed issue #9210: [SUPPORT] Apache Hudi Partition Compaction 
URL: https://github.com/apache/hudi/issues/9210


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9224: seems to be working

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9224:
URL: https://github.com/apache/hudi/pull/9224#issuecomment-1640957631

   
   ## CI report:
   
   * 74d2ddcf295168b82be4a26e383c8e7495487107 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18671)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table

2023-07-18 Thread via GitHub


hudi-bot commented on PR #8856:
URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640956642

   
   ## CI report:
   
   * 5dc00a2d02cca3b242a54c3294ef3c30d6a66b3f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18670)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9136: [HUDI-6509] Add GitHub CI for Java 17

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9136:
URL: https://github.com/apache/hudi/pull/9136#issuecomment-1640947990

   
   ## CI report:
   
   * a0e7207fb19738237d56fa0060c91cb7865ae9c0 UNKNOWN
   * cda1e7724e6267ec471d8c318cd22703a2ecb69f UNKNOWN
   * 73d4660734fbcf528b482df2460944ba51431eea Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18645)
 
   * 0909e9991595a5f6c48181ff8db82a6dbebc49b8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18672)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6558) Support SQL Update for CoW when no precombine field is defined

2023-07-18 Thread kazdy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kazdy updated HUDI-6558:

Description: Updates without a precombine field (for CoW only) are already supported in MERGE INTO

> Support SQL Update for CoW when no precombine field is defined
> --
>
> Key: HUDI-6558
> URL: https://issues.apache.org/jira/browse/HUDI-6558
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: kazdy
>Priority: Major
>
> Updates without a precombine field (for CoW only) are already supported in
> MERGE INTO
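
As an illustrative sketch only (the table and column names are hypothetical, and the Hudi Spark session configuration is omitted), the MERGE INTO form that already works without a precombine field looks like:

```java
import org.apache.spark.sql.SparkSession;

public class MergeAsUpdateSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("merge-as-update")
        .master("local[1]")
        .getOrCreate();
    // Express the UPDATE as a MERGE INTO against a CoW table that has
    // no precombine field defined.
    spark.sql("MERGE INTO hudi_cow_tbl t "
        + "USING (SELECT 1 AS id, 25.0 AS price) s "
        + "ON t.id = s.id "
        + "WHEN MATCHED THEN UPDATE SET t.price = s.price");
  }
}
```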



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] yihua commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17

2023-07-18 Thread via GitHub


yihua commented on code in PR #9136:
URL: https://github.com/apache/hudi/pull/9136#discussion_r1267265487


##
pom.xml:
##
@@ -2614,6 +2614,18 @@
   
 
 
+
+  java17
+  
+-Xmx2g --add-opens=java.base/java.lang=ALL-UNNAMED 
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED 
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED 
--add-opens=java.base/java.io=ALL-UNNAMED 
--add-opens=java.base/java.net=ALL-UNNAMED 
--add-opens=java.base/java.nio=ALL-UNNAMED 
--add-opens=java.base/java.util=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED 
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED 
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED 
--add-opens=java.base/sun.security.action=ALL-UNNAMED 
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED 
--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED 
-Djol.magicFieldOffset=true

Review Comment:
   Are these args needed for running production jobs on Java 17?



##
pom.xml:
##
@@ -156,7 +156,7 @@
     <flink.clients.artifactId>flink-clients</flink.clients.artifactId>
     <flink.connector.kafka.artifactId>flink-connector-kafka</flink.connector.kafka.artifactId>
     <flink.hadoop.compatibility.artifactId>flink-hadoop-compatibility_2.12</flink.hadoop.compatibility.artifactId>
-    <rocksdbjni.version>5.17.2</rocksdbjni.version>
+    <rocksdbjni.version>7.5.3</rocksdbjni.version>

Review Comment:
   Does RocksDB `5.17.2` not work? A dependency version upgrade has a larger impact.



##
packaging/bundle-validation/docker_test_java17.sh:
##
@@ -0,0 +1,170 @@
+#!/bin/bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#
+# NOTE: this script runs inside hudi-ci-bundle-validation container
+# $WORKDIR/jars/ is to mount to a host directory where bundle jars are placed
+# $WORKDIR/data/ is to mount to a host directory where test data are placed 
with structures like
+#- /schema.avsc
+#- /data/
+#
+

Review Comment:
   Could we consolidate the test logic of this into `validate.sh` and reuse the existing validate-bundles job?



##
style/checkstyle.xml:
##
@@ -269,7 +269,7 @@
 
 
 
+  value="^java\.util\.Optional, ^org\.junit\.(?!jupiter|platform|contrib|Rule|runner|Assume)(.*)"/>

Review Comment:
   Let's use the Jupiter version instead of changing this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6558) Support SQL Update for CoW when no precombine field is defined

2023-07-18 Thread kazdy (Jira)
kazdy created HUDI-6558:
---

 Summary: Support SQL Update for CoW when no precombine field is 
defined
 Key: HUDI-6558
 URL: https://issues.apache.org/jira/browse/HUDI-6558
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: kazdy






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] Sam-Serpoosh commented on issue #9143: [SUPPORT] Failure to delete records with missing attributes from PostgresDebeziumSource

2023-07-18 Thread via GitHub


Sam-Serpoosh commented on issue #9143:
URL: https://github.com/apache/hudi/issues/9143#issuecomment-1640911318

   @ad1happy2go  Looks like `REPLICA IDENTITY FULL` is mostly discouraged by PG 
([interesting article](https://xata.io/blog/replica-identity-full-performance) 
and [SO Thread](https://stackoverflow.com/a/67979022/1433222)). It would be 
**ideal** not to have to change this setting to `FULL` to avoid the downsides.
   
   I know Hudi has the limitation on **global uniqueness** when dealing with 
**partitioned Hudi Tables**. So is there any way to make this work with 
**partitioned Hudi Tables** without having to set REPLICA IDENTITY to `FULL`?
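   
   One direction sometimes suggested for this (a sketch under assumptions, not a confirmed fix for this issue) is to use a global index, so the index lookup resolves a record's partition from the record key alone and the delete payload no longer needs the partition column that a non-`FULL` replica identity omits:
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   public class GlobalIndexDeleteSketch {
     // keysDf only needs the record key columns; with a global index, the
     // delete resolves each key's partition during the index lookup.
     static void deleteByKey(Dataset<Row> keysDf, String tablePath) {
       Map<String, String> opts = new HashMap<>();
       opts.put("hoodie.datasource.write.operation", "delete");
       opts.put("hoodie.index.type", "GLOBAL_BLOOM");
       keysDf.write().format("hudi").options(opts).mode(SaveMode.Append).save(tablePath);
     }
   }
   ```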


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Comment Edited] (HUDI-6556) Big Query sync with master code failing for partitioned table with the Exception

2023-07-18 Thread Aditya Goenka (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744345#comment-17744345
 ] 

Aditya Goenka edited comment on HUDI-6556 at 7/18/23 7:53 PM:
--

While testing, I was using a slash-encoded partitionPath, which was causing this issue.

It's working as expected after I updated the partition column, so I am closing this issue.


was (Author: JIRAUSER299651):
While testing, I was using a slash-encoded partitionPath, which was causing this issue.

It's working as expected, so I am closing this issue.

> Big Query sync with master code failing for partitioned table with the 
> Exception
> 
>
> Key: HUDI-6556
> URL: https://issues.apache.org/jira/browse/HUDI-6556
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: Aditya Goenka
>Priority: Blocker
>  Labels: 0.14.0
>
> While doing Big Query Sync for a partitioned table, it's failing with the 
> exception below: 
> error message: Failed to add partition key partitionpath (type: TYPE_STRING) 
> to schema, because another column with the same name was already present. 
> This is not allowed. Full partition schema: [partitionpath:TYPE_STRING]."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6556) Big Query sync with master code failing for partitioned table with the Exception

2023-07-18 Thread Aditya Goenka (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Goenka closed HUDI-6556.
---
Resolution: Not A Problem

> Big Query sync with master code failing for partitioned table with the 
> Exception
> 
>
> Key: HUDI-6556
> URL: https://issues.apache.org/jira/browse/HUDI-6556
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: Aditya Goenka
>Priority: Blocker
>  Labels: 0.14.0
>
> While doing Big Query Sync for a partitioned table, it's failing with the 
> exception below: 
> error message: Failed to add partition key partitionpath (type: TYPE_STRING) 
> to schema, because another column with the same name was already present. 
> This is not allowed. Full partition schema: [partitionpath:TYPE_STRING]."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6556) Big Query sync with master code failing for partitioned table with the Exception

2023-07-18 Thread Aditya Goenka (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744345#comment-17744345
 ] 

Aditya Goenka commented on HUDI-6556:
-

While testing, I was using a slash-encoded partitionPath, which was causing this issue.

It's working as expected, so I am closing this issue.

> Big Query sync with master code failing for partitioned table with the 
> Exception
> 
>
> Key: HUDI-6556
> URL: https://issues.apache.org/jira/browse/HUDI-6556
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: Aditya Goenka
>Priority: Blocker
>  Labels: 0.14.0
>
> While doing Big Query Sync for a partitioned table, it's failing with the 
> exception below: 
> error message: Failed to add partition key partitionpath (type: TYPE_STRING) 
> to schema, because another column with the same name was already present. 
> This is not allowed. Full partition schema: [partitionpath:TYPE_STRING]."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] ad1happy2go commented on issue #9042: [SUPPORT] Cannot write nullable values to non-null column

2023-07-18 Thread via GitHub


ad1happy2go commented on issue #9042:
URL: https://github.com/apache/hudi/issues/9042#issuecomment-1640905833

   Yes, I also confirmed that I am not seeing this issue with master.
   
   @dht7 Can you check with the master code, if possible, to see whether you are still facing this issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table

2023-07-18 Thread via GitHub


hudi-bot commented on PR #8847:
URL: https://github.com/apache/hudi/pull/8847#issuecomment-1640904862

   
   ## CI report:
   
   * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN
   * 29abf1ce1345bfe299685fcc3b496f365f109e76 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18664)
 
   * 1f8c2e4cb0da6d322b9f03657463b406f189350a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18673)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17

2023-07-18 Thread via GitHub


yihua commented on code in PR #9136:
URL: https://github.com/apache/hudi/pull/9136#discussion_r1267072983


##
.github/workflows/bot.yml:
##
@@ -112,6 +112,91 @@ jobs:
 run:
   mvn test -Pfunctional-tests -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
 
+  test-spark-java17:
+runs-on: ubuntu-latest
+strategy:
+  matrix:
+include:
+  - scalaProfile: "scala-2.12"
+sparkProfile: "spark3.3"
+sparkModules: "hudi-spark-datasource/hudi-spark3.3.x"
+  - scalaProfile: "scala-2.12"
+sparkProfile: "spark3.4"
+sparkModules: "hudi-spark-datasource/hudi-spark3.4.x"
+
+steps:
+  - uses: actions/checkout@v3
+  - name: Set up JDK 8
+uses: actions/setup-java@v3
+with:
+  java-version: '8'
+  distribution: 'adopt'
+  architecture: x64
+  - name: Build Project
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+run:
+  mvn clean install -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-DskipTests=true $MVN_ARGS
+  - name: Set up JDK 17
+uses: actions/setup-java@v3
+with:
+  java-version: '17'
+  distribution: 'adopt'
+  architecture: x64
+  - name: Quickstart Test
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+run:
+  mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-pl hudi-examples/hudi-examples-spark $MVN_ARGS
+  - name: UT - Common & Spark
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+  SPARK_MODULES: ${{ matrix.sparkModules }}
+if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 
as it's covered by Azure CI
+run:
+  mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-pl "hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
+  - name: FT - Spark
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+  SPARK_MODULES: ${{ matrix.sparkModules }}
+if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 
as it's covered by Azure CI
+run:
+  mvn test -Pfunctional-tests -Pjava17 -D"$SCALA_PROFILE" 
-D"$SPARK_PROFILE" -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
+
+  docker-test-java17:

Review Comment:
   Could this be run with `validate-bundles` since it already validates bundles 
on Java 17?  Any reason to have a separate job here?



##
.github/workflows/bot.yml:
##
@@ -112,6 +112,91 @@ jobs:
 run:
   mvn test -Pfunctional-tests -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
 
+  test-spark-java17:
+runs-on: ubuntu-latest
+strategy:
+  matrix:
+include:
+  - scalaProfile: "scala-2.12"
+sparkProfile: "spark3.3"
+sparkModules: "hudi-spark-datasource/hudi-spark3.3.x"
+  - scalaProfile: "scala-2.12"
+sparkProfile: "spark3.4"
+sparkModules: "hudi-spark-datasource/hudi-spark3.4.x"
+
+steps:
+  - uses: actions/checkout@v3
+  - name: Set up JDK 8
+uses: actions/setup-java@v3
+with:
+  java-version: '8'
+  distribution: 'adopt'
+  architecture: x64
+  - name: Build Project
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+run:
+  mvn clean install -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-DskipTests=true $MVN_ARGS
+  - name: Set up JDK 17
+uses: actions/setup-java@v3
+with:
+  java-version: '17'
+  distribution: 'adopt'
+  architecture: x64
+  - name: Quickstart Test
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+run:
+  mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-pl hudi-examples/hudi-examples-spark $MVN_ARGS
+  - name: UT - Common & Spark
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+  SPARK_MODULES: ${{ matrix.sparkModules }}
+if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 
as it's covered by Azure CI
+run:
+  mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" 
-pl "hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS
+  - name: FT - Spark
+env:
+  SCALA_PROFILE: ${{ matrix.scalaProfile }}
+  SPARK_PROFILE: ${{ matrix.sparkProfile }}
+  SPARK_MODULES: ${{ matrix.sparkModules }}

[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9203:
URL: https://github.com/apache/hudi/pull/9203#issuecomment-1640897010

   
   ## CI report:
   
   * 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18669)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9136: [HUDI-6509] Add GitHub CI for Java 17

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9136:
URL: https://github.com/apache/hudi/pull/9136#issuecomment-1640896780

   
   ## CI report:
   
   * a0e7207fb19738237d56fa0060c91cb7865ae9c0 UNKNOWN
   * cda1e7724e6267ec471d8c318cd22703a2ecb69f UNKNOWN
   * 73d4660734fbcf528b482df2460944ba51431eea Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18645)
 
   * 0909e9991595a5f6c48181ff8db82a6dbebc49b8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table

2023-07-18 Thread via GitHub


hudi-bot commented on PR #8847:
URL: https://github.com/apache/hudi/pull/8847#issuecomment-1640896014

   
   ## CI report:
   
   * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN
   * 29abf1ce1345bfe299685fcc3b496f365f109e76 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18664)
 
   * 1f8c2e4cb0da6d322b9f03657463b406f189350a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] amrishlal commented on issue #9042: [SUPPORT] Cannot write nullable values to non-null column

2023-07-18 Thread via GitHub


amrishlal commented on issue #9042:
URL: https://github.com/apache/hudi/issues/9042#issuecomment-1640857564

   @ad1happy2go I am not able to reproduce the issue against the latest master version of Hudi using either spark-3.1 or spark-3.2 with the steps you outlined. Do we know if this issue is limited to older versions of Hudi (version 0.12.2, as reported in the description)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9224: seems to be working

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9224:
URL: https://github.com/apache/hudi/pull/9224#issuecomment-1640765610

   
   ## CI report:
   
   * 74d2ddcf295168b82be4a26e383c8e7495487107 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18671)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9224: seems to be working

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9224:
URL: https://github.com/apache/hudi/pull/9224#issuecomment-1640753251

   
   ## CI report:
   
   * 74d2ddcf295168b82be4a26e383c8e7495487107 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-6315) Optimize UPSERT and DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-18 Thread Amrish Lal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amrish Lal closed HUDI-6315.

Resolution: Done

> Optimize UPSERT and DELETE codepath to use meta fields instead of key 
> generation and index lookup
> -
>
> Key: HUDI-6315
> URL: https://issues.apache.org/jira/browse/HUDI-6315
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Amrish Lal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> For MIT, Update, and Delete, we do a lookup in Hudi to find matching records 
> based on the predicates and then trigger the writes following it. But the 
> records fetched from Hudi already contain all the meta fields required 
> for key generation and index lookup (like the record key, partition path, 
> filename, commit time). As of now, we drop those meta fields and trigger 
> an upsert to Hudi (as though someone is writing via spark-datasource). This 
> goes via the regular code path of key generation and index lookup, which is 
> unnecessary. 
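
As a conceptual sketch of the optimization (the table name and predicate are hypothetical; this is not the code from the linked pull requests), keys can be rebuilt directly from the meta columns of the fetched rows:

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hudi.common.model.HoodieKey;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PreppedKeysSketch {
  // The rows fetched for MERGE/UPDATE/DELETE already carry the meta fields,
  // so HoodieKeys can be built without key generation or an index lookup.
  static List<HoodieKey> keysFromFetchedRows(SparkSession spark) {
    Dataset<Row> matched = spark.sql(
        "SELECT _hoodie_record_key, _hoodie_partition_path FROM hudi_tbl WHERE price > 10");
    return matched.collectAsList().stream()
        .map(r -> new HoodieKey(r.getString(0), r.getString(1)))
        .collect(Collectors.toList());
  }
}
```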



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6315) Optimize UPSERT and DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-18 Thread Amrish Lal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744323#comment-17744323
 ] 

Amrish Lal commented on HUDI-6315:
--

Issue has been resolved by the pull requests linked to this ticket.

> Optimize UPSERT and DELETE codepath to use meta fields instead of key 
> generation and index lookup
> -
>
> Key: HUDI-6315
> URL: https://issues.apache.org/jira/browse/HUDI-6315
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Amrish Lal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> For MIT, Update, and Delete, we do a lookup in Hudi to find matching records 
> based on the predicates and then trigger the writes following it. But the 
> records fetched from Hudi already contain all the meta fields required 
> for key generation and index lookup (like the record key, partition path, 
> filename, commit time). As of now, we drop those meta fields and trigger 
> an upsert to Hudi (as though someone is writing via spark-datasource). This 
> goes via the regular code path of key generation and index lookup, which is 
> unnecessary. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-6315) Optimize UPSERT and DELETE codepath to use meta fields instead of key generation and index lookup

2023-07-18 Thread Amrish Lal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amrish Lal resolved HUDI-6315.
--

> Optimize UPSERT and DELETE codepath to use meta fields instead of key 
> generation and index lookup
> -
>
> Key: HUDI-6315
> URL: https://issues.apache.org/jira/browse/HUDI-6315
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Amrish Lal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> For MIT, Update, and Delete, we do a lookup in Hudi to find matching records 
> based on the predicates and then trigger the writes following it. But the 
> records fetched from Hudi already contain all the meta fields required 
> for key generation and index lookup (like the record key, partition path, 
> filename, commit time). As of now, we drop those meta fields and trigger 
> an upsert to Hudi (as though someone is writing via spark-datasource). This 
> goes via the regular code path of key generation and index lookup, which is 
> unnecessary. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] jonvex opened a new pull request, #9224: seems to be working

2023-07-18 Thread via GitHub


jonvex opened a new pull request, #9224:
URL: https://github.com/apache/hudi/pull/9224

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table

2023-07-18 Thread via GitHub


hudi-bot commented on PR #8856:
URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640675026

   
   ## CI report:
   
   * 8b6fba8468a155d39a66dc57acb6ac8c5e29b294 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18668)
 
   * 5dc00a2d02cca3b242a54c3294ef3c30d6a66b3f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18670)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table

2023-07-18 Thread via GitHub


hudi-bot commented on PR #8847:
URL: https://github.com/apache/hudi/pull/8847#issuecomment-1640665241

   
   ## CI report:
   
   * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN
   * 29abf1ce1345bfe299685fcc3b496f365f109e76 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18664)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table

2023-07-18 Thread via GitHub


hudi-bot commented on PR #8856:
URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640665363

   
   ## CI report:
   
   * 2d4e285ba5ef3c5b07ec91af6ab3a2669d2b485d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17565)
 
   * 8b6fba8468a155d39a66dc57acb6ac8c5e29b294 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18668)
 
   * 5dc00a2d02cca3b242a54c3294ef3c30d6a66b3f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ehurheap commented on issue #9079: [SUPPORT] Hudi delete not working when using UuidKeyGenerator

2023-07-18 Thread via GitHub


ehurheap commented on issue #9079:
URL: https://github.com/apache/hudi/issues/9079#issuecomment-1640664456

   Yes, the workaround using the writeClient that we discussed in 
[slack](https://apache-hudi.slack.com/archives/C4D716NPQ/p1689111633808279?thread_ts=1687983367.526889&cid=C4D716NPQ)
 worked for me.
   
   Here is a summary:
   
   we build the writeClient:
   
 ```
   def buildWriteClient(): SparkRDDWriteClient[_] = {
   
   val lockProperties = new Properties() // populate lockProperties as 
appropriate
   val metricsProperties = new Properties() // populate metricsProperties 
as appropriate
   
   val writerConfig = HoodieWriteConfig
 .newBuilder()
 .withCompactionConfig(
   HoodieCompactionConfig
 .newBuilder()
 .withInlineCompaction(true)
 .withScheduleInlineCompaction(false)
 .withMaxNumDeltaCommitsBeforeCompaction(1)
 .build()
 )
 
.withArchivalConfig(HoodieArchivalConfig.newBuilder().withAutoArchive(false).build())
 
.withCleanConfig(HoodieCleanConfig.newBuilder().withAutoClean(false).build())
 
.withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(false).build())
 
.withLockConfig(HoodieLockConfig.newBuilder().fromProperties(lockProperties).build())
 
.withMetricsConfig(HoodieMetricsConfig.newBuilder().fromProperties(metricsProperties).build())
 .withDeleteParallelism(config.deleteParallelism)
 .withPath(config.tablePath)
 .forTable(datalakeRecord.tableName)
 .build()
   val engineContext: HoodieEngineContext = new HoodieSparkEngineContext(
 JavaSparkContext.fromSparkContext(sparkContext)
   )
   new SparkRDDWriteClient(engineContext, writerConfig)
 }
   
   ```
   Then run delete and compaction for the specified keys:
   
   ```
   var deleteInstant: String = ""
   try {
 deleteInstant = writeClient.startCommit()
 writeClient.delete(keysToDelete, deleteInstant)
 // :TRICKY: explicitly calling compaction here: although the write 
client was configured to auto compact in-line, compaction is not in fact 
triggered by this delete operation.
 val maybeCompactionInstant =
   
writeClient.scheduleCompaction(org.apache.hudi.common.util.Option.empty())
 if (maybeCompactionInstant.isPresent) 
writeClient.compact(maybeCompactionInstant.get)
 else
   log.warn(
 s"Unable to schedule compaction after delete operation at instant 
${deleteInstant}"
   )
   } catch {
 case t: Throwable =>
   logErrorAndExit(s"Delete operation failed for instant 
${deleteInstant} due to ", t)
   } finally {
 log.info(s"Finished delete operation for instant ${deleteInstant}")
 writeClient.close()
   }
   
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] CTTY commented on pull request #9136: [HUDI-6509] Add GitHub CI for Java 17

2023-07-18 Thread via GitHub


CTTY commented on PR #9136:
URL: https://github.com/apache/hudi/pull/9136#issuecomment-1640605864

   It seems `validate-bundles(flink1.17, Spark 3.4, Spark 3.4.0)` just consistently fails with JDK 17 on the issue below:
   ```
   Connecting to jdbc:hive2://localhost:1/default
   23/07/17 17:45:48 [main]: WARN jdbc.HiveConnection: Failed to connect to 
localhost:1
   Could not open connection to the HS2 server. Please check the server URI and 
if the URI is correct, then ask the administrator to check the server status.
   Error: Could not open client transport with JDBC Uri: 
jdbc:hive2://localhost:1/default: java.net.ConnectException: Connection 
refused (Connection refused) (state=08S01,code=0)
   Cannot run commands specified using -e. No current connection
   Error: validate.sh HiveQL validation failed.
   Error: Process completed with exit code 1.
   ```
   
   Need to look into this; otherwise everything looks good. The newly added docker-test-java17 and test-spark-java17 jobs are working fine.
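   
   A minimal probe (the HiveServer2 port is assumed to be the default 10000, since the port in the log above appears truncated) to tell an HS2 startup failure apart from a genuine HiveQL problem:
   
   ```java
   import java.sql.Connection;
   import java.sql.DriverManager;
   
   public class Hs2Probe {
     public static void main(String[] args) throws Exception {
       // Requires hive-jdbc on the classpath; throws if HS2 is not up yet.
       Class.forName("org.apache.hive.jdbc.HiveDriver");
       try (Connection conn = DriverManager.getConnection(
           "jdbc:hive2://localhost:10000/default")) {
         System.out.println("HS2 reachable: " + !conn.isClosed());
       }
     }
   }
   ```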


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table

2023-07-18 Thread via GitHub


yihua commented on PR #8856:
URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640577138

   @KnightChess do you have any numbers on the performance improvement in updating the MDT from this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table

2023-07-18 Thread via GitHub


yihua commented on code in PR #8856:
URL: https://github.com/apache/hudi/pull/8856#discussion_r1267028523


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -850,59 +851,58 @@ public static HoodieData<HoodieRecord> convertFilesToBloomFilterRecords(HoodieEngineContext engineContext,
                                                                          String instantTime) {
     HoodieData<HoodieRecord> allRecordsRDD = engineContext.emptyHoodieData();
 
-    List<Pair<String, List<String>>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet()
-        .stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList());
-    int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1);
-    HoodieData<Pair<String, List<String>>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism);
+    List<Pair<String, String>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet().stream().flatMap(entry -> {
+      List<String> filesList = entry.getValue();
+      return filesList.stream().map(file -> Pair.of(entry.getKey(), file));
+    }).collect(Collectors.toList());
 
-    HoodieData<HoodieRecord> deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> {
-      final String partitionName = partitionToDeletedFilesPair.getLeft();
-      final List<String> deletedFileList = partitionToDeletedFilesPair.getRight();
-      return deletedFileList.stream().flatMap(deletedFile -> {
-        if (!FSUtils.isBaseFile(new Path(deletedFile))) {
-          return Stream.empty();
-        }
+    int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1);
+    HoodieData<Pair<String, String>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism);
 
-        final String partition = getPartitionIdentifier(partitionName);
-        return Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord(
-            partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true));
-      }).iterator();
-    });
+    HoodieData<HoodieRecord> deletedFilesRecordsRDD = partitionToDeletedFilesRDD.map(partitionToDeletedFilePair -> {
+      String partitionName = partitionToDeletedFilePair.getLeft();
+      String deletedFile = partitionToDeletedFilePair.getRight();
+      if (!FSUtils.isBaseFile(new Path(deletedFile))) {
+        return null;
+      }
+      final String partition = getPartitionIdentifier(partitionName);
+      return (HoodieRecord) (HoodieMetadataPayload.createBloomFilterMetadataRecord(
+          partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true));
+    }).filter(Objects::nonNull);
     allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
 
-    List<Pair<String, Map<String, Long>>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet()
-        .stream().map(entry -> Pair.of(entry.getKey(), entry.getValue())).collect(Collectors.toList());
+    List<Pair<String, String>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet().stream().flatMap(entry -> {
+      Set<String> filesSet = entry.getValue().keySet();
+      return filesSet.stream().map(file -> Pair.of(entry.getKey(), file));

Review Comment:
   ```suggestion
      return entry.getValue().keySet().stream().map(file -> Pair.of(entry.getKey(), file));
   ```



##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -850,59 +851,58 @@ public static HoodieData<HoodieRecord> convertFilesToBloomFilterRecords(HoodieEngineContext engineContext,
                                                                          String instantTime) {
     HoodieData<HoodieRecord> allRecordsRDD = engineContext.emptyHoodieData();
 
-    List<Pair<String, List<String>>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet()
-        .stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList());
-    int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1);
-    HoodieData<Pair<String, List<String>>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism);
+    List<Pair<String, String>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet().stream().flatMap(entry -> {
+      List<String> filesList = entry.getValue();
+      return filesList.stream().map(file -> Pair.of(entry.getKey(), file));

Review Comment:
   ```suggestion
 return entry.getValue().stream().map(file -> Pair.of(entry.getKey(), 
file));
   ```



##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -927,48 +927,44 @@ public static HoodieData<HoodieRecord> convertFilesToColumnStatsRecords(HoodieEn
       return engineContext.emptyHoodieData();
     }
 
-    final List<Pair<String, List<String>>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet().stream()
-        .map(e -> Pair.of(e.getKey(), e.getValue()))
-        .collect(Collectors.toList());
+    List<Pair<String, String>> partitionToDeletedFilesList =

[GitHub] [hudi] hudi-bot commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9106:
URL: https://github.com/apache/hudi/pull/9106#issuecomment-1640565751

   
   ## CI report:
   
   * 16ae34ec0e91811bae11a980749f5b77d048adba Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18519)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18646)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18663)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9223:
URL: https://github.com/apache/hudi/pull/9223#issuecomment-1640556024

   
   ## CI report:
   
   * 80179dfbcb1179d93d6aaced4f956db72363a347 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18662)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9203:
URL: https://github.com/apache/hudi/pull/9203#issuecomment-1640555900

   
   ## CI report:
   
   * 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18669)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6557) Deletes are not working if using custom timestamp for replica identity DEFAULT

2023-07-18 Thread Aditya Goenka (Jira)
Aditya Goenka created HUDI-6557:
---

 Summary: Deletes are not working if using custom timestamp for 
replica identity DEFAULT
 Key: HUDI-6557
 URL: https://issues.apache.org/jira/browse/HUDI-6557
 Project: Apache Hudi
  Issue Type: Bug
  Components: deltastreamer
Reporter: Aditya Goenka
 Fix For: 0.14.0


Debezium provides only the primary key value for DELETE records; every other field arrives as null or 0. The timestamp converter then converts 0 to 1970-01-01, so Hudi tries to delete the record from that partition and the delete fails (illustrated in the sketch below).

Github issue - [https://github.com/apache/hudi/issues/9143]
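
A minimal sketch (field and class names here are hypothetical, not the Hudi Streamer code path) of why an epoch value of 0 lands in the wrong partition:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class EpochZeroPartition {
  public static void main(String[] args) {
    // Debezium emits 0 for the non-key timestamp field of a DELETE record.
    long debeziumTs = 0L;
    String partitionPath = DateTimeFormatter.ofPattern("yyyy-MM-dd")
        .withZone(ZoneOffset.UTC)
        .format(Instant.ofEpochMilli(debeziumTs));
    // Prints 1970-01-01: the delete is routed to a partition that does
    // not hold the original record, so the delete is effectively a no-op.
    System.out.println(partitionPath);
  }
}
```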



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-6411) Make SQL parameters case insensitive

2023-07-18 Thread Aditya Goenka (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Goenka closed HUDI-6411.
---
Resolution: Fixed

> Make SQL parameters case insensitive 
> -
>
> Key: HUDI-6411
> URL: https://issues.apache.org/jira/browse/HUDI-6411
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Aditya Goenka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0, 0.15.0
>
>
> Users should be able to provide Spark SQL parameters (like recordKey, preCombineField)
> in any case, and we should be able to parse them.
>  
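
A minimal sketch, assuming nothing about Hudi's actual parser, of one common way to make option lookup case-insensitive in Java:

```java
import java.util.Map;
import java.util.TreeMap;

public class CaseInsensitiveOptions {
  public static void main(String[] args) {
    // TreeMap with CASE_INSENSITIVE_ORDER treats keys differing only in
    // case as the same entry.
    Map<String, String> options = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    options.put("primaryKey", "id");
    options.put("preCombineField", "ts");

    System.out.println(options.get("PRIMARYKEY"));      // id
    System.out.println(options.get("precombinefield")); // ts
  }
}
```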



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6411) Make SQL parameters case insensitive

2023-07-18 Thread Aditya Goenka (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744285#comment-17744285
 ] 

Aditya Goenka commented on HUDI-6411:
-

The PR is merged. Closing the JIRA.

> Make SQL parameters case insensitive 
> -
>
> Key: HUDI-6411
> URL: https://issues.apache.org/jira/browse/HUDI-6411
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Aditya Goenka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0, 0.15.0
>
>
> Users should be able to provide Spark SQL parameters (like recordKey, preCombineField)
> in any case, and we should be able to parse them.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6411) Make SQL parameters case insensitive

2023-07-18 Thread Aditya Goenka (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Goenka updated HUDI-6411:

Status: Patch Available  (was: In Progress)

> Make SQL parameters case insensitive 
> -
>
> Key: HUDI-6411
> URL: https://issues.apache.org/jira/browse/HUDI-6411
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Aditya Goenka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0, 0.15.0
>
>
> Users should be able to provide Spark SQL parameters (like recordKey, preCombineField)
> in any case, and we should be able to parse them.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-6411) Make SQL parameters case insensitive

2023-07-18 Thread Aditya Goenka (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Goenka resolved HUDI-6411.
-

> Make SQL parameters case insensitive 
> -
>
> Key: HUDI-6411
> URL: https://issues.apache.org/jira/browse/HUDI-6411
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: Aditya Goenka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.1.0, 0.15.0
>
>
> Users should be able to provide Spark SQL parameters (like recordKey, preCombineField)
> in any case, and we should be able to parse them.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9203:
URL: https://github.com/apache/hudi/pull/9203#issuecomment-1640493692

   
   ## CI report:
   
   * 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18669)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table

2023-07-18 Thread via GitHub


hudi-bot commented on PR #8856:
URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640492244

   
   ## CI report:
   
   * 2d4e285ba5ef3c5b07ec91af6ab3a2669d2b485d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17565)
 
   * 8b6fba8468a155d39a66dc57acb6ac8c5e29b294 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18668)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch asf-site updated: [HUDI-6520] [DOCS] Rename Deltastreamer and related classes and configs (#9179)

2023-07-18 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 6d5a4e2a6a7 [HUDI-6520] [DOCS] Rename Deltastreamer and related 
classes and configs (#9179)
6d5a4e2a6a7 is described below

commit 6d5a4e2a6a71f9f89f901169107e23665d034440
Author: Amrish Lal 
AuthorDate: Tue Jul 18 08:48:22 2023 -0700

[HUDI-6520] [DOCS] Rename Deltastreamer and related classes and configs 
(#9179)

Co-authored-by: Y Ethan Guo 
---
 website/docs/clustering.md|  10 +-
 website/docs/compaction.md|   8 +-
 website/docs/concurrency_control.md   |  20 ++--
 website/docs/deployment.md|  18 +--
 website/docs/docker_demo.md   |  26 ++--
 website/docs/faq.md   |  24 ++--
 website/docs/gcp_bigquery.md  |  10 +-
 website/docs/hoodie_deltastreamer.md  | 163 ++
 website/docs/key_generation.md|  76 ++--
 website/docs/metadata_indexing.md |  14 +--
 website/docs/metrics.md   |   4 +-
 website/docs/migration_guide.md   |   6 +-
 website/docs/precommit_validator.md   |   2 +-
 website/docs/querying_data.md |   2 +-
 website/docs/quick-start-guide.md |   2 +-
 website/docs/s3_hoodie.md |   2 +-
 website/docs/syncing_aws_glue_data_catalog.md |   2 +-
 website/docs/syncing_datahub.md   |  10 +-
 website/docs/syncing_metastore.md |   2 +-
 website/docs/transforms.md|   6 +-
 website/docs/use_cases.md |   2 +-
 website/docs/write_operations.md  |   2 +-
 website/docs/writing_data.md  |   2 +-
 23 files changed, 213 insertions(+), 200 deletions(-)

diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index d2ceb196d02..8eb0dfbfaa1 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -283,17 +283,17 @@ 
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run
 hoodie.clustering.plan.strategy.sort.columns=column1,column2
 ```
 
-### HoodieDeltaStreamer
+### HoodieStreamer
 
-This brings us to our users' favorite utility in Hudi. Now, we can trigger 
asynchronous clustering with DeltaStreamer.
+This brings us to our users' favorite utility in Hudi. Now, we can trigger 
asynchronous clustering with Hudi Streamer.
 Just set the `hoodie.clustering.async.enabled` config to true and specify 
other clustering config in properties file
-whose location can be pased as `—props` when starting the deltastreamer (just 
like in the case of HoodieClusteringJob).
+whose location can be passed as `--props` when starting the Hudi Streamer (just like in the case of HoodieClusteringJob).
 
-A sample spark-submit command to setup HoodieDeltaStreamer is as below:
+A sample spark-submit command to set up HoodieStreamer is shown below:
 
 ```bash
 spark-submit \
---class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
+--class org.apache.hudi.utilities.streamer.HoodieStreamer \
 
/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar
 \
 --props /path/to/config/clustering_kafka.properties \
 --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider 
\
diff --git a/website/docs/compaction.md b/website/docs/compaction.md
index a6249b7ae7c..9f7b119db43 100644
--- a/website/docs/compaction.md
+++ b/website/docs/compaction.md
@@ -45,14 +45,14 @@ import org.apache.spark.sql.streaming.ProcessingTime;
  writer.trigger(new ProcessingTime(3)).start(tablePath);
 ```
 
-### DeltaStreamer Continuous Mode
-Hudi DeltaStreamer provides continuous ingestion mode where a single long 
running spark application  
+### Hudi Streamer Continuous Mode
+Hudi Streamer provides a continuous ingestion mode where a single long-running Spark application
 ingests data to Hudi table continuously from upstream sources. In this mode, 
Hudi supports managing asynchronous
 compactions. Here is an example snippet for running in continuous mode with 
async compactions
 
 ```properties
 spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \
---class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
+--class org.apache.hudi.utilities.streamer.HoodieStreamer \
 --table-type MERGE_ON_READ \
 --target-base-path  \
 --target-table  \
@@ -76,7 +76,7 @@ you may want Synchronous compaction, which means that as a 
commit is written it
 
 Compaction is run synchronously by passing the flag "--disable-compaction" 
(Meaning to disable async compaction scheduling).
 When both ingestion and compaction is running in the same spark context, you 
can use resource allocation configuration 
-in DeltaStreamer CLI s

[GitHub] [hudi] yihua merged pull request #9179: [HUDI-6520] [DOCS] Rename Deltastreamer and related classes and configs

2023-07-18 Thread via GitHub


yihua merged PR #9179:
URL: https://github.com/apache/hudi/pull/9179


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-6555) Big Query Sync failing with Class Not Found Exception for DeleteRecord

2023-07-18 Thread Aditya Goenka (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Goenka updated HUDI-6555:

Priority: Blocker  (was: Major)

> Big Query Sync failing with Class Not Found Exception for DeleteRecord
> --
>
> Key: HUDI-6555
> URL: https://issues.apache.org/jira/browse/HUDI-6555
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: meta-sync
>Reporter: Aditya Goenka
>Priority: Blocker
> Fix For: 0.14.0
>
>
> With version 0.13, BQ sync is failing with the error `Caused by:
> com.esotericsoftware.kryo.KryoException: Unable to find class:
> [Lorg.apache.hudi.common.model.DeleteRecord;` during the Kryo serialization phase.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-6556) Big Query sync with master code failing for partitioned table with the Exception

2023-07-18 Thread Aditya Goenka (Jira)
Aditya Goenka created HUDI-6556:
---

 Summary: Big Query sync with master code failing for partitioned 
table with the Exception
 Key: HUDI-6556
 URL: https://issues.apache.org/jira/browse/HUDI-6556
 Project: Apache Hudi
  Issue Type: Bug
  Components: meta-sync
Reporter: Aditya Goenka


While doing a Big Query sync for a partitioned table, it fails with the exception below:

error message: Failed to add partition key partitionpath (type: TYPE_STRING) to schema, because another column with the same name was already present. This is not allowed. Full partition schema: [partitionpath:TYPE_STRING]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9203:
URL: https://github.com/apache/hudi/pull/9203#issuecomment-1640477276

   
   ## CI report:
   
   * 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-6555) Big Query Sync failing with Class Not Found Exception for DeleteRecord

2023-07-18 Thread Aditya Goenka (Jira)
Aditya Goenka created HUDI-6555:
---

 Summary: Big Query Sync failing with Class Not Found Exception for 
DeleteRecord
 Key: HUDI-6555
 URL: https://issues.apache.org/jira/browse/HUDI-6555
 Project: Apache Hudi
  Issue Type: Bug
  Components: meta-sync
Reporter: Aditya Goenka
 Fix For: 0.14.0


With version 0.13, BQ sync is failing with the error `Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: [Lorg.apache.hudi.common.model.DeleteRecord;` during the Kryo serialization phase.
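
For context, the `[L...;` prefix in the message is the JVM's binary name for an array class, i.e. `DeleteRecord[]`. A minimal sketch (an illustration of the failure mode, not necessarily the actual fix) of registering the array class so Kryo can resolve it:

```java
import com.esotericsoftware.kryo.Kryo;

import org.apache.hudi.common.model.DeleteRecord;

public class KryoArrayRegistration {
  public static void main(String[] args) {
    Kryo kryo = new Kryo();
    // The failing class name "[Lorg.apache.hudi.common.model.DeleteRecord;"
    // is the binary name of DeleteRecord[]; the array class must be
    // resolvable in addition to the element class itself.
    kryo.register(DeleteRecord.class);
    kryo.register(DeleteRecord[].class);
  }
}
```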

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table

2023-07-18 Thread via GitHub


hudi-bot commented on PR #8856:
URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640475603

   
   ## CI report:
   
   * 2d4e285ba5ef3c5b07ec91af6ab3a2669d2b485d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17565)
 
   * 8b6fba8468a155d39a66dc57acb6ac8c5e29b294 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9203:
URL: https://github.com/apache/hudi/pull/9203#issuecomment-1640457527

   
   ## CI report:
   
   * 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] KnightChess commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table

2023-07-18 Thread via GitHub


KnightChess commented on PR #8856:
URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640444082

   @yihua @codope @nsivabalan  can you help review it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] KnightChess commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table

2023-07-18 Thread via GitHub


KnightChess commented on PR #8856:
URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640440727

   rebase master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-6544] Remove unnecessary merge for bootstrap files in merge helper (#9216)

2023-07-18 Thread codope
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new be4dfccbb24 [HUDI-6544] Remove unnecessary merge for bootstrap files 
in merge helper (#9216)
be4dfccbb24 is described below

commit be4dfccbb24794dfac3714818971229870d24a2c
Author: Jon Vexler 
AuthorDate: Tue Jul 18 11:20:57 2023 -0400

[HUDI-6544] Remove unnecessary merge for bootstrap files in merge helper 
(#9216)
---
 .../hudi/table/action/commit/HoodieMergeHelper.java   | 15 ---
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java
 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java
index 893ee3fc032..4df767b5e41 100644
--- 
a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java
+++ 
b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java
@@ -18,7 +18,6 @@
 
 package org.apache.hudi.table.action.commit;
 
-import org.apache.hudi.client.utils.ClosableMergingIterator;
 import org.apache.hudi.common.config.HoodieCommonConfig;
 import org.apache.hudi.common.model.HoodieBaseFile;
 import org.apache.hudi.common.model.HoodieRecord;
@@ -109,11 +108,6 @@ public class HoodieMergeHelper extends BaseMergeHelper {
 
 try {
   ClosableIterator recordIterator;
-
-  // In case writer's schema is simply a projection of the reader's one we 
can read
-  // the records in the projected schema directly
-  ClosableIterator baseFileRecordIterator =
-  baseFileReader.getRecordIterator(isPureProjection ? writerSchema : 
readerSchema);
   Schema recordSchema;
   if (baseFile.getBootstrapBaseFile().isPresent()) {
 Path bootstrapFilePath = new 
Path(baseFile.getBootstrapBaseFile().get().getPath());
@@ -124,13 +118,12 @@ public class HoodieMergeHelper extends BaseMergeHelper 
{
 mergeHandle.getPartitionFields(),
 mergeHandle.getPartitionValues());
 recordSchema = mergeHandle.getWriterSchemaWithMetaFields();
-recordIterator = new ClosableMergingIterator<>(
-baseFileRecordIterator,
-(ClosableIterator) 
bootstrapFileReader.getRecordIterator(recordSchema),
-(left, right) -> left.joinWith(right, recordSchema));
+recordIterator = (ClosableIterator) 
bootstrapFileReader.getRecordIterator(recordSchema);
   } else {
-recordIterator = baseFileRecordIterator;
+// In case writer's schema is simply a projection of the reader's one 
we can read
+// the records in the projected schema directly
 recordSchema = isPureProjection ? writerSchema : readerSchema;
+recordIterator = baseFileReader.getRecordIterator(recordSchema);
   }
 
   boolean isBufferingRecords = 
ExecutorFactory.isBufferingRecords(writeConfig);



[GitHub] [hudi] KnightChess commented on a diff in pull request #9212: [HUDI-6541] Multiple writers should create new and different instant time to avoid marker conflict of same instant

2023-07-18 Thread via GitHub


KnightChess commented on code in PR #9212:
URL: https://github.com/apache/hudi/pull/9212#discussion_r1266922879


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -862,11 +866,29 @@ public String startCommit(String actionType, 
HoodieTableMetaClient metaClient) {
 CleanerUtils.rollbackFailedWrites(config.getFailedWritesCleanPolicy(),
 HoodieTimeline.COMMIT_ACTION, () -> 
tableServiceClient.rollbackFailedWrites());
 
-String instantTime = HoodieActiveTimeline.createNewInstantTime();
+String instantTime = createCommit();

Review Comment:
   Agreed, the cost of the lock is too high. When we backfill historical partitions in a different job, it will take a lot of time to acquire it.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -862,11 +866,29 @@ public String startCommit(String actionType, 
HoodieTableMetaClient metaClient) {
 CleanerUtils.rollbackFailedWrites(config.getFailedWritesCleanPolicy(),
 HoodieTimeline.COMMIT_ACTION, () -> 
tableServiceClient.rollbackFailedWrites());
 
-String instantTime = HoodieActiveTimeline.createNewInstantTime();
+String instantTime = createCommit();
 startCommit(instantTime, actionType, metaClient);
 return instantTime;
   }
 
+  /**
+   * Creates a new commit time for a write operation 
(insert/update/delete/insert_overwrite/insert_overwrite_table).
+   *
+   * @return Instant time to be generated.
+   */
+  public String createCommit() {
+if 
(config.getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) {
+  try {
+lockManager.lock();
+return HoodieActiveTimeline.createNewInstantTime();

Review Comment:
   Some other table services use this method directly, so there may be similar problems there as well (a sketch of the lock-guarded pattern follows below).
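
For reference, a minimal, self-contained sketch of the lock-guarded instant creation being discussed. A `ReentrantLock` stands in for Hudi's `LockManager`, a boolean flag stands in for the write-concurrency-mode check, and the unlock that a complete version needs is made explicit with try/finally:

```java
import java.util.concurrent.locks.ReentrantLock;

public class InstantTimeUnderLock {
  // Stand-in for Hudi's LockManager; hypothetical, for illustration only.
  private static final ReentrantLock LOCK = new ReentrantLock();

  // Mirrors the createCommit() in the diff above: serialize instant-time
  // generation across writers so two jobs cannot start commits with the
  // same instant time.
  static String createCommit(boolean optimisticConcurrency) {
    if (optimisticConcurrency) {
      LOCK.lock();
      try {
        return newInstantTime();
      } finally {
        LOCK.unlock();
      }
    }
    return newInstantTime();
  }

  // Stand-in for HoodieActiveTimeline.createNewInstantTime().
  static String newInstantTime() {
    return String.valueOf(System.currentTimeMillis());
  }

  public static void main(String[] args) {
    System.out.println(createCommit(true));
  }
}
```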



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.

2023-07-18 Thread via GitHub


hudi-bot commented on PR #9203:
URL: https://github.com/apache/hudi/pull/9203#issuecomment-1640372277

   
   ## CI report:
   
   * 585935c37efc35994dd721ba2d8f05c9cf775470 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


