[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
hudi-bot commented on PR #8837: URL: https://github.com/apache/hudi/pull/8837#issuecomment-1641508668

## CI report:

* fa5c3f22ad50c6bdf4cf8fa04f51ecfba1cd8905 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18677)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] amrishlal commented on a diff in pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
amrishlal commented on code in PR #9203: URL: https://github.com/apache/hudi/pull/9203#discussion_r1267608249

## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTableWithNonRecordKeyField.scala:

```diff
@@ -22,122 +22,128 @@ import org.apache.hudi.{HoodieSparkUtils, ScalaAssertionSupport}
 class TestMergeIntoTableWithNonRecordKeyField extends HoodieSparkSqlTestBase with ScalaAssertionSupport {
   test("Test Merge into extra cond") {
-    withTempDir { tmp =>
-      val tableName = generateTableName
-      spark.sql(
-        s"""
-           |create table $tableName (
-           |  id int,
-           |  name string,
-           |  price double,
-           |  ts long
-           |) using hudi
-           | location '${tmp.getCanonicalPath}/$tableName'
-           | tblproperties (
-           |  primaryKey ='id',
-           |  preCombineField = 'ts'
-           | )
+    Seq(true, false).foreach { optimizedSqlEnabled =>
+      withTempDir { tmp =>
+        val tableName = generateTableName
+        spark.sql(
+          s"""
+             |create table $tableName (
+             |  id int,
+             |  name string,
+             |  price double,
+             |  ts long
+             |) using hudi
+             | location '${tmp.getCanonicalPath}/$tableName'
+             | tblproperties (
+             |  primaryKey ='id',
+             |  preCombineField = 'ts'
+             | )
       """.stripMargin)
-      val tableName2 = generateTableName
-      spark.sql(
-        s"""
-           |create table $tableName2 (
-           |  id int,
-           |  name string,
-           |  price double,
-           |  ts long
-           |) using hudi
-           | location '${tmp.getCanonicalPath}/$tableName2'
-           | tblproperties (
-           |  primaryKey ='id',
-           |  preCombineField = 'ts'
-           | )
+        val tableName2 = generateTableName
+        spark.sql(
+          s"""
+             |create table $tableName2 (
+             |  id int,
+             |  name string,
+             |  price double,
+             |  ts long
+             |) using hudi
+             | location '${tmp.getCanonicalPath}/$tableName2'
+             | tblproperties (
+             |  primaryKey ='id',
+             |  preCombineField = 'ts'
+             | )
       """.stripMargin)
-      spark.sql(
-        s"""
-           |insert into $tableName values
-           |(1, 'a1', 10, 100),
-           |(2, 'a2', 20, 200),
-           |(3, 'a3', 20, 100)
-           |""".stripMargin)
-      spark.sql(
-        s"""
-           |insert into $tableName2 values
-           |(1, 'u1', 10, 999),
-           |(3, 'u3', 30, ),
-           |(4, 'u4', 40, 9)
-           |""".stripMargin)
+        spark.sql(
+          s"""
+             |insert into $tableName values
+             |(1, 'a1', 10, 100),
+             |(2, 'a2', 20, 200),
+             |(3, 'a3', 20, 100)
+             |""".stripMargin)
+        spark.sql(
+          s"""
+             |insert into $tableName2 values
+             |(1, 'u1', 10, 999),
+             |(3, 'u3', 30, ),
+             |(4, 'u4', 40, 9)
+             |""".stripMargin)
-      spark.sql(
-        s"""
-           |merge into $tableName as oldData
-           |using $tableName2
-           |on oldData.id = $tableName2.id
-           |when matched and oldData.price = $tableName2.price then update set oldData.name = $tableName2.name
-           |
-           |""".stripMargin)
+        // test with optimized sql merge enabled / disabled.
+        spark.sql(s"set hoodie.spark.sql.optimized.merge.enable=$optimizedSqlEnabled")
-      checkAnswer(s"select id, name, price, ts from $tableName")(
-        Seq(1, "u1", 10.0, 100),
-        Seq(3, "a3", 20.0, 100),
-        Seq(2, "a2", 20.0, 200)
-      )
+        spark.sql(
+          s"""
+             |merge into $tableName as oldData
+             |using $tableName2
+             |on oldData.id = $tableName2.id
+             |when matched and oldData.price = $tableName2.price then update set oldData.name = $tableName2.name
+             |
+             |""".stripMargin)
-      val errorMessage = if (HoodieSparkUtils.gteqSpark3_1) {
-        "Only simple conditions of the form `t.id = s.id` using primary key or partition path " +
-          "columns are allowed on tables with primary key. (illegal column(s) used: `price`"
-      } else {
-        "Only simple conditions of the form `t.id = s.id` using primary key or partition path " +
-          "columns are allowed on tables with primary key. (illegal column(s) used: `price`;"
-      }
+        checkAnswer(s"select id, name, price, ts from $tableName")(
+          Seq(1, "u1", 10.0, 100),
+          Seq(3, "a3", 20.0, 100),
+          Seq(2, "a2", 20.0, 200)
+        )
-      checkException(
-        s"""
-           |merge into $tableName as
```
[GitHub] [hudi] amrishlal commented on a diff in pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
amrishlal commented on code in PR #9203: URL: https://github.com/apache/hudi/pull/9203#discussion_r1267608249

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:

```diff
@@ -642,11 +642,11 @@ object DataSourceWriteOptions {
   val DROP_PARTITION_COLUMNS: ConfigProperty[java.lang.Boolean] = HoodieTableConfig.DROP_PARTITION_COLUMNS

   val ENABLE_OPTIMIZED_SQL_WRITES: ConfigProperty[String] = ConfigProperty
-    .key("hoodie.spark.sql.writes.optimized.enable")
+    .key("hoodie.spark.sql.optimized.writes.enable")
```

Review Comment: Fixed.

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:

```diff
@@ -146,9 +144,7 @@ class DefaultSource extends RelationProvider
                      mode: SaveMode,
                      optParams: Map[String, String],
                      rawDf: DataFrame): BaseRelation = {
-    val df = if (optParams.getOrDefault(DATASOURCE_WRITE_PREPPED_KEY,
-      optParams.getOrDefault(SQL_MERGE_INTO_WRITES.key(), SQL_MERGE_INTO_WRITES.defaultValue().toString))
-      .equalsIgnoreCase("true")) {
+    val df = if (optParams.getOrDefault(DATASOURCE_WRITE_PREPPED_KEY, "false").toBoolean || optParams.getOrDefault(WRITE_PREPPED_MERGE_KEY, "false").toBoolean) {
```

Review Comment: Fixed.
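The simplified `DefaultSource` check above boils down to reading string-typed boolean flags out of an options map with `getOrDefault`. A minimal standalone sketch of the same pattern (the key names and class here are illustrative stand-ins, not the actual Hudi identifiers):

```java
import java.util.Map;

public class FlagCheck {
    // Hypothetical keys standing in for DATASOURCE_WRITE_PREPPED_KEY / WRITE_PREPPED_MERGE_KEY.
    static final String PREPPED_KEY = "write.prepped";
    static final String PREPPED_MERGE_KEY = "merge.into.prepped";

    /** Returns true when either flag is present and set to "true"; absent flags default to false. */
    static boolean isPreppedWrite(Map<String, String> optParams) {
        return Boolean.parseBoolean(optParams.getOrDefault(PREPPED_KEY, "false"))
            || Boolean.parseBoolean(optParams.getOrDefault(PREPPED_MERGE_KEY, "false"));
    }

    public static void main(String[] args) {
        System.out.println(isPreppedWrite(Map.of("merge.into.prepped", "true"))); // prints "true"
        System.out.println(isPreppedWrite(Map.of()));                             // prints "false"
    }
}
```

Note `Boolean.parseBoolean` is case-insensitive and never throws, which is why it pairs well with a `"false"` default for optional flags.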
[GitHub] [hudi] amrishlal commented on a diff in pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
amrishlal commented on code in PR #9203: URL: https://github.com/apache/hudi/pull/9203#discussion_r1267607758

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:

```diff
@@ -721,6 +721,8 @@ public class HoodieWriteConfig extends HoodieConfig {
       + "The class must be a subclass of `org.apache.hudi.callback.HoodieClientInitCallback`."
       + "By default, no Hudi client init callback is executed.");

+  public static final String WRITE_PREPPED_MERGE_KEY = "_hoodie.datasource.merge.into.prepped";
```

Review Comment: Fixed.
[GitHub] [hudi] hudi-bot commented on pull request #8452: [HUDI-6077] Add more partition push down filters
hudi-bot commented on PR #8452: URL: https://github.com/apache/hudi/pull/8452#issuecomment-1641498996

## CI report:

* 8082df232089396b2a9f9be2b915e51b3645f172 UNKNOWN
* 66d853918fe311dbc1d889297aab5277833b5c3b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18651) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18654) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18658) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18680)
[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
hudi-bot commented on PR #9203: URL: https://github.com/apache/hudi/pull/9203#issuecomment-1641493001

## CI report:

* 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18669)
* 539cad2f3b8edde2211bd8ddeeb3feec15cd6e94 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18679)
[GitHub] [hudi] boneanxs commented on pull request #8452: [HUDI-6077] Add more partition push down filters
boneanxs commented on PR #8452: URL: https://github.com/apache/hudi/pull/8452#issuecomment-1641477916

@hudi-bot run azure
[jira] [Created] (HUDI-6561) Ensure there is no data duplication with spark streaming writes
sivabalan narayanan created HUDI-6561:
-
Summary: Ensure there is no data duplication with spark streaming writes
Key: HUDI-6561
URL: https://issues.apache.org/jira/browse/HUDI-6561
Project: Apache Hudi
Issue Type: Improvement
Components: spark
Reporter: sivabalan narayanan

With spark-streaming writes, we can distinguish, using the batchId, a first batch from an existing batch that got resumed after a very long pause. We should guarantee idempotency by deduplicating on the batch Id.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
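The idempotency guard this ticket describes can be sketched as a sink that records the last committed batchId and skips replays. This is a hand-rolled illustration of the idea, not Hudi's actual streaming sink code; in a real system the last committed batchId would be persisted with the commit metadata rather than held in memory:

```java
public class IdempotentBatchSink {
    private long lastCommittedBatchId = -1L;

    /**
     * Applies the write only for batches newer than the last committed one.
     * Returns true if the batch was applied, false if it was a replayed (already committed) batch.
     */
    public synchronized boolean commitBatch(long batchId, Runnable write) {
        if (batchId <= lastCommittedBatchId) {
            return false; // replayed batch after a restart: skip to avoid duplicate data
        }
        write.run();
        lastCommittedBatchId = batchId; // persist durably in real systems
        return true;
    }
}
```

This works because Spark streaming replays a failed micro-batch under the same batchId, so "batchId already committed" is a reliable signal that the data landed previously.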
[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
hudi-bot commented on PR #9203: URL: https://github.com/apache/hudi/pull/9203#issuecomment-1641450952

## CI report:

* 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18669)
* 539cad2f3b8edde2211bd8ddeeb3feec15cd6e94 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9226: [HUDI-6352] take actual commit time (StateTransitionTime) into consid…
hudi-bot commented on PR #9226: URL: https://github.com/apache/hudi/pull/9226#issuecomment-1641451034

## CI report:

* c74087e82eb4bec52b33a07679d2ecbc3aba43c9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18676)
[GitHub] [hudi] codope commented on a diff in pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
codope commented on code in PR #9223: URL: https://github.com/apache/hudi/pull/9223#discussion_r1267505437

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

```diff
@@ -405,6 +405,9 @@ private boolean initializeFromFilesystem(String initializationTime, List
   convertFilesToBloomFilterRecords(HoodieEn
                                    Map> partitionToAppendedFiles,
                                    MetadataRecordsGenerationParams recordsGenerationParams,
                                    String instantTime) {
-    HoodieData allRecordsRDD = engineContext.emptyHoodieData();
-
-    List>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet()
-        .stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList());
-    int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1);
-    HoodieData>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism);
-
-    HoodieData deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> {
-      final String partitionName = partitionToDeletedFilesPair.getLeft();
-      final List deletedFileList = partitionToDeletedFilesPair.getRight();
-      return deletedFileList.stream().flatMap(deletedFile -> {
-        if (!FSUtils.isBaseFile(new Path(deletedFile))) {
-          return Stream.empty();
-        }
-
-        final String partition = getPartitionIdentifier(partitionName);
-        return Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord(
-            partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true));
-      }).iterator();
-    });
-    allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD);
+    // Total number of files which are added or deleted
+    final int totalFiles = partitionToDeletedFiles.values().stream().mapToInt(List::size).sum()
+        + partitionToAppendedFiles.values().stream().mapToInt(Map::size).sum();
+
+    // Create the tuple (partition, filename, isDeleted) to handle both deletes and appends
+    final List> partitionFileFlagTupleList = new ArrayList<>(totalFiles);
+    partitionToDeletedFiles.entrySet().stream()
+        .flatMap(entry -> entry.getValue().stream().map(deletedFile -> new Tuple3<>(entry.getKey(), deletedFile, true)))
+        .collect(Collectors.toCollection(() -> partitionFileFlagTupleList));
+    partitionToAppendedFiles.entrySet().stream()
+        .flatMap(entry -> entry.getValue().keySet().stream().map(addedFile -> new Tuple3<>(entry.getKey(), addedFile, false)))
+        .collect(Collectors.toCollection(() -> partitionFileFlagTupleList));
```

Review Comment: We can probably extract this tuple creation code to a separate method. Looks repetitive for both bloom filter and colstats.

## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:

```diff
@@ -915,65 +903,60 @@ public static HoodieData convertFilesToColumnStatsRecords(HoodieEn
                                                           Map> partitionToDeletedFiles,
                                                           Map> partitionToAppendedFiles,
                                                           MetadataRecordsGenerationParams recordsGenerationParams) {
-    HoodieData allRecordsRDD = engineContext.emptyHoodieData();
+    // Find the columns to index
     HoodieTableMetaClient dataTableMetaClient = recordsGenerationParams.getDataMetaClient();
-
     final List columnsToIndex = getColumnsToIndex(recordsGenerationParams, Lazy.lazily(() -> tryResolveSchemaForTable(dataTableMetaClient)));
-
     if (columnsToIndex.isEmpty()) {
       // In case there are no columns to index, bail
       return engineContext.emptyHoodieData();
     }
-    final List>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet().stream()
-        .map(e -> Pair.of(e.getKey(), e.getValue()))
-        .collect(Collectors.toList());
-
-    int deletedFilesTargetParallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getColumnStatsIndexParallelism()), 1);
-    final HoodieData>> partitionToDeletedFilesRDD =
-        engineContext.parallelize(partitionToDeletedFilesList, deletedFilesTargetParallelism);
-
-    HoodieData deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> {
-      final String partitionPath = partitionToDeletedFilesPair.getLeft();
-      final String partitionId = getPartitionIdentifier(partitionPath);
-      final List deletedFileList = partitionToDeletedFilesPair.getRight();
-
-      return deletedFileList.stream().flatMa
```
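The extraction codope suggests above could look roughly like this: one helper that flattens both maps into (partition, filename, isDeleted) tuples, shared by the bloom-filter and column-stats paths. This sketch uses plain JDK types and a hypothetical `FileFlag` record in place of Hudi's `Tuple3`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FileFlagTuples {
    /** Minimal stand-in for the (partition, fileName, isDeleted) tuple. */
    record FileFlag(String partition, String fileName, boolean deleted) {}

    /**
     * Flattens deleted files (partition -> file names) and appended files
     * (partition -> file name -> size) into one list, tagging each entry
     * with its delete flag.
     */
    static List<FileFlag> toFileFlagTuples(Map<String, List<String>> partitionToDeletedFiles,
                                           Map<String, Map<String, Long>> partitionToAppendedFiles) {
        List<FileFlag> tuples = new ArrayList<>();
        partitionToDeletedFiles.forEach((partition, files) ->
            files.forEach(f -> tuples.add(new FileFlag(partition, f, true))));
        partitionToAppendedFiles.forEach((partition, files) ->
            files.keySet().forEach(f -> tuples.add(new FileFlag(partition, f, false))));
        return tuples;
    }
}
```

With such a helper, both `convertFilesToBloomFilterRecords` and `convertFilesToColumnStatsRecords` could parallelize over the flattened file list instead of the partition list, which is the core of the speedup in this PR.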
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
nsivabalan commented on code in PR #9203: URL: https://github.com/apache/hudi/pull/9203#discussion_r1267546687

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:

```diff
@@ -642,11 +642,11 @@ object DataSourceWriteOptions {
   val DROP_PARTITION_COLUMNS: ConfigProperty[java.lang.Boolean] = HoodieTableConfig.DROP_PARTITION_COLUMNS

   val ENABLE_OPTIMIZED_SQL_WRITES: ConfigProperty[String] = ConfigProperty
-    .key("hoodie.spark.sql.writes.optimized.enable")
+    .key("hoodie.spark.sql.optimized.writes.enable")
```

Review Comment: Generally we try to align the var naming to the key. Lets name the variable SPARK_SQL_OPTIMIZED_WRITES.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:

```diff
@@ -721,6 +721,8 @@ public class HoodieWriteConfig extends HoodieConfig {
       + "The class must be a subclass of `org.apache.hudi.callback.HoodieClientInitCallback`."
       + "By default, no Hudi client init callback is executed.");

+  public static final String WRITE_PREPPED_MERGE_KEY = "_hoodie.datasource.merge.into.prepped";
```

Review Comment: Can we add java docs calling out the purpose of this?

## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTableWithNonRecordKeyField.scala:

```diff
@@ -22,122 +22,128 @@ import org.apache.hudi.{HoodieSparkUtils, ScalaAssertionSupport}
 class TestMergeIntoTableWithNonRecordKeyField extends HoodieSparkSqlTestBase with ScalaAssertionSupport {
   test("Test Merge into extra cond") {
-    withTempDir { tmp =>
-      val tableName = generateTableName
-      spark.sql(
-        s"""
-           |create table $tableName (
-           |  id int,
-           |  name string,
-           |  price double,
-           |  ts long
-           |) using hudi
-           | location '${tmp.getCanonicalPath}/$tableName'
-           | tblproperties (
-           |  primaryKey ='id',
-           |  preCombineField = 'ts'
-           | )
+    Seq(true, false).foreach { optimizedSqlEnabled =>
+      withTempDir { tmp =>
+        val tableName = generateTableName
+        spark.sql(
+          s"""
+             |create table $tableName (
+             |  id int,
+             |  name string,
+             |  price double,
+             |  ts long
+             |) using hudi
+             | location '${tmp.getCanonicalPath}/$tableName'
+             | tblproperties (
+             |  primaryKey ='id',
+             |  preCombineField = 'ts'
+             | )
       """.stripMargin)
-      val tableName2 = generateTableName
-      spark.sql(
-        s"""
-           |create table $tableName2 (
-           |  id int,
-           |  name string,
-           |  price double,
-           |  ts long
-           |) using hudi
-           | location '${tmp.getCanonicalPath}/$tableName2'
-           | tblproperties (
-           |  primaryKey ='id',
-           |  preCombineField = 'ts'
-           | )
+        val tableName2 = generateTableName
+        spark.sql(
+          s"""
+             |create table $tableName2 (
+             |  id int,
+             |  name string,
+             |  price double,
+             |  ts long
+             |) using hudi
+             | location '${tmp.getCanonicalPath}/$tableName2'
+             | tblproperties (
+             |  primaryKey ='id',
+             |  preCombineField = 'ts'
+             | )
       """.stripMargin)
-      spark.sql(
-        s"""
-           |insert into $tableName values
-           |(1, 'a1', 10, 100),
-           |(2, 'a2', 20, 200),
-           |(3, 'a3', 20, 100)
-           |""".stripMargin)
-      spark.sql(
-        s"""
-           |insert into $tableName2 values
-           |(1, 'u1', 10, 999),
-           |(3, 'u3', 30, ),
-           |(4, 'u4', 40, 9)
-           |""".stripMargin)
+        spark.sql(
+          s"""
+             |insert into $tableName values
+             |(1, 'a1', 10, 100),
+             |(2, 'a2', 20, 200),
+             |(3, 'a3', 20, 100)
+             |""".stripMargin)
+        spark.sql(
+          s"""
+             |insert into $tableName2 values
+             |(1, 'u1', 10, 999),
+             |(3, 'u3', 30, ),
+             |(4, 'u4', 40, 9)
+             |""".stripMargin)
-      spark.sql(
-        s"""
-           |merge into $tableName as oldData
-           |using $tableName2
-           |on oldData.id = $tableName2.id
-           |when matched and oldData.price = $tableName2.price then update set oldData.name = $tableName2.name
-           |
-           |""".stripMargin)
+        // test with optimized sql merge enabled / disabled.
+        spark.sql(s"set hoodie.spark.sql.optimized.merge.enable=$optimizedSqlEnabled")
-      checkAnswer(s"select id, name, price, ts from $tableName")(
-        Seq(1, "u1", 10.0, 100),
-        Seq(3, "a3", 20.0, 100),
-
```
[GitHub] [hudi] amrishlal commented on a diff in pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
amrishlal commented on code in PR #9203: URL: https://github.com/apache/hudi/pull/9203#discussion_r1267531685

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:

```diff
@@ -721,6 +721,8 @@ public class HoodieWriteConfig extends HoodieConfig {
       + "The class must be a subclass of `org.apache.hudi.callback.HoodieClientInitCallback`."
       + "By default, no Hudi client init callback is executed.");

+  public static final String WRITE_PREPPED_MERGE_KEY = "_hoodie.datasource.merge.prepped";
+
```

Review Comment: Fixed.

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:

```diff
@@ -642,11 +642,18 @@ object DataSourceWriteOptions {
   val DROP_PARTITION_COLUMNS: ConfigProperty[java.lang.Boolean] = HoodieTableConfig.DROP_PARTITION_COLUMNS

   val ENABLE_OPTIMIZED_SQL_WRITES: ConfigProperty[String] = ConfigProperty
-    .key("hoodie.spark.sql.writes.optimized.enable")
+    .key("hoodie.spark.sql.optimized.writes.enable")
     .defaultValue("true")
     .markAdvanced()
     .sinceVersion("0.14.0")
-    .withDocumentation("Controls whether spark sql optimized update is enabled.")
+    .withDocumentation("Controls whether spark sql prepped update and delete is enabled.")
+
+  val ENABLE_OPTIMIZED_SQL_MERGE_WRITES: ConfigProperty[String] = ConfigProperty
```

Review Comment: Fixed.
[GitHub] [hudi] hudi-bot commented on pull request #9227: [HUDI-6560] Avoid to read instant details 2 times for archiving
hudi-bot commented on PR #9227: URL: https://github.com/apache/hudi/pull/9227#issuecomment-1641371506

## CI report:

* 1c756c1c634bb1db2bdcbd7eca3a045f4ea99a5b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18678)
[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
hudi-bot commented on PR #8837: URL: https://github.com/apache/hudi/pull/8837#issuecomment-1641371072

## CI report:

* 2f9aa542076faa188839bc55b43dd7f22ec32b62 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18580)
* fa5c3f22ad50c6bdf4cf8fa04f51ecfba1cd8905 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18677)
[GitHub] [hudi] hudi-bot commented on pull request #9227: [HUDI-6560] Avoid to read instant details 2 times for archiving
hudi-bot commented on PR #9227: URL: https://github.com/apache/hudi/pull/9227#issuecomment-1641367288

## CI report:

* 1c756c1c634bb1db2bdcbd7eca3a045f4ea99a5b UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
hudi-bot commented on PR #8837: URL: https://github.com/apache/hudi/pull/8837#issuecomment-1641366845

## CI report:

* 2f9aa542076faa188839bc55b43dd7f22ec32b62 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18580)
* fa5c3f22ad50c6bdf4cf8fa04f51ecfba1cd8905 UNKNOWN
[jira] [Updated] (HUDI-6560) Avoid to read instant details 2 times for archiving
[ https://issues.apache.org/jira/browse/HUDI-6560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6560:
-
Labels: pull-request-available (was: )

> Avoid to read instant details 2 times for archiving
> ---------------------------------------------------
>
>           Key: HUDI-6560
>           URL: https://issues.apache.org/jira/browse/HUDI-6560
>       Project: Apache Hudi
>    Issue Type: Improvement
>    Components: writer-core
>      Reporter: Danny Chen
>      Priority: Major
>        Labels: pull-request-available
>       Fix For: 0.14.0
>
[GitHub] [hudi] danny0405 opened a new pull request, #9227: [HUDI-6560] Avoid to read instant details 2 times for archiving
danny0405 opened a new pull request, #9227: URL: https://github.com/apache/hudi/pull/9227

### Change Logs

1. Only load the instant details once for each instant.
2. Do not store the plan for inflight instants, such as compaction, log_compaction, clustering, etc.

### Impact

none

### Risk level (write none, low medium or high below)

none

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
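The first change-log item (load instant details only once) is essentially per-instant memoization. A generic sketch of the pattern, with hypothetical names rather than the actual archiver code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class InstantDetailsCache {
    private final Map<String, byte[]> cache = new HashMap<>();
    private final Function<String, byte[]> loader; // e.g. reads the instant file from storage
    int loads = 0; // exposed for the demo: counts actual storage reads

    InstantDetailsCache(Function<String, byte[]> loader) {
        this.loader = loader;
    }

    /** Returns the instant details, hitting storage at most once per instant. */
    byte[] getDetails(String instant) {
        return cache.computeIfAbsent(instant, i -> {
            loads++;
            return loader.apply(i);
        });
    }
}
```

For archiving, the payoff is that a timeline read (often a file on cloud storage) is not repeated when the same instant's details are needed both for building the archive entry and for cleaning up.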
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
nsivabalan commented on code in PR #9203: URL: https://github.com/apache/hudi/pull/9203#discussion_r1267481558

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:

```diff
@@ -721,6 +721,8 @@ public class HoodieWriteConfig extends HoodieConfig {
       + "The class must be a subclass of `org.apache.hudi.callback.HoodieClientInitCallback`."
       + "By default, no Hudi client init callback is executed.");

+  public static final String WRITE_PREPPED_MERGE_KEY = "_hoodie.datasource.merge.prepped";
+
```

Review Comment: I am also thinking, from a user standpoint, we should have just 1 config to enable or disable the optimized flow (irrespective of whether it's MIT or updates or deletes). But internally we can use diff configs if we wish to differentiate MIT and the rest.

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:

```diff
@@ -642,11 +642,18 @@ object DataSourceWriteOptions {
   val DROP_PARTITION_COLUMNS: ConfigProperty[java.lang.Boolean] = HoodieTableConfig.DROP_PARTITION_COLUMNS

   val ENABLE_OPTIMIZED_SQL_WRITES: ConfigProperty[String] = ConfigProperty
-    .key("hoodie.spark.sql.writes.optimized.enable")
+    .key("hoodie.spark.sql.optimized.writes.enable")
     .defaultValue("true")
     .markAdvanced()
     .sinceVersion("0.14.0")
-    .withDocumentation("Controls whether spark sql optimized update is enabled.")
+    .withDocumentation("Controls whether spark sql prepped update and delete is enabled.")
+
+  val ENABLE_OPTIMIZED_SQL_MERGE_WRITES: ConfigProperty[String] = ConfigProperty
```

Review Comment: I am also thinking, from a user standpoint, we should have just 1 config to enable or disable the optimized flow (irrespective of whether it's MIT or updates or deletes). But internally we can use diff configs if we wish to differentiate MIT and the rest.
[GitHub] [hudi] hudi-bot commented on pull request #9226: [HUDI-6352] take actual commit time (StateTransitionTime) into consid…
hudi-bot commented on PR #9226: URL: https://github.com/apache/hudi/pull/9226#issuecomment-1641303819

## CI report:

* c74087e82eb4bec52b33a07679d2ecbc3aba43c9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18676)
[GitHub] [hudi] KnightChess commented on pull request #8856: [HUDI-6300] Fix file size parallelism not work when init metadata table
KnightChess commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1641296291

@yihua thanks review
[GitHub] [hudi] hudi-bot commented on pull request #9226: [HUDI-6352] take actual commit time (StateTransitionTime) into consid…
hudi-bot commented on PR #9226: URL: https://github.com/apache/hudi/pull/9226#issuecomment-1641295713 ## CI report: * c74087e82eb4bec52b33a07679d2ecbc3aba43c9 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] KnightChess commented on pull request #8856: [HUDI-6300] Fix file size parallelism not work when init metadata table
KnightChess commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1641295662 > @KnightChess do you have any number on the performance improvement on updating MDT from this PR? The parallelism is computed as: ```java parallelism = Math.max(Math.min(partitionToAppendedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); ``` @yihua in this picture, the table has more than 5000 files in one partition. Before the fix, the bloom filter and column stats parallelism was 1; now the bloom filter parallelism is 200 and the column stats parallelism is 10, which come from the default values. ![image](https://github.com/apache/hudi/assets/20125927/ff4cb9e4-d595-4294-83e7-cb42c73c40ff)
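The clamping behavior of the quoted formula can be sketched standalone; the numbers below match the ones mentioned in the comment (~5000 files, bloom filter default parallelism 200). This is an illustrative sketch, not the actual Hudi code path.

```java
public class ParallelismClamp {
  // Same clamping as the snippet quoted above: bounded by the number of
  // work units, capped by the configured parallelism, and never below 1.
  static int clamp(int numWorkUnits, int configuredParallelism) {
    return Math.max(Math.min(numWorkUnits, configuredParallelism), 1);
  }

  public static void main(String[] args) {
    System.out.println(clamp(5000, 200)); // 200: >5000 files, default bloom parallelism 200
    System.out.println(clamp(5000, 10));  // 10: a lower configured parallelism wins
    System.out.println(clamp(0, 200));    // 1: never drops below one task
  }
}
```

Before the fix, the list had one element per partition rather than per file, so `numWorkUnits` was 1 for a single-partition table and the whole job ran with parallelism 1.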
[jira] [Created] (HUDI-6560) Avoid to read instant details 2 times for archiving
Danny Chen created HUDI-6560: Summary: Avoid to read instant details 2 times for archiving Key: HUDI-6560 URL: https://issues.apache.org/jira/browse/HUDI-6560 Project: Apache Hudi Issue Type: Improvement Components: writer-core Reporter: Danny Chen Fix For: 0.14.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6352) KEEP_LATEST_BY_HOURS should consider modified time instead of commit time while setting earliestCommitToRetain value
[ https://issues.apache.org/jira/browse/HUDI-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6352: - Labels: pull-request-available (was: ) > KEEP_LATEST_BY_HOURS should consider modified time instead of commit time > while setting earliestCommitToRetain value > > Key: HUDI-6352 > URL: https://issues.apache.org/jira/browse/HUDI-6352 > Project: Apache Hudi > Issue Type: Bug > Reporter: Surya Prasanna Yalla > Priority: Major > Labels: pull-request-available > > In CleanPlanner, KEEP_LATEST_BY_HOURS sets the earliestCommitToRetain value > by considering the commit timestamp directly. This introduces a bug when there > are out-of-order commits, where a commit with a lower timestamp completes much > later than commits with higher timestamps. > This policy's implementation needs to be revisited. > It should store the timestamp up to which it has cleaned; call this t1. The > next cleaner instant should consider all the partitions and files that were > modified between t1 and (current time - x) hours. Whichever files are no > longer valid should be removed.
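A minimal sketch of the revised policy described in the ticket: track the point up to which the last clean ran (t1), then consider files whose modification time falls between t1 and (now - x hours). The types and method names here are illustrative, not Hudi's actual CleanPlanner API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class KeepLatestByHoursSketch {
  // Minimal file metadata; illustrative, not the actual Hudi types.
  static class FileMeta {
    final String path;
    final Instant modifiedTime;
    FileMeta(String path, Instant modifiedTime) { this.path = path; this.modifiedTime = modifiedTime; }
  }

  // Files modified in [t1, now - retentionHours) are candidates for cleaning;
  // anything inside the retention window is kept regardless of commit order.
  static List<String> filesToClean(List<FileMeta> files, Instant t1, int retentionHours, Instant now) {
    Instant windowStart = now.minus(Duration.ofHours(retentionHours));
    return files.stream()
        .filter(f -> !f.modifiedTime.isBefore(t1))          // modified since the last clean point
        .filter(f -> f.modifiedTime.isBefore(windowStart))  // older than the retention window
        .map(f -> f.path)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Instant now = Instant.parse("2023-07-19T12:00:00Z");
    Instant t1 = Instant.parse("2023-07-18T00:00:00Z");
    List<FileMeta> files = Arrays.asList(
        new FileMeta("f1", Instant.parse("2023-07-18T06:00:00Z")),  // outside 24h window: cleaned
        new FileMeta("f2", Instant.parse("2023-07-19T11:00:00Z"))); // within 24h window: retained
    System.out.println(filesToClean(files, t1, 24, now)); // [f1]
  }
}
```

Keying on modification time rather than commit timestamps sidesteps the out-of-order-commit bug the ticket describes: a late-finishing commit with a low timestamp is judged by when its files actually appeared.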
[hudi] branch master updated: [HUDI-6300] Fix file size parallelism not work when init metadata table (#8856)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new bce55f0c165 [HUDI-6300] Fix file size parallelism not work when init metadata table (#8856) bce55f0c165 is described below commit bce55f0c1651949a1dfddaaf343d62cf76574063 Author: KnightChess <981159...@qq.com> AuthorDate: Wed Jul 19 10:26:13 2023 +0800 [HUDI-6300] Fix file size parallelism not work when init metadata table (#8856) Co-authored-by: Y Ethan Guo --- .../hudi/metadata/HoodieTableMetadataUtil.java | 140 ++--- 1 file changed, 66 insertions(+), 74 deletions(-) diff --git a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java index cd87f6ff59c..56f478e781c 100644 --- a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java +++ b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java @@ -86,6 +86,7 @@ import java.util.HashSet; import java.util.LinkedList; import java.util.List; import java.util.Map; +import java.util.Objects; import java.util.Set; import java.util.function.BiFunction; import java.util.function.Function; @@ -850,59 +851,56 @@ public class HoodieTableMetadataUtil { String instantTime) { HoodieData allRecordsRDD = engineContext.emptyHoodieData(); -List>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet() -.stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList()); -int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); -HoodieData>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism); +List> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet().stream().flatMap(entry -> { + return 
entry.getValue().stream().map(file -> Pair.of(entry.getKey(), file)); +}).collect(Collectors.toList()); -HoodieData deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> { - final String partitionName = partitionToDeletedFilesPair.getLeft(); - final List deletedFileList = partitionToDeletedFilesPair.getRight(); - return deletedFileList.stream().flatMap(deletedFile -> { -if (!FSUtils.isBaseFile(new Path(deletedFile))) { - return Stream.empty(); -} +int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); +HoodieData> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism); -final String partition = getPartitionIdentifier(partitionName); -return Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord( -partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true)); - }).iterator(); -}); +HoodieData deletedFilesRecordsRDD = partitionToDeletedFilesRDD.map(partitionToDeletedFilePair -> { + String partitionName = partitionToDeletedFilePair.getLeft(); + String deletedFile = partitionToDeletedFilePair.getRight(); + if (!FSUtils.isBaseFile(new Path(deletedFile))) { +return null; + } + final String partition = getPartitionIdentifier(partitionName); + return (HoodieRecord) (HoodieMetadataPayload.createBloomFilterMetadataRecord( + partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true)); +}).filter(Objects::nonNull); allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD); -List>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet() -.stream().map(entry -> Pair.of(entry.getKey(), entry.getValue())).collect(Collectors.toList()); +List> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet().stream().flatMap(entry -> { + return entry.getValue().keySet().stream().map(file -> Pair.of(entry.getKey(), file)); 
+}).collect(Collectors.toList()); + parallelism = Math.max(Math.min(partitionToAppendedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); -HoodieData>> partitionToAppendedFilesRDD = engineContext.parallelize(partitionToAppendedFilesList, parallelism); +HoodieData> partitionToAppendedFilesRDD = engineContext.parallelize(partitionToAppendedFilesList, parallelism); -HoodieData appendedFilesRecordsRDD = partitionToAppendedFilesRDD.flatMap(partitionToAppendedFilesPair -> { - final String partitionName = partitionToAppendedFilesPair.getLeft(); - final Map appendedFileMap = partitionToAppendedFiles
[GitHub] [hudi] hbgstc123 opened a new pull request, #9226: [HUDI-6352] take actual commit time (StateTransitionTime) into consid…
hbgstc123 opened a new pull request, #9226: URL: https://github.com/apache/hudi/pull/9226 …eration when getting the oldest instant to retain for clustering from archival. According to the current logic of `ClusteringUtils#getOldestInstantToRetainForClustering`, if the timeline of a hoodie table is `replace1 commit2 clean3`, the earliestInstantToRetain of clean3 is commit2, so replace1 is considered ready for archival no matter when it is completed. But if replace1 is completed after clean3, the replaced files in replace1 have not been cleaned, so it should not be archived. This PR fixes such cases. ### Change Logs Add logic to `ClusteringUtils#getOldestInstantToRetainForClustering` to make sure a replace commit is not archived if its actual completion time is later than the actual completion time of the latest completed clean instant. ### Impact none ### Risk level (write none, low medium or high below) low ### Documentation Update none ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
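The guard described in the change log can be sketched as a single comparison of actual completion times. This is illustrative, not the actual `ClusteringUtils` API; epoch-millis longs stand in for the instants' StateTransitionTime.

```java
public class ClusteringArchivalGuard {
  // A replace (clustering) instant is safe to archive only if the latest
  // completed clean finished after it, i.e. the clean had a chance to remove
  // the files it replaced.
  static boolean safeToArchive(long replaceCompletedAtMs, long latestCleanCompletedAtMs) {
    return replaceCompletedAtMs <= latestCleanCompletedAtMs;
  }

  public static void main(String[] args) {
    // replace1 completed AFTER clean3 finished: its replaced files were not
    // cleaned yet, so it must be retained.
    System.out.println(safeToArchive(3_000L, 2_000L)); // false
    // replace1 completed before clean3 ran: safe to archive.
    System.out.println(safeToArchive(1_000L, 2_000L)); // true
  }
}
```

The key point of the PR is that this comparison uses the instants' actual (wall-clock) completion times, not their instant timestamps, which is exactly what breaks in the `replace1 commit2 clean3` example when replace1 finishes late.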
[GitHub] [hudi] yihua merged pull request #8856: [HUDI-6300] Fix file size parallelism not work when init metadata table
yihua merged PR #8856: URL: https://github.com/apache/hudi/pull/8856
[jira] [Created] (HUDI-6559) Add Sharing Group for Compaction
Bo Cui created HUDI-6559: Summary: Add Sharing Group for Compaction Key: HUDI-6559 URL: https://issues.apache.org/jira/browse/HUDI-6559 Project: Apache Hudi Issue Type: Improvement Components: flink Reporter: Bo Cui If compaction is enabled, compaction shares resources with the write operator. When compaction is under heavy pressure, the performance of the write operator is affected.
[GitHub] [hudi] hudi-bot commented on pull request #9224: seems to be working
hudi-bot commented on PR #9224: URL: https://github.com/apache/hudi/pull/9224#issuecomment-1641197813 ## CI report: * 558ee6903fe1985b41ad70205bf648a2b464fc38 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18674) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] vijayasarathib opened a new pull request, #9225: Documentation change to Increase readability for basic_configurations
vijayasarathib opened a new pull request, #9225: URL: https://github.com/apache/hudi/pull/9225 Update basic_configurations.md ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] psendyk commented on issue #8890: [SUPPORT] Spark structured streaming ingestion into Hudi fails after an upgrade to 0.12.2
psendyk commented on issue #8890: URL: https://github.com/apache/hudi/issues/8890#issuecomment-1641148436 I tested it again using the options @zyclove posted above and the job still fails with the same error. Also, this time I tested it on a fresh table to make sure there were no issues with our production table. I ingested ~1B records from Kafka to a new S3 location, written to ~18k partitions. So it should be reproducible, let me know if you need any additional details.
[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table
hudi-bot commented on PR #8847: URL: https://github.com/apache/hudi/pull/8847#issuecomment-1641122278 ## CI report: * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN * 1f8c2e4cb0da6d322b9f03657463b406f189350a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18673) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
CTTY commented on code in PR #9136: URL: https://github.com/apache/hudi/pull/9136#discussion_r1267380077 ## pom.xml: ## @@ -2614,6 +2614,18 @@ + + java17 + +-Xmx2g --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djol.magicFieldOffset=true Review Comment: That's a good question. I think these would be needed at runtime until we can confirm Hudi doesn't use any public Java 8 APIs that were later made private. But I'm not sure how we can confirm that without compiling Hudi with Java 17. Maybe we can try removing some of them to see if tests fail?
[GitHub] [hudi] hudi-bot commented on pull request #9224: seems to be working
hudi-bot commented on PR #9224: URL: https://github.com/apache/hudi/pull/9224#issuecomment-1641084419 ## CI report: * 74d2ddcf295168b82be4a26e383c8e7495487107 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18671) * 558ee6903fe1985b41ad70205bf648a2b464fc38 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18674) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
hudi-bot commented on PR #9136: URL: https://github.com/apache/hudi/pull/9136#issuecomment-1641084226 ## CI report: * a0e7207fb19738237d56fa0060c91cb7865ae9c0 UNKNOWN * cda1e7724e6267ec471d8c318cd22703a2ecb69f UNKNOWN * 0909e9991595a5f6c48181ff8db82a6dbebc49b8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18672) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9224: seems to be working
hudi-bot commented on PR #9224: URL: https://github.com/apache/hudi/pull/9224#issuecomment-1641078598 ## CI report: * 74d2ddcf295168b82be4a26e383c8e7495487107 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18671) * 558ee6903fe1985b41ad70205bf648a2b464fc38 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
CTTY commented on code in PR #9136: URL: https://github.com/apache/hudi/pull/9136#discussion_r1267318336 ## pom.xml: ## @@ -156,7 +156,7 @@ flink-clients flink-connector-kafka flink-hadoop-compatibility_2.12 -5.17.2 +7.5.3 Review Comment: Right, RocksDB `5.17.2` would throw `NoClassDefFoundError` when running `TestHoodieLogFormat`: ``` [ERROR] testBasicAppendAndScanMultipleFiles{DiskMapType, boolean, boolean, boolean}[10] Time elapsed: 0.118 s <<< ERROR! 2023-07-13T23:41:36.1420947Z java.lang.NoClassDefFoundError: Could not initialize class org.rocksdb.DBOptions ```
[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
CTTY commented on code in PR #9136: URL: https://github.com/apache/hudi/pull/9136#discussion_r1267359024 ## packaging/bundle-validation/ci_run.sh: ## @@ -110,95 +112,116 @@ fi TMP_JARS_DIR=/tmp/jars/$(date +%s) mkdir -p $TMP_JARS_DIR -if [[ "$HUDI_VERSION" == *"SNAPSHOT" ]]; then - cp ${GITHUB_WORKSPACE}/packaging/hudi-flink-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ - cp ${GITHUB_WORKSPACE}/packaging/hudi-hadoop-mr-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ - cp ${GITHUB_WORKSPACE}/packaging/hudi-kafka-connect-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ - cp ${GITHUB_WORKSPACE}/packaging/hudi-spark-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ - cp ${GITHUB_WORKSPACE}/packaging/hudi-utilities-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ - cp ${GITHUB_WORKSPACE}/packaging/hudi-utilities-slim-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ - cp ${GITHUB_WORKSPACE}/packaging/hudi-metaserver-server-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ - echo 'Validating jars below:' -else - echo 'Adding environment variables for bundles in the release candidate' - - HUDI_HADOOP_MR_BUNDLE_NAME=hudi-hadoop-mr-bundle - HUDI_KAFKA_CONNECT_BUNDLE_NAME=hudi-kafka-connect-bundle - HUDI_METASERVER_SERVER_BUNDLE_NAME=hudi-metaserver-server-bundle - - if [[ ${SPARK_PROFILE} == 'spark' ]]; then -HUDI_SPARK_BUNDLE_NAME=hudi-spark-bundle_2.11 -HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.11 -HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.11 - elif [[ ${SPARK_PROFILE} == 'spark2.4' ]]; then -HUDI_SPARK_BUNDLE_NAME=hudi-spark2.4-bundle_2.11 -HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.11 -HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.11 - elif [[ ${SPARK_PROFILE} == 'spark3.1' ]]; then -HUDI_SPARK_BUNDLE_NAME=hudi-spark3.1-bundle_2.12 -HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.12 -HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.12 - elif [[ ${SPARK_PROFILE} == 
'spark3.2' ]]; then -HUDI_SPARK_BUNDLE_NAME=hudi-spark3.2-bundle_2.12 -HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.12 -HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.12 - elif [[ ${SPARK_PROFILE} == 'spark3.3' ]]; then -HUDI_SPARK_BUNDLE_NAME=hudi-spark3.3-bundle_2.12 -HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.12 -HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.12 - elif [[ ${SPARK_PROFILE} == 'spark3' ]]; then -HUDI_SPARK_BUNDLE_NAME=hudi-spark3-bundle_2.12 -HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.12 -HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.12 - fi +if [[ -z "$MODE" ]] || [[ "$MODE" != "java17" ]]; then + if [[ "$HUDI_VERSION" == *"SNAPSHOT" ]]; then +cp ${GITHUB_WORKSPACE}/packaging/hudi-flink-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ +cp ${GITHUB_WORKSPACE}/packaging/hudi-hadoop-mr-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ +cp ${GITHUB_WORKSPACE}/packaging/hudi-kafka-connect-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ +cp ${GITHUB_WORKSPACE}/packaging/hudi-spark-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ +cp ${GITHUB_WORKSPACE}/packaging/hudi-utilities-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ +cp ${GITHUB_WORKSPACE}/packaging/hudi-utilities-slim-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ +cp ${GITHUB_WORKSPACE}/packaging/hudi-metaserver-server-bundle/target/hudi-*-$HUDI_VERSION.jar $TMP_JARS_DIR/ +echo 'Validating jars below:' + else +echo 'Adding environment variables for bundles in the release candidate' + +HUDI_HADOOP_MR_BUNDLE_NAME=hudi-hadoop-mr-bundle +HUDI_KAFKA_CONNECT_BUNDLE_NAME=hudi-kafka-connect-bundle +HUDI_METASERVER_SERVER_BUNDLE_NAME=hudi-metaserver-server-bundle + +if [[ ${SPARK_PROFILE} == 'spark' ]]; then + HUDI_SPARK_BUNDLE_NAME=hudi-spark-bundle_2.11 + HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.11 + HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.11 +elif [[ ${SPARK_PROFILE} == 
'spark2.4' ]]; then + HUDI_SPARK_BUNDLE_NAME=hudi-spark2.4-bundle_2.11 + HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.11 + HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.11 +elif [[ ${SPARK_PROFILE} == 'spark3.1' ]]; then + HUDI_SPARK_BUNDLE_NAME=hudi-spark3.1-bundle_2.12 + HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.12 + HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.12 +elif [[ ${SPARK_PROFILE} == 'spark3.2' ]]; then + HUDI_SPARK_BUNDLE_NAME=hudi-spark3.2-bundle_2.12 + HUDI_UTILITIES_BUNDLE_NAME=hudi-utilities-bundle_2.12 + HUDI_UTILITIES_SLIM_BUNDLE_NAME=hudi-utilities-slim-bundle_2.12 +elif [[ ${SPARK_PROFILE} == 'spark3.3' ]]; then + HUDI_SPARK_BUNDLE_NAME=hudi-spark3.3-bund
[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
CTTY commented on code in PR #9136: URL: https://github.com/apache/hudi/pull/9136#discussion_r1267358058 ## .github/workflows/bot.yml: ## @@ -112,6 +112,91 @@ jobs: run: mvn test -Pfunctional-tests -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS + test-spark-java17: +runs-on: ubuntu-latest +strategy: + matrix: +include: + - scalaProfile: "scala-2.12" +sparkProfile: "spark3.3" +sparkModules: "hudi-spark-datasource/hudi-spark3.3.x" + - scalaProfile: "scala-2.12" +sparkProfile: "spark3.4" +sparkModules: "hudi-spark-datasource/hudi-spark3.4.x" + +steps: + - uses: actions/checkout@v3 + - name: Set up JDK 8 +uses: actions/setup-java@v3 +with: + java-version: '8' + distribution: 'adopt' + architecture: x64 + - name: Build Project +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} +run: + mvn clean install -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -DskipTests=true $MVN_ARGS + - name: Set up JDK 17 +uses: actions/setup-java@v3 +with: + java-version: '17' + distribution: 'adopt' + architecture: x64 + - name: Quickstart Test +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} +run: + mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl hudi-examples/hudi-examples-spark $MVN_ARGS + - name: UT - Common & Spark +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} + SPARK_MODULES: ${{ matrix.sparkModules }} +if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 as it's covered by Azure CI +run: + mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS + - name: FT - Spark +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} + SPARK_MODULES: ${{ matrix.sparkModules }} +if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 as it's covered by Azure CI 
+run: + mvn test -Pfunctional-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS + + docker-test-java17: Review Comment: Those tests are generally the same as bundle validation, but there is one difference: it requires building Hudi within Docker, since it runs tests with the `mvn test` command, while bundle validation builds Hudi outside of Docker and only copies jars/bundles into Docker. If we consolidate them into one job, it would need to build twice, which would make the job much slower. If we keep them as two separate jobs, the docker test only has to build the `hudi-common` modules in Docker, which is relatively fast, and bundle validation can keep the same behavior.
[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
CTTY commented on code in PR #9136: URL: https://github.com/apache/hudi/pull/9136#discussion_r1267351602 ## hudi-common/pom.xml: ## @@ -248,6 +248,13 @@ + + org.apache.spark + spark-streaming-kafka-0-10_${scala.binary.version} + test + ${spark.version} + Review Comment: I can't remember exactly, but I think there were some issues when this was removed. Will need to double-check.
[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
CTTY commented on code in PR #9136: URL: https://github.com/apache/hudi/pull/9136#discussion_r1267352893 ## .github/workflows/bot.yml: ## @@ -112,6 +112,91 @@ jobs: run: mvn test -Pfunctional-tests -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS + test-spark-java17: +runs-on: ubuntu-latest +strategy: + matrix: +include: + - scalaProfile: "scala-2.12" +sparkProfile: "spark3.3" +sparkModules: "hudi-spark-datasource/hudi-spark3.3.x" + - scalaProfile: "scala-2.12" +sparkProfile: "spark3.4" +sparkModules: "hudi-spark-datasource/hudi-spark3.4.x" + +steps: + - uses: actions/checkout@v3 + - name: Set up JDK 8 +uses: actions/setup-java@v3 +with: + java-version: '8' + distribution: 'adopt' + architecture: x64 + - name: Build Project +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} +run: + mvn clean install -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -DskipTests=true $MVN_ARGS + - name: Set up JDK 17 +uses: actions/setup-java@v3 +with: + java-version: '17' + distribution: 'adopt' + architecture: x64 + - name: Quickstart Test +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} +run: + mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl hudi-examples/hudi-examples-spark $MVN_ARGS + - name: UT - Common & Spark +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} + SPARK_MODULES: ${{ matrix.sparkModules }} +if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 as it's covered by Azure CI +run: + mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS + - name: FT - Spark +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} + SPARK_MODULES: ${{ matrix.sparkModules }} +if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 as it's covered by Azure CI 
+run: + mvn test -Pfunctional-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS + + docker-test-java17: +runs-on: ubuntu-latest +strategy: + matrix: +include: + - flinkProfile: 'flink1.17' +sparkProfile: 'spark3.4' +sparkRuntime: 'spark3.4.0' Review Comment: Existing bundle validation still uses Spark 3.4.0. I guess we can bump it, but should we do that in a separate PR?
[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
CTTY commented on code in PR #9136: URL: https://github.com/apache/hudi/pull/9136#discussion_r1267351602 ## hudi-common/pom.xml: ## @@ -248,6 +248,13 @@ + + org.apache.spark + spark-streaming-kafka-0-10_${scala.binary.version} + test + ${spark.version} + Review Comment: I can't remember offhand, but there were some issues when this was removed. Will need to double-check. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
CTTY commented on code in PR #9136: URL: https://github.com/apache/hudi/pull/9136#discussion_r1267350100 ## hudi-common/src/test/java/org/apache/hudi/avro/TestHoodieAvroUtils.java: ## @@ -450,10 +450,8 @@ public void testGenerateProjectionSchema() { assertTrue(fieldNames1.contains("_row_key")); assertTrue(fieldNames1.contains("timestamp")); -assertEquals("Field fake_field not found in log schema. Query cannot proceed! Derived Schema Fields: " -+ "[non_pii_col, _hoodie_commit_time, _row_key, _hoodie_partition_path, _hoodie_record_key, pii_col," -+ " _hoodie_commit_seqno, _hoodie_file_name, timestamp]", -assertThrows(HoodieException.class, () -> -HoodieAvroUtils.generateProjectionSchema(originalSchema, Arrays.asList("_row_key", "timestamp", "fake_field"))).getMessage()); +assertTrue(assertThrows(HoodieException.class, () -> +HoodieAvroUtils.generateProjectionSchema(originalSchema, Arrays.asList("_row_key", "timestamp", "fake_field"))) +.getMessage().contains("Field fake_field not found in log schema. Query cannot proceed!")); Review Comment: The order of results seems to change in Java 17, but the result is the same. Ref: https://github.com/apache/hudi/pull/8955#issuecomment-1624527608 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
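The relaxed assertion above can be sketched in isolation. The two messages below are hypothetical stand-ins (the field listing is shortened), illustrating why a `contains()` check on the stable prefix is robust to the Java 17 field-ordering change while exact equality is not:

```java
public class Main {
    public static void main(String[] args) {
        // Hypothetical error messages: the prefix is fixed, but the
        // derived-field listing can come back in a different order per JDK.
        String onJava8 = "Field fake_field not found in log schema. Query cannot proceed!"
                + " Derived Schema Fields: [_row_key, timestamp, pii_col]";
        String onJava17 = "Field fake_field not found in log schema. Query cannot proceed!"
                + " Derived Schema Fields: [pii_col, _row_key, timestamp]";
        String stablePrefix = "Field fake_field not found in log schema. Query cannot proceed!";

        // The contains() check on the stable prefix passes on both JDKs...
        if (!onJava8.contains(stablePrefix) || !onJava17.contains(stablePrefix)) {
            throw new AssertionError("contains() check should pass on both JDKs");
        }
        // ...while an exact-equality assertion could only pass on one of them.
        if (onJava8.equals(onJava17)) {
            throw new AssertionError("full messages differ when field order differs");
        }
        System.out.println("ok");
    }
}
```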
[GitHub] [hudi] soumilshah1995 closed issue #9210: [SUPPORT] Apache Hudi Partition Compaction
soumilshah1995 closed issue #9210: [SUPPORT] Apache Hudi Partition Compaction URL: https://github.com/apache/hudi/issues/9210 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] CTTY commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
CTTY commented on code in PR #9136: URL: https://github.com/apache/hudi/pull/9136#discussion_r1267318336 ## pom.xml: ## @@ -156,7 +156,7 @@ flink-clients flink-connector-kafka flink-hadoop-compatibility_2.12 -5.17.2 +7.5.3 Review Comment: Right, RocksDB `5.17.2` would throw `NoClassDefFoundError` when running `TestHoodieLogFormat` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9224: seems to be working
hudi-bot commented on PR #9224: URL: https://github.com/apache/hudi/pull/9224#issuecomment-1640957631 ## CI report: * 74d2ddcf295168b82be4a26e383c8e7495487107 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18671) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
hudi-bot commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640956642 ## CI report: * 5dc00a2d02cca3b242a54c3294ef3c30d6a66b3f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18670) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
hudi-bot commented on PR #9136: URL: https://github.com/apache/hudi/pull/9136#issuecomment-1640947990 ## CI report: * a0e7207fb19738237d56fa0060c91cb7865ae9c0 UNKNOWN * cda1e7724e6267ec471d8c318cd22703a2ecb69f UNKNOWN * 73d4660734fbcf528b482df2460944ba51431eea Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18645) * 0909e9991595a5f6c48181ff8db82a6dbebc49b8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18672) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6558) Support SQL Update for CoW when no precombine field is defined
[ https://issues.apache.org/jira/browse/HUDI-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kazdy updated HUDI-6558: Description: Updates without precombine field (for COW only) are already supported in MERGE INTO > Support SQL Update for CoW when no precombine field is defined > -- > > Key: HUDI-6558 > URL: https://issues.apache.org/jira/browse/HUDI-6558 > Project: Apache Hudi > Issue Type: Improvement >Reporter: kazdy >Priority: Major > > Updates without precombine field (for COW only) are already supported in MERGE > INTO -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] yihua commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
yihua commented on code in PR #9136: URL: https://github.com/apache/hudi/pull/9136#discussion_r1267265487 ## pom.xml: ## @@ -2614,6 +2614,18 @@ + + java17 + +-Xmx2g --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djol.magicFieldOffset=true Review Comment: Are these args needed for running production jobs on Java 17? ## pom.xml: ## @@ -156,7 +156,7 @@ flink-clients flink-connector-kafka flink-hadoop-compatibility_2.12 -5.17.2 +7.5.3 Review Comment: does RocksDB `5.17.2` not work? Dependency version upgrade has larger impact. ## packaging/bundle-validation/docker_test_java17.sh: ## @@ -0,0 +1,170 @@ +#!/bin/bash + +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. 
See the License for the +# specific language governing permissions and limitations +# under the License. + +# +# NOTE: this script runs inside hudi-ci-bundle-validation container +# $WORKDIR/jars/ is to mount to a host directory where bundle jars are placed +# $WORKDIR/data/ is to mount to a host directory where test data are placed with structures like +#- /schema.avsc +#- /data/ +# + Review Comment: Could we consolidate the test logic of this into `validate.sh` and reuse existing validate-bundle job? ## style/checkstyle.xml: ## @@ -269,7 +269,7 @@ + value="^java\.util\.Optional, ^org\.junit\.(?!jupiter|platform|contrib|Rule|runner|Assume)(.*)"/> Review Comment: Let's use the jupiter version instead of changing this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6558) Support SQL Update for CoW when no precombine field is defined
kazdy created HUDI-6558: --- Summary: Support SQL Update for CoW when no precombine field is defined Key: HUDI-6558 URL: https://issues.apache.org/jira/browse/HUDI-6558 Project: Apache Hudi Issue Type: Improvement Reporter: kazdy -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] Sam-Serpoosh commented on issue #9143: [SUPPORT] Failure to delete records with missing attributes from PostgresDebeziumSource
Sam-Serpoosh commented on issue #9143: URL: https://github.com/apache/hudi/issues/9143#issuecomment-1640911318 @ad1happy2go Looks like `REPLICA IDENTITY FULL` is mostly discouraged by PG ([interesting article](https://xata.io/blog/replica-identity-full-performance) and [SO Thread](https://stackoverflow.com/a/67979022/1433222)). It would be **ideal** not to have to change this setting to `FULL` to avoid the downsides. I know Hudi has the limitation on **global uniqueness** when dealing with **partitioned Hudi Tables**. So is there any way to make this work with **partitioned Hudi Tables** without having to set REPLICA IDENTITY to `FULL`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
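One option sometimes raised for the partitioned-table uniqueness constraint mentioned above is a global index, which enforces record-key uniqueness across all partitions and locates a record for update/delete even when the incoming event does not carry the partition field. Whether it resolves this specific Debezium case is an open question here, not a confirmed answer; `hoodie.index.type` is a real Hudi writer config, but the key and partition field names below are illustrative placeholders:

```java
import java.util.HashMap;
import java.util.Map;

public class Main {
    public static void main(String[] args) {
        // Sketch of writer options for a global index; field names are hypothetical.
        Map<String, String> hudiOptions = new HashMap<>();
        hudiOptions.put("hoodie.datasource.write.recordkey.field", "id");
        hudiOptions.put("hoodie.datasource.write.partitionpath.field", "created_date");
        // GLOBAL_BLOOM looks a key up across every partition, at the cost of a
        // more expensive index lookup than the partition-scoped BLOOM index.
        hudiOptions.put("hoodie.index.type", "GLOBAL_BLOOM");

        hudiOptions.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The trade-off is lookup cost: a global index scans candidate files in all partitions, so it is slower on large tables than a partition-scoped index.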
[jira] [Comment Edited] (HUDI-6556) Big Query sync with master code failing for partitioned table with the Exception
[ https://issues.apache.org/jira/browse/HUDI-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744345#comment-17744345 ] Aditya Goenka edited comment on HUDI-6556 at 7/18/23 7:53 PM: -- While testing, I was using the partitionPath as slash-encoded, which was causing this issue. It's working as expected after I updated the partition column, so closing this issue. was (Author: JIRAUSER299651): While testing, I was using the partitionPath as slash-encoded, which was causing this issue. It's working as expected, so closing this issue. > Big Query sync with master code failing for partitioned table with the > Exception > > > Key: HUDI-6556 > URL: https://issues.apache.org/jira/browse/HUDI-6556 > Project: Apache Hudi > Issue Type: Bug > Components: meta-sync >Reporter: Aditya Goenka >Priority: Blocker > Labels: 0.14.0 > > While doing Big Query Sync for partitioned table, its failing with below > Exception - > error message: Failed to add partition key partitionpath (type: TYPE_STRING) > to schema, because another column with the same name was already present. > This is not allowed. Full partition schema: [partitionpath:TYPE_STRING]." -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6556) Big Query sync with master code failing for partitioned table with the Exception
[ https://issues.apache.org/jira/browse/HUDI-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aditya Goenka closed HUDI-6556. --- Resolution: Not A Problem > Big Query sync with master code failing for partitioned table with the > Exception > > > Key: HUDI-6556 > URL: https://issues.apache.org/jira/browse/HUDI-6556 > Project: Apache Hudi > Issue Type: Bug > Components: meta-sync >Reporter: Aditya Goenka >Priority: Blocker > Labels: 0.14.0 > > While doing Big Query Sync for partitioned table, its failing with below > Exception - > error message: Failed to add partition key partitionpath (type: TYPE_STRING) > to schema, because another column with the same name was already present. > This is not allowed. Full partition schema: [partitionpath:TYPE_STRING]." -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6556) Big Query sync with master code failing for partitioned table with the Exception
[ https://issues.apache.org/jira/browse/HUDI-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744345#comment-17744345 ] Aditya Goenka commented on HUDI-6556: - While testing, I was using the partitionPath as slash-encoded, which was causing this issue. It's working as expected, so closing this issue. > Big Query sync with master code failing for partitioned table with the > Exception > > > Key: HUDI-6556 > URL: https://issues.apache.org/jira/browse/HUDI-6556 > Project: Apache Hudi > Issue Type: Bug > Components: meta-sync >Reporter: Aditya Goenka >Priority: Blocker > Labels: 0.14.0 > > While doing Big Query Sync for partitioned table, its failing with below > Exception - > error message: Failed to add partition key partitionpath (type: TYPE_STRING) > to schema, because another column with the same name was already present. > This is not allowed. Full partition schema: [partitionpath:TYPE_STRING]." -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] ad1happy2go commented on issue #9042: [SUPPORT] Cannot write nullable values to non-null column
ad1happy2go commented on issue #9042: URL: https://github.com/apache/hudi/issues/9042#issuecomment-1640905833 Yes, I also confirmed with master that I am not seeing this issue. @dht7 Can you check with the master code, if possible, whether you are still facing this issue? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table
hudi-bot commented on PR #8847: URL: https://github.com/apache/hudi/pull/8847#issuecomment-1640904862 ## CI report: * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN * 29abf1ce1345bfe299685fcc3b496f365f109e76 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18664) * 1f8c2e4cb0da6d322b9f03657463b406f189350a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18673) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua commented on a diff in pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
yihua commented on code in PR #9136: URL: https://github.com/apache/hudi/pull/9136#discussion_r1267072983 ## .github/workflows/bot.yml: ## @@ -112,6 +112,91 @@ jobs: run: mvn test -Pfunctional-tests -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS + test-spark-java17: +runs-on: ubuntu-latest +strategy: + matrix: +include: + - scalaProfile: "scala-2.12" +sparkProfile: "spark3.3" +sparkModules: "hudi-spark-datasource/hudi-spark3.3.x" + - scalaProfile: "scala-2.12" +sparkProfile: "spark3.4" +sparkModules: "hudi-spark-datasource/hudi-spark3.4.x" + +steps: + - uses: actions/checkout@v3 + - name: Set up JDK 8 +uses: actions/setup-java@v3 +with: + java-version: '8' + distribution: 'adopt' + architecture: x64 + - name: Build Project +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} +run: + mvn clean install -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -DskipTests=true $MVN_ARGS + - name: Set up JDK 17 +uses: actions/setup-java@v3 +with: + java-version: '17' + distribution: 'adopt' + architecture: x64 + - name: Quickstart Test +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} +run: + mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl hudi-examples/hudi-examples-spark $MVN_ARGS + - name: UT - Common & Spark +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} + SPARK_MODULES: ${{ matrix.sparkModules }} +if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 as it's covered by Azure CI +run: + mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS + - name: FT - Spark +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} + SPARK_MODULES: ${{ matrix.sparkModules }} +if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 as it's covered by Azure CI 
+run: + mvn test -Pfunctional-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS + + docker-test-java17: Review Comment: Could this be run with `validate-bundles` since it already validates bundles on Java 17? Any reason to have a separate job here? ## .github/workflows/bot.yml: ## @@ -112,6 +112,91 @@ jobs: run: mvn test -Pfunctional-tests -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS + test-spark-java17: +runs-on: ubuntu-latest +strategy: + matrix: +include: + - scalaProfile: "scala-2.12" +sparkProfile: "spark3.3" +sparkModules: "hudi-spark-datasource/hudi-spark3.3.x" + - scalaProfile: "scala-2.12" +sparkProfile: "spark3.4" +sparkModules: "hudi-spark-datasource/hudi-spark3.4.x" + +steps: + - uses: actions/checkout@v3 + - name: Set up JDK 8 +uses: actions/setup-java@v3 +with: + java-version: '8' + distribution: 'adopt' + architecture: x64 + - name: Build Project +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} +run: + mvn clean install -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -DskipTests=true $MVN_ARGS + - name: Set up JDK 17 +uses: actions/setup-java@v3 +with: + java-version: '17' + distribution: 'adopt' + architecture: x64 + - name: Quickstart Test +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} +run: + mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl hudi-examples/hudi-examples-spark $MVN_ARGS + - name: UT - Common & Spark +env: + SCALA_PROFILE: ${{ matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} + SPARK_MODULES: ${{ matrix.sparkModules }} +if: ${{ !endsWith(env.SPARK_PROFILE, '3.2') }} # skip test spark 3.2 as it's covered by Azure CI +run: + mvn test -Punit-tests -Pjava17 -D"$SCALA_PROFILE" -D"$SPARK_PROFILE" -pl "hudi-common,$SPARK_COMMON_MODULES,$SPARK_MODULES" $MVN_ARGS + - name: FT - Spark +env: + SCALA_PROFILE: ${{ 
matrix.scalaProfile }} + SPARK_PROFILE: ${{ matrix.sparkProfile }} + SPARK_MODULES: ${{ matrix.spark
[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
hudi-bot commented on PR #9203: URL: https://github.com/apache/hudi/pull/9203#issuecomment-1640897010 ## CI report: * 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18669) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
hudi-bot commented on PR #9136: URL: https://github.com/apache/hudi/pull/9136#issuecomment-1640896780 ## CI report: * a0e7207fb19738237d56fa0060c91cb7865ae9c0 UNKNOWN * cda1e7724e6267ec471d8c318cd22703a2ecb69f UNKNOWN * 73d4660734fbcf528b482df2460944ba51431eea Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18645) * 0909e9991595a5f6c48181ff8db82a6dbebc49b8 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table
hudi-bot commented on PR #8847: URL: https://github.com/apache/hudi/pull/8847#issuecomment-1640896014 ## CI report: * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN * 29abf1ce1345bfe299685fcc3b496f365f109e76 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18664) * 1f8c2e4cb0da6d322b9f03657463b406f189350a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] amrishlal commented on issue #9042: [SUPPORT] Cannot write nullable values to non-null column
amrishlal commented on issue #9042: URL: https://github.com/apache/hudi/issues/9042#issuecomment-1640857564 @ad1happy2go I am not able to reproduce the issue against the latest master version of Hudi using either spark-3.1 or spark-3.2 with the steps you outlined. Do we know if this issue is limited to older versions of Hudi (version 0.12.2, as reported in the description)? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9224: seems to be working
hudi-bot commented on PR #9224: URL: https://github.com/apache/hudi/pull/9224#issuecomment-1640765610 ## CI report: * 74d2ddcf295168b82be4a26e383c8e7495487107 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18671) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9224: seems to be working
hudi-bot commented on PR #9224: URL: https://github.com/apache/hudi/pull/9224#issuecomment-1640753251 ## CI report: * 74d2ddcf295168b82be4a26e383c8e7495487107 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-6315) Optimize UPSERT and DELETE codepath to use meta fields instead of key generation and index lookup
[ https://issues.apache.org/jira/browse/HUDI-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amrish Lal closed HUDI-6315. Resolution: Done > Optimize UPSERT and DELETE codepath to use meta fields instead of key > generation and index lookup > - > > Key: HUDI-6315 > URL: https://issues.apache.org/jira/browse/HUDI-6315 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Amrish Lal >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > For MIT, Update and Delete, we do a look up in hudi to find matching records > based on the predicates and then trigger the writes following it. But the > records fetched from hudi already contains all meta fields that is required > for key generation and index look up (like the record key, partition path, > filename, commit time). But as of now, we drop those meta fields and trigger > an upsert to hudi (as though someone is writing via spark-datasource). This > goes via regular code path of key generation and index lookup which is > unnecessary. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6315) Optimize UPSERT and DELETE codepath to use meta fields instead of key generation and index lookup
[ https://issues.apache.org/jira/browse/HUDI-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744323#comment-17744323 ] Amrish Lal commented on HUDI-6315: -- Issue has been resolved using the pull requests linked to this ticket. > Optimize UPSERT and DELETE codepath to use meta fields instead of key > generation and index lookup > - > > Key: HUDI-6315 > URL: https://issues.apache.org/jira/browse/HUDI-6315 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Amrish Lal >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > For MIT, Update and Delete, we do a look up in hudi to find matching records > based on the predicates and then trigger the writes following it. But the > records fetched from hudi already contains all meta fields that is required > for key generation and index look up (like the record key, partition path, > filename, commit time). But as of now, we drop those meta fields and trigger > an upsert to hudi (as though someone is writing via spark-datasource). This > goes via regular code path of key generation and index lookup which is > unnecessary. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-6315) Optimize UPSERT and DELETE codepath to use meta fields instead of key generation and index lookup
[ https://issues.apache.org/jira/browse/HUDI-6315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amrish Lal resolved HUDI-6315. -- > Optimize UPSERT and DELETE codepath to use meta fields instead of key > generation and index lookup > - > > Key: HUDI-6315 > URL: https://issues.apache.org/jira/browse/HUDI-6315 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Amrish Lal >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > For MIT, Update and Delete, we do a look up in hudi to find matching records > based on the predicates and then trigger the writes following it. But the > records fetched from hudi already contains all meta fields that is required > for key generation and index look up (like the record key, partition path, > filename, commit time). But as of now, we drop those meta fields and trigger > an upsert to hudi (as though someone is writing via spark-datasource). This > goes via regular code path of key generation and index lookup which is > unnecessary. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] jonvex opened a new pull request, #9224: seems to be working
jonvex opened a new pull request, #9224: URL: https://github.com/apache/hudi/pull/9224 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
hudi-bot commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640675026 ## CI report: * 8b6fba8468a155d39a66dc57acb6ac8c5e29b294 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18668) * 5dc00a2d02cca3b242a54c3294ef3c30d6a66b3f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18670) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8847: [HUDI-2071] Support Reading Bootstrap MOR RT Table In Spark DataSource Table
hudi-bot commented on PR #8847: URL: https://github.com/apache/hudi/pull/8847#issuecomment-1640665241 ## CI report: * fe991dc492e5bec19b4bfd91dc0b210e6b152b7a UNKNOWN * 29abf1ce1345bfe299685fcc3b496f365f109e76 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18664) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
hudi-bot commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640665363 ## CI report: * 2d4e285ba5ef3c5b07ec91af6ab3a2669d2b485d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17565) * 8b6fba8468a155d39a66dc57acb6ac8c5e29b294 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18668) * 5dc00a2d02cca3b242a54c3294ef3c30d6a66b3f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] ehurheap commented on issue #9079: [SUPPORT] Hudi delete not working when using UuidKeyGenerator
ehurheap commented on issue #9079: URL: https://github.com/apache/hudi/issues/9079#issuecomment-1640664456 Yes, the workaround using the writeClient that we discussed in [slack](https://apache-hudi.slack.com/archives/C4D716NPQ/p1689111633808279?thread_ts=1687983367.526889&cid=C4D716NPQ) worked for me. Here is a summary: we build the writeClient:
```
def buildWriteClient(): SparkRDDWriteClient[_] = {
  val lockProperties = new Properties()
  // populate lockProperties as appropriate
  val metricsProperties = new Properties()
  // populate metricsProperties as appropriate
  val writerConfig = HoodieWriteConfig
    .newBuilder()
    .withCompactionConfig(
      HoodieCompactionConfig
        .newBuilder()
        .withInlineCompaction(true)
        .withScheduleInlineCompaction(false)
        .withMaxNumDeltaCommitsBeforeCompaction(1)
        .build()
    )
    .withArchivalConfig(HoodieArchivalConfig.newBuilder().withAutoArchive(false).build())
    .withCleanConfig(HoodieCleanConfig.newBuilder().withAutoClean(false).build())
    .withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(false).build())
    .withLockConfig(HoodieLockConfig.newBuilder().fromProperties(lockProperties).build())
    .withMetricsConfig(HoodieMetricsConfig.newBuilder().fromProperties(metricsProperties).build())
    .withDeleteParallelism(config.deleteParallelism)
    .withPath(config.tablePath)
    .forTable(datalakeRecord.tableName)
    .build()
  val engineContext: HoodieEngineContext = new HoodieSparkEngineContext(
    JavaSparkContext.fromSparkContext(sparkContext)
  )
  new SparkRDDWriteClient(engineContext, writerConfig)
}
```
Then run delete and compaction for the specified keys:
```
var deleteInstant: String = ""
try {
  deleteInstant = writeClient.startCommit()
  writeClient.delete(keysToDelete, deleteInstant)
  // :TRICKY: explicitly calling compaction here: although the write client was configured
  // to auto compact in-line, compaction is not in fact triggered by this delete operation.
  val maybeCompactionInstant = writeClient.scheduleCompaction(org.apache.hudi.common.util.Option.empty())
  if (maybeCompactionInstant.isPresent)
    writeClient.compact(maybeCompactionInstant.get)
  else
    log.warn(
      s"Unable to schedule compaction after delete operation at instant ${deleteInstant}"
    )
} catch {
  case t: Throwable =>
    logErrorAndExit(s"Delete operation failed for instant ${deleteInstant} due to ", t)
} finally {
  log.info(s"Finished delete operation for instant ${deleteInstant}")
  writeClient.close()
}
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] CTTY commented on pull request #9136: [HUDI-6509] Add GitHub CI for Java 17
CTTY commented on PR #9136: URL: https://github.com/apache/hudi/pull/9136#issuecomment-1640605864 It seems `validate-bundles(flink1.17, Spark 3.4, Spark 3.4.0)` consistently fails with JDK17 on the issue below:
```
Connecting to jdbc:hive2://localhost:1/default
23/07/17 17:45:48 [main]: WARN jdbc.HiveConnection: Failed to connect to localhost:1
Could not open connection to the HS2 server. Please check the server URI and if the URI is correct, then ask the administrator to check the server status.
Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:1/default: java.net.ConnectException: Connection refused (Connection refused) (state=08S01,code=0)
Cannot run commands specified using -e. No current connection
Error: validate.sh HiveQL validation failed.
Error: Process completed with exit code 1.
```
Need to look into this; otherwise everything looks good. The newly added docker-test-java17 and test-spark-java17 jobs are working fine. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
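The `Connection refused` in the log above means nothing was accepting connections on the HiveServer2 port when beeline ran, so the failure is in server startup rather than in the HiveQL itself. A minimal pre-flight probe can distinguish the two cases; this is a hypothetical helper (not part of Hudi's `validate.sh`), and the host and port are assumptions since the port in the quoted log is truncated:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class Hs2Probe {
    /** Returns true if something is listening on host:port within timeoutMs. */
    static boolean isListening(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // ConnectException ("Connection refused") or timeout: no server there yet
            return false;
        }
    }

    public static void main(String[] args) {
        // 10000 is the conventional HiveServer2 thrift port; adjust to the CI setup.
        if (!isListening("localhost", 10000, 2000)) {
            System.out.println("HiveServer2 not reachable; beeline would fail with 'Connection refused'");
        }
    }
}
```

Running such a probe (or retrying it in a loop) before the beeline validation step would make the CI failure mode explicit instead of surfacing as a generic `validate.sh HiveQL validation failed`.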
[GitHub] [hudi] yihua commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
yihua commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640577138 @KnightChess do you have any number on the performance improvement on updating MDT from this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua commented on a diff in pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
yihua commented on code in PR #8856: URL: https://github.com/apache/hudi/pull/8856#discussion_r1267028523 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -850,59 +851,58 @@ public static HoodieData convertFilesToBloomFilterRecords(HoodieEn String instantTime) { HoodieData allRecordsRDD = engineContext.emptyHoodieData(); -List>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet() -.stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList()); -int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); -HoodieData>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism); +List> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet().stream().flatMap(entry -> { + List filesList = entry.getValue(); + return filesList.stream().map(file -> Pair.of(entry.getKey(), file)); +}).collect(Collectors.toList()); -HoodieData deletedFilesRecordsRDD = partitionToDeletedFilesRDD.flatMap(partitionToDeletedFilesPair -> { - final String partitionName = partitionToDeletedFilesPair.getLeft(); - final List deletedFileList = partitionToDeletedFilesPair.getRight(); - return deletedFileList.stream().flatMap(deletedFile -> { -if (!FSUtils.isBaseFile(new Path(deletedFile))) { - return Stream.empty(); -} +int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); +HoodieData> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism); -final String partition = getPartitionIdentifier(partitionName); -return Stream.of(HoodieMetadataPayload.createBloomFilterMetadataRecord( -partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true)); - }).iterator(); -}); +HoodieData deletedFilesRecordsRDD = partitionToDeletedFilesRDD.map(partitionToDeletedFilePair -> { + String 
partitionName = partitionToDeletedFilePair.getLeft(); + String deletedFile = partitionToDeletedFilePair.getRight(); + if (!FSUtils.isBaseFile(new Path(deletedFile))) { +return null; + } + final String partition = getPartitionIdentifier(partitionName); + return (HoodieRecord) (HoodieMetadataPayload.createBloomFilterMetadataRecord( + partition, deletedFile, instantTime, StringUtils.EMPTY_STRING, ByteBuffer.allocate(0), true)); +}).filter(Objects::nonNull); allRecordsRDD = allRecordsRDD.union(deletedFilesRecordsRDD); -List>> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet() -.stream().map(entry -> Pair.of(entry.getKey(), entry.getValue())).collect(Collectors.toList()); +List> partitionToAppendedFilesList = partitionToAppendedFiles.entrySet().stream().flatMap(entry -> { + Set filesSet = entry.getValue().keySet(); + return filesSet.stream().map(file -> Pair.of(entry.getKey(), file)); Review Comment: ```suggestion return entry.getValue().keySet().stream().map(file -> Pair.of(entry.getKey(), file)); ``` ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -850,59 +851,58 @@ public static HoodieData convertFilesToBloomFilterRecords(HoodieEn String instantTime) { HoodieData allRecordsRDD = engineContext.emptyHoodieData(); -List>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet() -.stream().map(e -> Pair.of(e.getKey(), e.getValue())).collect(Collectors.toList()); -int parallelism = Math.max(Math.min(partitionToDeletedFilesList.size(), recordsGenerationParams.getBloomIndexParallelism()), 1); -HoodieData>> partitionToDeletedFilesRDD = engineContext.parallelize(partitionToDeletedFilesList, parallelism); +List> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet().stream().flatMap(entry -> { + List filesList = entry.getValue(); + return filesList.stream().map(file -> Pair.of(entry.getKey(), file)); Review Comment: ```suggestion return entry.getValue().stream().map(file -> Pair.of(entry.getKey(), 
file)); ``` ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ## @@ -927,48 +927,44 @@ public static HoodieData convertFilesToColumnStatsRecords(HoodieEn return engineContext.emptyHoodieData(); } -final List>> partitionToDeletedFilesList = partitionToDeletedFiles.entrySet().stream() -.map(e -> Pair.of(e.getKey(), e.getValue())) -.collect(Collectors.toList()); +List> partitionToDeletedFilesList =
[GitHub] [hudi] hudi-bot commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.
hudi-bot commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1640565751 ## CI report: * 16ae34ec0e91811bae11a980749f5b77d048adba Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18519) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18646) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18663) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9223: [HUDI-6553] Speedup column stats and bloom index creation on large datasets.
hudi-bot commented on PR #9223: URL: https://github.com/apache/hudi/pull/9223#issuecomment-1640556024 ## CI report: * 80179dfbcb1179d93d6aaced4f956db72363a347 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18662) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
hudi-bot commented on PR #9203: URL: https://github.com/apache/hudi/pull/9203#issuecomment-1640555900 ## CI report: * 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18669) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6557) Deletes are not working if using custom timestamp for replica identity DEFAULT
Aditya Goenka created HUDI-6557: --- Summary: Deletes are not working if using custom timestamp for replica identity DEFAULT Key: HUDI-6557 URL: https://issues.apache.org/jira/browse/HUDI-6557 Project: Apache Hudi Issue Type: Bug Components: deltastreamer Reporter: Aditya Goenka Fix For: 0.14.0 Debezium provides only the primary key value for DELETE records; for the other columns it gives null or 0. The timestamp converter then converts 0 to 1970-01-01 and tries to delete the record from that partition, and the delete fails. Github issue - [https://github.com/apache/hudi/issues/9143] -- This message was sent by Atlassian Jira (v8.20.10#820010)
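The mechanism behind this bug is easy to demonstrate in isolation: epoch value 0 formatted as a date is 1970-01-01, so a timestamp-derived partition path computed from a Debezium DELETE record (which carries 0 for non-key columns under replica identity DEFAULT) points at the wrong partition. A minimal sketch, illustrative only and not Hudi's actual TimestampBasedKeyGenerator code:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class EpochZeroPartition {
    public static void main(String[] args) {
        // Under replica identity DEFAULT, Debezium emits 0 (not the original value)
        // for the timestamp column of a DELETE record.
        long tsFromDeleteRecord = 0L;

        // A timestamp-based key generator derives the partition path from this value.
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);
        String partitionPath = fmt.format(Instant.ofEpochMilli(tsFromDeleteRecord));

        System.out.println(partitionPath); // prints 1970-01-01, not the record's real partition
    }
}
```

Because the derived partition (1970-01-01) does not contain the record, the delete finds nothing to remove, which matches the behavior reported in the linked GitHub issue.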
[jira] [Closed] (HUDI-6411) Make SQL parameters case insensitive
[ https://issues.apache.org/jira/browse/HUDI-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aditya Goenka closed HUDI-6411. --- Resolution: Fixed > Make SQL parameters case insensitive > - > > Key: HUDI-6411 > URL: https://issues.apache.org/jira/browse/HUDI-6411 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: Aditya Goenka >Priority: Major > Labels: pull-request-available > Fix For: 1.1.0, 0.15.0 > > > Users should give spark sql parameters(like - recordKey, precombineField) > with any case , and we should be able to parse it. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6411) Make SQL parameters case insensitive
[ https://issues.apache.org/jira/browse/HUDI-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744285#comment-17744285 ] Aditya Goenka commented on HUDI-6411: - The PR is merged. Closing the JIRA. > Make SQL parameters case insensitive > - > > Key: HUDI-6411 > URL: https://issues.apache.org/jira/browse/HUDI-6411 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: Aditya Goenka >Priority: Major > Labels: pull-request-available > Fix For: 1.1.0, 0.15.0 > > > Users should give spark sql parameters(like - recordKey, precombineField) > with any case , and we should be able to parse it. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6411) Make SQL parameters case insensitive
[ https://issues.apache.org/jira/browse/HUDI-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aditya Goenka updated HUDI-6411: Status: Patch Available (was: In Progress) > Make SQL parameters case insensitive > - > > Key: HUDI-6411 > URL: https://issues.apache.org/jira/browse/HUDI-6411 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: Aditya Goenka >Priority: Major > Labels: pull-request-available > Fix For: 1.1.0, 0.15.0 > > > Users should give spark sql parameters(like - recordKey, precombineField) > with any case , and we should be able to parse it. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-6411) Make SQL parameters case insensitive
[ https://issues.apache.org/jira/browse/HUDI-6411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aditya Goenka resolved HUDI-6411. - > Make SQL parameters case insensitive > - > > Key: HUDI-6411 > URL: https://issues.apache.org/jira/browse/HUDI-6411 > Project: Apache Hudi > Issue Type: Improvement > Components: spark-sql >Reporter: Aditya Goenka >Priority: Major > Labels: pull-request-available > Fix For: 1.1.0, 0.15.0 > > > Users should give spark sql parameters(like - recordKey, precombineField) > with any case , and we should be able to parse it. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
hudi-bot commented on PR #9203: URL: https://github.com/apache/hudi/pull/9203#issuecomment-1640493692 ## CI report: * 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18669) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
hudi-bot commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640492244 ## CI report: * 2d4e285ba5ef3c5b07ec91af6ab3a2669d2b485d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17565) * 8b6fba8468a155d39a66dc57acb6ac8c5e29b294 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18668) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch asf-site updated: [HUDI-6520] [DOCS] Rename Deltastreamer and related classes and configs (#9179)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 6d5a4e2a6a7 [HUDI-6520] [DOCS] Rename Deltastreamer and related classes and configs (#9179) 6d5a4e2a6a7 is described below commit 6d5a4e2a6a71f9f89f901169107e23665d034440 Author: Amrish Lal AuthorDate: Tue Jul 18 08:48:22 2023 -0700 [HUDI-6520] [DOCS] Rename Deltastreamer and related classes and configs (#9179) Co-authored-by: Y Ethan Guo --- website/docs/clustering.md| 10 +- website/docs/compaction.md| 8 +- website/docs/concurrency_control.md | 20 ++-- website/docs/deployment.md| 18 +-- website/docs/docker_demo.md | 26 ++-- website/docs/faq.md | 24 ++-- website/docs/gcp_bigquery.md | 10 +- website/docs/hoodie_deltastreamer.md | 163 ++ website/docs/key_generation.md| 76 ++-- website/docs/metadata_indexing.md | 14 +-- website/docs/metrics.md | 4 +- website/docs/migration_guide.md | 6 +- website/docs/precommit_validator.md | 2 +- website/docs/querying_data.md | 2 +- website/docs/quick-start-guide.md | 2 +- website/docs/s3_hoodie.md | 2 +- website/docs/syncing_aws_glue_data_catalog.md | 2 +- website/docs/syncing_datahub.md | 10 +- website/docs/syncing_metastore.md | 2 +- website/docs/transforms.md| 6 +- website/docs/use_cases.md | 2 +- website/docs/write_operations.md | 2 +- website/docs/writing_data.md | 2 +- 23 files changed, 213 insertions(+), 200 deletions(-) diff --git a/website/docs/clustering.md b/website/docs/clustering.md index d2ceb196d02..8eb0dfbfaa1 100644 --- a/website/docs/clustering.md +++ b/website/docs/clustering.md @@ -283,17 +283,17 @@ hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run hoodie.clustering.plan.strategy.sort.columns=column1,column2 ``` -### HoodieDeltaStreamer +### HoodieStreamer -This brings us to our users' favorite utility in Hudi. 
Now, we can trigger asynchronous clustering with DeltaStreamer. +This brings us to our users' favorite utility in Hudi. Now, we can trigger asynchronous clustering with Hudi Streamer. Just set the `hoodie.clustering.async.enabled` config to true and specify other clustering config in properties file -whose location can be pased as `—props` when starting the deltastreamer (just like in the case of HoodieClusteringJob). +whose location can be pased as `—props` when starting the Hudi Streamer (just like in the case of HoodieClusteringJob). -A sample spark-submit command to setup HoodieDeltaStreamer is as below: +A sample spark-submit command to setup HoodieStreamer is as below: ```bash spark-submit \ ---class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ +--class org.apache.hudi.utilities.streamer.HoodieStreamer \ /path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar \ --props /path/to/config/clustering_kafka.properties \ --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \ diff --git a/website/docs/compaction.md b/website/docs/compaction.md index a6249b7ae7c..9f7b119db43 100644 --- a/website/docs/compaction.md +++ b/website/docs/compaction.md @@ -45,14 +45,14 @@ import org.apache.spark.sql.streaming.ProcessingTime; writer.trigger(new ProcessingTime(3)).start(tablePath); ``` -### DeltaStreamer Continuous Mode -Hudi DeltaStreamer provides continuous ingestion mode where a single long running spark application +### Hudi Streamer Continuous Mode +Hudi Streamer provides continuous ingestion mode where a single long running spark application ingests data to Hudi table continuously from upstream sources. In this mode, Hudi supports managing asynchronous compactions. 
Here is an example snippet for running in continuous mode with async compactions ```properties spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.6.0 \ ---class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ +--class org.apache.hudi.utilities.streamer.HoodieStreamer \ --table-type MERGE_ON_READ \ --target-base-path \ --target-table \ @@ -76,7 +76,7 @@ you may want Synchronous compaction, which means that as a commit is written it Compaction is run synchronously by passing the flag "--disable-compaction" (Meaning to disable async compaction scheduling). When both ingestion and compaction is running in the same spark context, you can use resource allocation configuration -in DeltaStreamer CLI s
[GitHub] [hudi] yihua merged pull request #9179: [HUDI-6520] [DOCS] Rename Deltastreamer and related classes and configs
yihua merged PR #9179: URL: https://github.com/apache/hudi/pull/9179 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6555) Big Query Sync failing with Class Not Found Exception for DeleteRecord
[ https://issues.apache.org/jira/browse/HUDI-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aditya Goenka updated HUDI-6555: Priority: Blocker (was: Major) > Big Query Sync failing with Class Not Found Exception for DeleteRecord > -- > > Key: HUDI-6555 > URL: https://issues.apache.org/jira/browse/HUDI-6555 > Project: Apache Hudi > Issue Type: Bug > Components: meta-sync >Reporter: Aditya Goenka >Priority: Blocker > Fix For: 0.14.0 > > > With version 0.13 , BQ sync is failing with error `Caused by: > com.esotericsoftware.kryo.KryoException: Unable to find class: > [Lorg.apache.hudi.common.model.DeleteRecord;` during KryoSerialization phase. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6556) Big Query sync with master code failing for partitioned table with the Exception
Aditya Goenka created HUDI-6556: --- Summary: Big Query sync with master code failing for partitioned table with the Exception Key: HUDI-6556 URL: https://issues.apache.org/jira/browse/HUDI-6556 Project: Apache Hudi Issue Type: Bug Components: meta-sync Reporter: Aditya Goenka While doing Big Query sync for a partitioned table, it fails with the exception below: error message: "Failed to add partition key partitionpath (type: TYPE_STRING) to schema, because another column with the same name was already present. This is not allowed. Full partition schema: [partitionpath:TYPE_STRING]." -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
hudi-bot commented on PR #9203: URL: https://github.com/apache/hudi/pull/9203#issuecomment-1640477276 ## CI report: * 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6555) Big Query Sync failing with Class Not Found Exception for DeleteRecord
Aditya Goenka created HUDI-6555: --- Summary: Big Query Sync failing with Class Not Found Exception for DeleteRecord Key: HUDI-6555 URL: https://issues.apache.org/jira/browse/HUDI-6555 Project: Apache Hudi Issue Type: Bug Components: meta-sync Reporter: Aditya Goenka Fix For: 0.14.0 With version 0.13 , BQ sync is failing with error `Caused by: com.esotericsoftware.kryo.KryoException: Unable to find class: [Lorg.apache.hudi.common.model.DeleteRecord;` during KryoSerialization phase. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
hudi-bot commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640475603 ## CI report: * 2d4e285ba5ef3c5b07ec91af6ab3a2669d2b485d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17565) * 8b6fba8468a155d39a66dc57acb6ac8c5e29b294 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
hudi-bot commented on PR #9203: URL: https://github.com/apache/hudi/pull/9203#issuecomment-1640457527 ## CI report: * 585935c37efc35994dd721ba2d8f05c9cf775470 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18665) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18667) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] KnightChess commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
KnightChess commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640444082 @yihua @codope @nsivabalan can you help review it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] KnightChess commented on pull request #8856: [HUDI-6300] fix file size parallelism not work when init metadata table
KnightChess commented on PR #8856: URL: https://github.com/apache/hudi/pull/8856#issuecomment-1640440727 rebase master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [HUDI-6544] Remove unnecessary merge for bootstrap files in merge helper (#9216)
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new be4dfccbb24  [HUDI-6544] Remove unnecessary merge for bootstrap files in merge helper (#9216)
be4dfccbb24 is described below

commit be4dfccbb24794dfac3714818971229870d24a2c
Author: Jon Vexler
AuthorDate: Tue Jul 18 11:20:57 2023 -0400

    [HUDI-6544] Remove unnecessary merge for bootstrap files in merge helper (#9216)
---
 .../hudi/table/action/commit/HoodieMergeHelper.java | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java
index 893ee3fc032..4df767b5e41 100644
--- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java
+++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java
@@ -18,7 +18,6 @@
 
 package org.apache.hudi.table.action.commit;
 
-import org.apache.hudi.client.utils.ClosableMergingIterator;
 import org.apache.hudi.common.config.HoodieCommonConfig;
 import org.apache.hudi.common.model.HoodieBaseFile;
 import org.apache.hudi.common.model.HoodieRecord;
@@ -109,11 +108,6 @@ public class HoodieMergeHelper extends BaseMergeHelper {
     try {
       ClosableIterator recordIterator;
-
-      // In case writer's schema is simply a projection of the reader's one we can read
-      // the records in the projected schema directly
-      ClosableIterator baseFileRecordIterator =
-          baseFileReader.getRecordIterator(isPureProjection ? writerSchema : readerSchema);
       Schema recordSchema;
       if (baseFile.getBootstrapBaseFile().isPresent()) {
         Path bootstrapFilePath = new Path(baseFile.getBootstrapBaseFile().get().getPath());
@@ -124,13 +118,12 @@ public class HoodieMergeHelper extends BaseMergeHelper {
             mergeHandle.getPartitionFields(), mergeHandle.getPartitionValues());
         recordSchema = mergeHandle.getWriterSchemaWithMetaFields();
-        recordIterator = new ClosableMergingIterator<>(
-            baseFileRecordIterator,
-            (ClosableIterator) bootstrapFileReader.getRecordIterator(recordSchema),
-            (left, right) -> left.joinWith(right, recordSchema));
+        recordIterator = (ClosableIterator) bootstrapFileReader.getRecordIterator(recordSchema);
       } else {
-        recordIterator = baseFileRecordIterator;
+        // In case writer's schema is simply a projection of the reader's one we can read
+        // the records in the projected schema directly
        recordSchema = isPureProjection ? writerSchema : readerSchema;
+        recordIterator = baseFileReader.getRecordIterator(recordSchema);
       }
       boolean isBufferingRecords = ExecutorFactory.isBufferingRecords(writeConfig);
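The control flow after this commit can be sketched as follows. This is a simplified, hypothetical illustration only — plain `Iterator<String>` stands in for Hudi's `ClosableIterator` and file readers, and the class name is invented: per the commit, the bootstrap branch now uses the bootstrap reader's iterator directly instead of always opening the base-file iterator and joining the two streams.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the post-change control flow: exactly one reader is
// opened per branch, instead of always opening the base-file reader and
// merging it with the bootstrap reader.
class RecordIteratorSelector {

    static Iterator<String> openIterator(boolean hasBootstrapBaseFile,
                                         List<String> bootstrapRecords,
                                         List<String> baseRecords) {
        if (hasBootstrapBaseFile) {
            // Bootstrap path: per the commit message, the bootstrap reader's
            // records are used directly, so no join with the base file is done.
            return bootstrapRecords.iterator();
        }
        // Regular path: read the base file directly (in the projected schema).
        return baseRecords.iterator();
    }

    public static void main(String[] args) {
        Iterator<String> it = openIterator(true,
                Arrays.asList("full-record-1", "full-record-2"),
                Arrays.asList("skeleton-1", "skeleton-2"));
        // With a bootstrap base file present, only the bootstrap iterator is consumed.
        System.out.println(it.next()); // prints "full-record-1"
    }
}
```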
[GitHub] [hudi] KnightChess commented on a diff in pull request #9212: [HUDI-6541] Multiple writers should create new and different instant time to avoid marker conflict of same instant
KnightChess commented on code in PR #9212:
URL: https://github.com/apache/hudi/pull/9212#discussion_r1266922879

##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##########

@@ -862,11 +866,29 @@ public String startCommit(String actionType, HoodieTableMetaClient metaClient) {
     CleanerUtils.rollbackFailedWrites(config.getFailedWritesCleanPolicy(), HoodieTimeline.COMMIT_ACTION,
         () -> tableServiceClient.rollbackFailedWrites());
-    String instantTime = HoodieActiveTimeline.createNewInstantTime();
+    String instantTime = createCommit();

Review Comment:
   Agreed, the cost of the lock is too high. When we backfill historical partitions in a different job, it will take a lot of time to acquire it.

##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##########

@@ -862,11 +866,29 @@ public String startCommit(String actionType, HoodieTableMetaClient metaClient) {
     CleanerUtils.rollbackFailedWrites(config.getFailedWritesCleanPolicy(), HoodieTimeline.COMMIT_ACTION,
         () -> tableServiceClient.rollbackFailedWrites());
-    String instantTime = HoodieActiveTimeline.createNewInstantTime();
+    String instantTime = createCommit();
     startCommit(instantTime, actionType, metaClient);
     return instantTime;
   }
 
+  /**
+   * Creates a new commit time for a write operation (insert/update/delete/insert_overwrite/insert_overwrite_table).
+   *
+   * @return Instant time to be generated.
+   */
+  public String createCommit() {
+    if (config.getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) {
+      try {
+        lockManager.lock();
+        return HoodieActiveTimeline.createNewInstantTime();

Review Comment:
   Some other table services use this method directly, so there may be similar problems.
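The idea under discussion in this thread — serializing instant-time generation behind a lock so that two concurrent writers can never mint the same commit timestamp — can be sketched as below. This is a minimal, hypothetical sketch: the class, the in-process `ReentrantLock`, and the timestamp pattern are illustrative stand-ins, not Hudi's actual `HoodieActiveTimeline`/`lockManager` API (which uses a pluggable, typically distributed, lock provider). It also illustrates the cost the reviewers worry about: every writer must pass through the same lock.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: generate instant times under a lock and spin until the
// new timestamp is strictly greater than the last one handed out, so two
// callers can never receive the same (or an out-of-order) instant.
class InstantTimeGenerator {

    // 17-char millisecond-granularity timestamp, e.g. 20230718112057123
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");
    private static final ReentrantLock LOCK = new ReentrantLock();
    private static String lastInstant = "";

    static String createNewInstantTime() {
        LOCK.lock(); // every concurrent writer serializes here -- the cost the thread discusses
        try {
            String candidate;
            do {
                candidate = LocalDateTime.now().format(FORMAT);
            } while (candidate.compareTo(lastInstant) <= 0); // spin until strictly newer
            lastInstant = candidate;
            return candidate;
        } finally {
            LOCK.unlock();
        }
    }

    public static void main(String[] args) {
        String first = createNewInstantTime();
        String second = createNewInstantTime();
        // Even back-to-back calls in the same millisecond yield distinct instants.
        System.out.println(first.compareTo(second) < 0); // prints "true"
    }
}
```

A distributed equivalent would pay a network round trip per commit, which is why the reviewer notes that acquiring the lock for every backfill job is expensive.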
[GitHub] [hudi] hudi-bot commented on pull request #9203: [HUDI-6315] Feature flag for disabling prepped merge.
hudi-bot commented on PR #9203:
URL: https://github.com/apache/hudi/pull/9203#issuecomment-1640372277

   ## CI report:

   * 585935c37efc35994dd721ba2d8f05c9cf775470 UNKNOWN

   Bot commands

   @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build