[GitHub] [hudi] hudi-bot commented on pull request #9063: [HUDI-6448] Improve upgrade/downgrade for table ver. 6
hudi-bot commented on PR #9063: URL: https://github.com/apache/hudi/pull/9063#issuecomment-1613601746 ## CI report: * 4775dce07f2f3237b32f22b360f3423b1eafce85 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18191) * af66542fd96990611c79e90c943a18341442 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18203) * 2aafcc1737e74d9569531d5efc5faf8c5d1b33ec UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] MathurCodes1 opened a new issue, #9096: [SUPPORT] Unable to alter column name for a Hudi table.
MathurCodes1 opened a new issue, #9096: URL: https://github.com/apache/hudi/issues/9096 **Describe the problem you faced** I'm unable to alter a column name of a Hudi table. Running spark.sql("ALTER TABLE customer_db.customer RENAME COLUMN subid TO subidentifier") fails to change the column name and throws the following error: **RENAME COLUMN is only supported with v2 tables** **To Reproduce** ```
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.JavaConverters._
import scala.collection.mutable

object ReportingJob {

  var spark: SparkSession = _
  var glueContext: GlueContext = _

  def main(inputParams: Array[String]): Unit = {
    val args: Map[String, String] = GlueArgParser.getResolvedOptions(inputParams, Seq("JOB_NAME").toArray)
    val sysArgs: mutable.Map[String, String] = scala.collection.mutable.Map(args.toSeq: _*)
    implicit val glueContext: GlueContext = init(sysArgs)
    implicit val spark: SparkSession = glueContext.getSparkSession
    import spark.implicits._

    val partitionColumnName: String = "id"
    val hudiTableName: String = "Customer"
    val preCombineKey: String = "id"
    val recordKey = "id"
    val basePath = "s3://aws-amazon-uk/customer/production/"

    val df = Seq((123, "1", "seq1"), (124, "0", "seq2")).toDF("id", "subid", "subseq")

    val hudiCommonOptions: Map[String, String] = Map(
      "hoodie.table.name" -> hudiTableName,
      "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
      "hoodie.datasource.write.precombine.field" -> preCombineKey,
      "hoodie.datasource.write.recordkey.field" -> recordKey,
      "hoodie.datasource.write.operation" -> "bulk_insert",
      //"hoodie.datasource.write.operation" -> "upsert",
      "hoodie.datasource.write.row.writer.enable" -> "true",
      "hoodie.datasource.write.reconcile.schema" -> "true",
      "hoodie.datasource.write.partitionpath.field" -> partitionColumnName,
      "hoodie.datasource.write.hive_style_partitioning" -> "true",
      // "hoodie.bulkinsert.shuffle.parallelism" -> "2000",
      // "hoodie.upsert.shuffle.parallelism" -> "400",
      "hoodie.datasource.hive_sync.enable" -> "true",
      "hoodie.datasource.hive_sync.table" -> hudiTableName,
      "hoodie.datasource.hive_sync.database" -> "customer_db",
      "hoodie.datasource.hive_sync.partition_fields" -> partitionColumnName,
      "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
      "hoodie.datasource.hive_sync.use_jdbc" -> "false",
      "hoodie.combine.before.upsert" -> "true",
      "hoodie.avro.schema.external.transformation" -> "true",
      "hoodie.schema.on.read.enable" -> "true",
      "hoodie.datasource.write.schema.allow.auto.evolution.column.drop" -> "true",
      "hoodie.index.type" -> "BLOOM",
      "spark.hadoop.parquet.avro.write-old-list-structure" -> "false",
      DataSourceWriteOptions.TABLE_TYPE.key() -> "COPY_ON_WRITE"
    )

    df.write.format("org.apache.hudi")
      .options(hudiCommonOptions)
      .mode(SaveMode.Overwrite)
      .save(basePath + hudiTableName)

    spark.sql("ALTER TABLE customer_db.customer RENAME COLUMN subid TO subidentifier")

    commit()
  }

  def commit(): Unit = {
    Job.commit()
  }

  def init(sysArgs: mutable.Map[String, String]): GlueContext = {
    val conf = new SparkConf()
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
    conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
    conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
    conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
    conf.set("spark.sql.avro.datetimeRebaseModeInRead", "CORRECTED")
    val sparkContext = new SparkContext(conf)
    glueContext = new GlueContext(sparkContext)
    Job.init(sysArgs("JOB_NAME"), glueContext, sysArgs.asJava)
    glueContext
  }
}
```
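For context, Spark raises "RENAME COLUMN is only supported with v2 tables" when the session is not routed through Hudi's SQL extensions and catalog. Below is a minimal sketch of the usual remedy based on Hudi's documented Spark SQL / schema-on-read settings; whether it applies in the Glue environment above is an assumption, and the table/column names are simply the ones from this report:

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: register Hudi's SQL extensions and catalog so that
// schema-evolution DDL such as RENAME COLUMN is handled by Hudi rather
// than rejected by Spark's v1 catalog.
val spark = SparkSession.builder()
  .appName("hudi-rename-column")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
  .getOrCreate()

// Schema-on-read must be enabled for this kind of evolution.
spark.sql("set hoodie.schema.on.read.enable=true")
spark.sql("ALTER TABLE customer_db.customer RENAME COLUMN subid TO subidentifier")
```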
[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
hudi-bot commented on PR #8837: URL: https://github.com/apache/hudi/pull/8837#issuecomment-1613593548 ## CI report: * 9751b6399ebf6b629f3940d612bdfe2e2005a25f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18172) * 50a92342798b808ebe521d82b99e4622eeb77ce8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18207) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
nsivabalan commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1246951472 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -209,9 +211,10 @@ public class HoodieMetadataPayload implements HoodieRecordPayload orderingVal) { -this(Option.of(record)); + public HoodieMetadataPayload(@Nullable GenericRecord record, Comparable orderingVal) { +this(Option.ofNullable(record)); Review Comment: https://github.com/apache/hudi/blob/dc3aa399ffc4875abba7be5833ebabca222eb6ff/hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java#L292 We issue deletes to RLI using EmptyRecordPayload, which goes in as a Delete Log Block. When we deserialize this (read path), it goes here, where we try to instantiate the respective payload using reflection. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
nsivabalan commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1246949788 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option recordOpt) { Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()), Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString())); } +} else { + this.isDeletedRecord = true; Review Comment: Hey folks, here is the context. I feel we should go this route; there may also be opportunities to optimize col stats and bloom filter records. Generally, any payload should have a key and, preferably, a top-level field to denote isDeleted. So, if an entire record needs to be deleted, we can rely on the top-level isDeleted field. This is unavoidable since we write using EmptyHoodieRecordPayload in some flows (delete) but read back using the specific payload class, so every payload has to support deserializing an EmptyRecordPayload. Now, let's go into the specifics. RLI: commit1 adds key1 to the RLI partition; rolling back commit1 deletes key1 from the RLI partition. From a HoodieRecord standpoint, it's as simple as adding a new entry and then deleting the same one. It's simpler, and our getInsertValue or combineAndGetUpdateValue will be fast. If we push isDeleted into HoodieRecordIndexInfo, then we need to explicitly set the type, parse the HoodieRecordIndexInfo data, and only then deduce that it's deleted. Again, with EmptyRecordPayload this is not even doable, so we have to go with this approach. Why did we not have this issue before? With FILES, the keys are partitions, and hence, except for delete_partition, no FILES records are deleted in their entirety. With col stats, a delete, while writing to the MDT partition, is yet another upsert record with isDeleted set within the ColumnStats metadata, so our getInsertValue or combineAndGetUpdateValue needs to deserialize the entire record and then deduce that it's deleted. The right fix there would also be to do what we are doing with RLI in this patch: in commit1, add col1_part1_file1 : value to the MDT; in some later commit X, when file1 is deleted, just delete col1_part1_file1 from the col stats partition in the MDT using EmptyRecordPayload. Then log record reading and compaction will be fast. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
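To make the delete-handling idea above concrete, here is a small, hypothetical sketch (not the actual HoodieMetadataPayload code; all names are illustrative): a payload constructed from a nullable record marks itself deleted when the record is absent, which is exactly what deserializing an EmptyRecordPayload produces, and no inner metadata needs to be parsed to detect the delete.

```scala
// Hypothetical simplification of the pattern discussed above.
case class MetadataEntry(key: String, payload: Map[String, String])

class SimplifiedMetadataPayload(recordOpt: Option[MetadataEntry]) {

  // Secondary constructor mirroring Option.ofNullable(record): a null record means "delete".
  def this(nullableRecord: MetadataEntry) = this(Option(nullableRecord))

  // Top-level delete flag: the key is known to be gone without parsing the inner metadata.
  val isDeletedRecord: Boolean = recordOpt.isEmpty

  // Merge semantics: a delete wins over whatever was stored before.
  def combine(previous: SimplifiedMetadataPayload): SimplifiedMetadataPayload =
    if (isDeletedRecord) this else new SimplifiedMetadataPayload(recordOpt)
}

object SimplifiedMetadataPayloadExample extends App {
  val inserted = new SimplifiedMetadataPayload(MetadataEntry("key1", Map("fileId" -> "f1")))
  val deleted  = new SimplifiedMetadataPayload(null.asInstanceOf[MetadataEntry])
  println(inserted.isDeletedRecord) // false
  println(deleted.isDeletedRecord)  // true
}
```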
[jira] [Updated] (HUDI-6426) Upgrade Spark 3.4.1
[ https://issues.apache.org/jira/browse/HUDI-6426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Udit Mehrotra updated HUDI-6426: Fix Version/s: 0.14.0 Priority: Blocker (was: Major) > Upgrade Spark 3.4.1 > --- > > Key: HUDI-6426 > URL: https://issues.apache.org/jira/browse/HUDI-6426 > Project: Apache Hudi > Issue Type: Task >Reporter: Rahil Chertara >Priority: Blocker > Fix For: 0.14.0 > > > Spark 3.4.1 rc1 is out [https://github.com/apache/spark/tree/v3.4.1-rc1] we > should start the upgrade process for this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #8837: [HUDI-6153] Changed the rollback mechanism for MDT to actual rollbacks rather than appending revert blocks.
hudi-bot commented on PR #8837: URL: https://github.com/apache/hudi/pull/8837#issuecomment-1613571050 ## CI report: * 9751b6399ebf6b629f3940d612bdfe2e2005a25f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18172) * 50a92342798b808ebe521d82b99e4622eeb77ce8 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8609: [HUDI-6154] Introduced retry while reading hoodie.properties to deal with parallel updates.
hudi-bot commented on PR #8609: URL: https://github.com/apache/hudi/pull/8609#issuecomment-1613560816 ## CI report: * a64034d612fa64c99dd8d319ac00680924773f53 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18197) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] splate commented on pull request #3391: [HUDI-83] Fix Timestamp/Date type read by Hive3
splate commented on PR #3391: URL: https://github.com/apache/hudi/pull/3391#issuecomment-1613554365 Would this bug also exist in the Spark Hudi libraries used in AWS Glue? My issue is that I am trying to use Spark SQL to query a Hudi table and put the result into a Spark DataFrame, and I am getting a casting exception ("java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable"). Could that be related to this issue? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9066: [HUDI-6452] Add MOR snapshot reader to integrate with query engines without using Hadoop APIs
hudi-bot commented on PR #9066: URL: https://github.com/apache/hudi/pull/9066#issuecomment-1613551942 ## CI report: * 60c1b8c5885fdda28e07f3ba79290f01dc60a9c4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18196) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated (05435bb0344 -> dc3aa399ffc)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 05435bb0344 [MINOR] Increase timeout for Azure CI: UT spark-datasource to 240 minutes (#9089) add dc3aa399ffc [HUDI-6393] Enable MOR support for Record index with functional test cases (#9017) No new revisions were added by this update. Summary of changes: .../metadata/HoodieBackedTableMetadataWriter.java | 5 - .../hudi/metadata/HoodieBackedTableMetadata.java | 4 + .../hudi/functional/TestRecordLevelIndex.scala | 608 + .../org/apache/hudi/util/JavaConversions.scala | 23 +- 4 files changed, 625 insertions(+), 15 deletions(-) create mode 100644 hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestRecordLevelIndex.scala copy hudi-utilities/src/main/java/org/apache/hudi/utilities/exception/HoodieIncrementalPullException.java => hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/util/JavaConversions.scala (65%)
[GitHub] [hudi] xushiyan merged pull request #9017: [HUDI-6393] Enable MOR support for Record index with functional test cases
xushiyan merged PR #9017: URL: https://github.com/apache/hudi/pull/9017 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on pull request #9017: [HUDI-6393] Enable MOR support for Record index with functional test cases
xushiyan commented on PR #9017: URL: https://github.com/apache/hudi/pull/9017#issuecomment-1613510452 CI is timing out as expected. The newly added test case is passing. Will land this now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan merged pull request #9089: [MINOR] Increase timeout for Azure CI: UT spark-datasource to 240 minutes
nsivabalan merged PR #9089: URL: https://github.com/apache/hudi/pull/9089 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated (8def3e68ae5 -> 05435bb0344)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 8def3e68ae5 [MINOR] Improve CollectionUtils helper methods (#9088) add 05435bb0344 [MINOR] Increase timeout for Azure CI: UT spark-datasource to 240 minutes (#9089) No new revisions were added by this update. Summary of changes: azure-pipelines-20230430.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[GitHub] [hudi] hudi-bot commented on pull request #9083: PKLess Merge Into
hudi-bot commented on PR #9083: URL: https://github.com/apache/hudi/pull/9083#issuecomment-1613500714 ## CI report: * be6801e9ca41f00576a511c7d3ffe144e90717ee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18179) * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN * 767eb9cc26d98ed8e64632f98ab688aa4145e5aa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18204) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
hudi-bot commented on PR #9058: URL: https://github.com/apache/hudi/pull/9058#issuecomment-1613500487 ## CI report: * 1697d1bfa095ca16a9361e3728a77331d3a28037 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18195) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9083: PKLess Merge Into
hudi-bot commented on PR #9083: URL: https://github.com/apache/hudi/pull/9083#issuecomment-1613490095 ## CI report: * be6801e9ca41f00576a511c7d3ffe144e90717ee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18179) * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN * 767eb9cc26d98ed8e64632f98ab688aa4145e5aa UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9063: [HUDI-6448] Improve upgrade/downgrade for table ver. 6
hudi-bot commented on PR #9063: URL: https://github.com/apache/hudi/pull/9063#issuecomment-1613489865 ## CI report: * 4775dce07f2f3237b32f22b360f3423b1eafce85 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18191) * af66542fd96990611c79e90c943a18341442 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18203) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9063: [HUDI-6448] Improve upgrade/downgrade for table ver. 6
hudi-bot commented on PR #9063: URL: https://github.com/apache/hudi/pull/9063#issuecomment-1613476228 ## CI report: * 4775dce07f2f3237b32f22b360f3423b1eafce85 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18191) * af66542fd96990611c79e90c943a18341442 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9083: PKLess Merge Into
hudi-bot commented on PR #9083: URL: https://github.com/apache/hudi/pull/9083#issuecomment-1613476471 ## CI report: * be6801e9ca41f00576a511c7d3ffe144e90717ee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18179) * 3a0bfb88049cf2c0f8afe5c925dbd76fa6f7cd89 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9017: [HUDI-6393] Enable MOR support for Record index with functional test cases
hudi-bot commented on PR #9017: URL: https://github.com/apache/hudi/pull/9017#issuecomment-1613475988 ## CI report: * ceffe7d8146f48e1c6c083613646463c1404a77f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18194) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] BBency opened a new issue, #9094: Async Clustering failing with errors for MOR table
BBency opened a new issue, #9094: URL: https://github.com/apache/hudi/issues/9094 **Problem Description** We have a MOR table which is partitioned by yearmonth(MM). We would like to trigger async clustering after doing the compaction at the end of the day, so that we can stitch small files together into larger files. Async clustering for the table is failing. Below are the different approaches I tried and the error messages I got. **Hudi Config Used** ```
"hoodie.table.name" -> hudiTableName,
"hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.precombine.field" -> preCombineKey,
"hoodie.datasource.write.recordkey.field" -> recordKey,
"hoodie.datasource.write.operation" -> writeOperation,
"hoodie.datasource.write.row.writer.enable" -> "true",
"hoodie.datasource.write.reconcile.schema" -> "true",
"hoodie.datasource.write.partitionpath.field" -> partitionColumnName,
"hoodie.datasource.write.hive_style_partitioning" -> "true",
"hoodie.bulkinsert.sort.mode" -> "GLOBAL_SORT",
"hoodie.datasource.hive_sync.enable" -> "true",
"hoodie.datasource.hive_sync.table" -> hudiTableName,
"hoodie.datasource.hive_sync.database" -> databaseName,
"hoodie.datasource.hive_sync.partition_fields" -> partitionColumnName,
"hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.use_jdbc" -> "false",
"hoodie.combine.before.upsert" -> "true",
"hoodie.index.type" -> "BLOOM",
"spark.hadoop.parquet.avro.write-old-list-structure" -> "false",
"hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
"hoodie.compact.inline" -> "false",
"hoodie.compact.schedule.inline" -> "true",
"hoodie.compact.inline.trigger.strategy" -> "NUM_COMMITS",
"hoodie.compact.inline.max.delta.commits" -> "5",
"hoodie.cleaner.policy" -> "KEEP_LATEST_COMMITS",
"hoodie.cleaner.commits.retained" -> "3",
"hoodie.clustering.async.enabled" -> "true",
"hoodie.clustering.async.max.commits" -> "2",
"hoodie.clustering.execution.strategy.class" -> "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
"hoodie.clustering.plan.strategy.sort.columns" -> recordKey,
"hoodie.clustering.plan.strategy.small.file.limit" -> "67108864",
"hoodie.clustering.plan.strategy.target.file.max.bytes" -> "134217728",
"hoodie.clustering.plan.strategy.max.bytes.per.group" -> "2147483648",
"hoodie.clustering.plan.strategy.max.num.groups" -> "150",
"hoodie.clustering.preserve.commit.metadata" -> "true"
``` **Approaches Tried** 1. Triggered a clustering job with running mode scheduleAndExecute. **Code Used** ```
val hudiClusterConfig = new HoodieClusteringJob.Config
hudiClusterConfig.basePath =
hudiClusterConfig.tableName =
hudiClusterConfig.runningMode = "scheduleAndExecute"
hudiClusterConfig.retryLastFailedClusteringJob = true

val configList: util.List[String] = new util.ArrayList()
configList.add("hoodie.clustering.async.enabled=true")
configList.add("hoodie.clustering.async.max.commits=2")
configList.add("hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy")
configList.add("hoodie.clustering.plan.strategy.sort.columns=")
configList.add("hoodie.clustering.plan.strategy.small.file.limit=67108864")
configList.add("hoodie.clustering.plan.strategy.target.file.max.bytes=134217728")
configList.add("hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648")
configList.add("hoodie.clustering.plan.strategy.max.num.groups=150")
configList.add("hoodie.clustering.preserve.commit.metadata=true")
hudiClusterConfig.configs = configList

val hudiClusterJob = new HoodieClusteringJob(jsc, hudiClusterConfig)
val clusterStatus = hudiClusterJob.cluster(1)
println(clusterStatus)
``` **Stacktrace** ShuffleMapStage 87 (sortBy at RDDCustomColumnsSortPartitioner.java:64) failed in 1.098 s due to Job aborted due to stage failure: task 0.0 in stage 28.0 (TID 367) had a not serializable result: org.apache.avro.generic.GenericData$Record Serialization stack: - object not serializable (class: org.apache.avro.generic.GenericData$Record, value: 2. Used the procedure run_clustering to schedule and trigger clustering. We found that the replacecommit created through the procedure run had less data than the one created when clustering was scheduled from the code in approach 1. **Code Used** ```
query_run_clustering = f"call run_clustering(path => '{path}')"
spark_df_run_clustering = spark.sql(query_run_clustering)
spark_df_run_clustering.show()
``` **Stacktrace** An error
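The "had a not serializable result: org.apache.avro.generic.GenericData$Record" failure above is the kind of error that typically appears when the clustering job runs with Spark's default Java serializer. A hedged sketch of launching the standalone clustering job with Kryo enabled follows; the class names match the snippet above, but whether this resolves the reporter's case is an assumption.

```scala
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: build the SparkContext used by HoodieClusteringJob with Kryo,
// since GenericData$Record is not Java-serializable.
val conf = new SparkConf()
  .setAppName("hoodie-clustering")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val jsc = new JavaSparkContext(new SparkContext(conf))
// ...then construct HoodieClusteringJob(jsc, hudiClusterConfig) exactly as in the snippet above.
```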
[GitHub] [hudi] xushiyan commented on pull request #9063: [HUDI-6448] Improve upgrade/downgrade for table ver. 6
xushiyan commented on PR #9063: URL: https://github.com/apache/hudi/pull/9063#issuecomment-1613418110 Manually verified the flow 0.13.1 -> 0.14.0-SNAPSHOT (this PR).

Before upgrade:
```
hoodie.table.version=5
hoodie.table.metadata.partitions=files
```
Upgrade:
```
./hudi-cli.sh
connect --path /tmp/hudi_trips_13_1_to_14_0_COPY_ON_WRITE
upgrade table --toVersion 6 --sparkMaster 'local[2]'
```
After upgrade:
```
hoodie.table.version=6
hoodie.table.metadata.partitions=files
```
Write data with RLI enabled:
```
hoodie.table.version=6
hoodie.table.metadata.partitions=files,record_index
```
RLI partition and hfiles created.

Downgrade:
```
downgrade table --toVersion 5 --sparkMaster 'local[2]'
```
After downgrade:
```
hoodie.table.version=5
hoodie.table.metadata.partitions=files
```
RLI partition is removed:
```
➜ ll /tmp/hudi_trips_13_1_to_14_0_COPY_ON_WRITE/.hoodie/metadata/record_index
ls: /tmp/hudi_trips_13_1_to_14_0_COPY_ON_WRITE/.hoodie/metadata/record_index: No such file or directory
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
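For reference, the "write data with RLI enabled" step above corresponds to a write of roughly this shape. The option keys and field names are assumptions based on the 0.14.0-SNAPSHOT configs and the quickstart-style table used in the verification; check them against the build under test.

```scala
// Hedged sketch (assumed option names): append a batch with the record-level index enabled,
// which is what populates the record_index partition under .hoodie/metadata.
df.write.format("hudi")
  .option("hoodie.table.name", "hudi_trips_13_1_to_14_0_COPY_ON_WRITE")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.metadata.record.index.enable", "true")
  .option("hoodie.index.type", "RECORD_INDEX")
  .mode("append")
  .save("/tmp/hudi_trips_13_1_to_14_0_COPY_ON_WRITE")
```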
[GitHub] [hudi] nsivabalan commented on issue #9079: [SUPPORT] Hudi delete not working when using UuidKeyGenerator
nsivabalan commented on issue #9079: URL: https://github.com/apache/hudi/issues/9079#issuecomment-1613408548 This is a known limitation of the UUID key generator. This key gen is generally meant to be used only for immutable data. With 0.14.0, we are adding pk-less (primary-key-less) tables, where you can use Spark SQL DELETE to delete records. But this is coming in 0.14.0, and we don't have any such support in prior versions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
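As a rough illustration of what that 0.14.0 workflow would look like (the table name and predicate are made up, and the pk-less DELETE support described above is the assumption here):

```scala
// Hedged sketch: with a pk-less table in 0.14.0, deletes are expressed as plain Spark SQL.
// Requires Hudi's SQL extensions to be registered on the SparkSession.
spark.sql("DELETE FROM events WHERE event_id = 'e-123'")

// Prior to 0.14.0, with UuidKeyGenerator the generated record keys are not reproducible
// from the incoming data, so there is no way to target the same record again for deletion.
```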
[GitHub] [hudi] noahtaite commented on issue #9067: [SUPPORT] Manual Glue sync for large, highly partitioned table failing
noahtaite commented on issue #9067: URL: https://github.com/apache/hudi/issues/9067#issuecomment-1613377080 Hello @danny0405 @ad1happy2go I can confirm 0.13.1 works nicely as the HMS sync mode now supports batching and boolean values (conditional sync). thank you for the support -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] noahtaite closed issue #9067: [SUPPORT] Manual Glue sync for large, highly partitioned table failing
noahtaite closed issue #9067: [SUPPORT] Manual Glue sync for large, highly partitioned table failing URL: https://github.com/apache/hudi/issues/9067 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] gamblewin opened a new issue, #9093: [SUPPORT] Is it allowed using Flink Table API sqlQuery() to read data from hudi tables?
gamblewin opened a new issue, #9093: URL: https://github.com/apache/hudi/issues/9093 **Describe the problem you faced** I'm trying to use the Flink Table API sqlQuery to read data from a Hudi table, but it is not working. Am I doing it wrong, or does Hudi not support querying data this way? **Code** ```java
sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
sTableEnv = StreamTableEnvironment.create(sEnv);
sEnv.setParallelism(1);
sEnv.enableCheckpointing(3000);

// create table
String createTableSql = "create table dept(\n" +
    " dept_id BIGINT PRIMARY KEY NOT ENFORCED,\n" +
    " dept_name varchar(10),\n" +
    " ts timestamp(3)\n" +
    ")\n" +
    "with (\n" +
    " 'connector' = 'hudi',\n" +
    " 'path' = 'hdfs://localhost:9000/hudi/dept',\n" +
    " 'table.type' = 'MERGE_ON_READ'\n" +
    ")";
sTableEnv.executeSql(createTableSql);

// insert data
sTableEnv.executeSql("insert into dept values (1, 'a', NOW()), (2, 'b', NOW())");

// query data
Table table = sTableEnv.sqlQuery("select * from dept");
DataStream dataStream = sTableEnv.toDataStream(table);
// there's nothing to print
dataStream.print();
``` **Environment Description** * Hudi version : 1.12.0 * Hadoop version : 3.1.3 * Flink version: 1.13.6 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
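One thing that often explains empty output in code like the above (a hedged guess, not a confirmed diagnosis): `executeSql` submits the INSERT asynchronously, and a DataStream sink never runs without an explicit `execute()`. A sketch of the same flow with those two calls added, continuing the Java snippet above (reusing its `sEnv`/`sTableEnv` variables and assuming Flink 1.13 APIs):

```java
// Assumes the enclosing main() declares `throws Exception`.
// Wait for the async INSERT job to finish before querying the table.
TableResult insertResult = sTableEnv.executeSql(
    "insert into dept values (1, 'a', NOW()), (2, 'b', NOW())");
insertResult.await();

// Bounded snapshot read of the Hudi table via sqlQuery.
Table table = sTableEnv.sqlQuery("select * from dept");
DataStream<Row> dataStream = sTableEnv.toDataStream(table);
dataStream.print();

// The print() sink only produces output once the job is actually executed.
sEnv.execute("query-dept");
```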
[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant
hudi-bot commented on PR #9038: URL: https://github.com/apache/hudi/pull/9038#issuecomment-1613357939 ## CI report: * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN * 34f8823f48712c57058bc37c8936a276c1457557 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18187) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18193) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18188) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] ad1happy2go commented on issue #9086: [SUPPORT]How to build with scala 2.11 for spark and scala2.12 for flink
ad1happy2go commented on issue #9086: URL: https://github.com/apache/hudi/issues/9086#issuecomment-1613339717 @bigdata-spec I don't think we can build with different Scala versions in a single build. You may need to build it twice and then use the Spark and Flink jars from the separate artifacts. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
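Building twice might look roughly like this. This is only a sketch: the exact profile/property names depend on the Hudi version's pom.xml and are an assumption here.

```
# Build the Spark bundle against Scala 2.11, then the Flink bundle against Scala 2.12.
mvn clean package -DskipTests -Dscala-2.11
mvn clean package -DskipTests -Dscala-2.12
# Pick hudi-spark-bundle from the first build and hudi-flink-bundle from the second.
```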
[GitHub] [hudi] ad1happy2go commented on issue #9091: [BUG] Use NonpartitionedKeyGenerator WriteOperationType BULK_INSERT and UPSERT get different _hoodie_record_key format
ad1happy2go commented on issue #9091: URL: https://github.com/apache/hudi/issues/9091#issuecomment-1613328306 @lipusheng This is a known issue which was fixed in Hudi 0.13.x. Refer to this GitHub issue - https://github.com/apache/hudi/issues/8981 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9082: [HUDI-6445] Distribute spark ds func tests
hudi-bot commented on PR #9082: URL: https://github.com/apache/hudi/pull/9082#issuecomment-1613268792 ## CI report: * c529c624afdca331514a2bdfb78cc6e18ab9f57a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18185) * 474ce7e9a78909fe90b0641f7be1b059084bb11a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18202) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9082: [HUDI-6445] Distribute spark ds func tests
hudi-bot commented on PR #9082: URL: https://github.com/apache/hudi/pull/9082#issuecomment-1613254158 ## CI report: * c529c624afdca331514a2bdfb78cc6e18ab9f57a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18185) * 474ce7e9a78909fe90b0641f7be1b059084bb11a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9041: [HUDI-6431] Support update partition path in record-level index
hudi-bot commented on PR #9041: URL: https://github.com/apache/hudi/pull/9041#issuecomment-1613253794 ## CI report: * b681df04a7ad0febbcd9235622c2ee7f98759cf9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18082) * 2f139383c54f93669342539af77dca9b3a352be3 UNKNOWN * a1458e17e5749a89948be8f60387eeecd4c0f87c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18201) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] KenjiFujima commented on pull request #8933: [HUDI-5329] Spark reads table error when Flink creates table without record key and primary key
KenjiFujima commented on PR #8933: URL: https://github.com/apache/hudi/pull/8933#issuecomment-1613251280 @danny0405, I have addressed the above comments. PTAL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on a diff in pull request #9041: [HUDI-6431] Support update partition path in record-level index
xushiyan commented on code in PR #9041: URL: https://github.com/apache/hudi/pull/9041#discussion_r1246658153 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -459,11 +459,6 @@ private Pair> initializeRecordIndexPartition() final HoodieMetadataFileSystemView fsView = new HoodieMetadataFileSystemView(dataMetaClient, dataMetaClient.getActiveTimeline(), metadata); -// MOR tables are not supported -if (!dataMetaClient.getTableType().equals(HoodieTableType.COPY_ON_WRITE)) { - throw new HoodieMetadataException("Only COW tables are supported with record index"); -} - Review Comment: this change will be included in functional test PR (which should be merged first). i include it here for CI to pass. when merging, this diff should be auto-resolved. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on a diff in pull request #9041: [HUDI-6431] Support update partition path in record-level index
xushiyan commented on code in PR #9041: URL: https://github.com/apache/hudi/pull/9041#discussion_r1246655960 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java: ## @@ -310,6 +312,56 @@ public static HoodieData> mergeForPartitionUpdates( return Arrays.asList(deleteRecord, getTaggedRecord(merged, Option.empty())).iterator(); } }); -return taggedUpdatingRecords.union(newRecords); +return taggedUpdatingRecords.union(taggedNewRecords); + } + + public static HoodieData> tagGlobalLocationBackToRecords( + HoodieData> incomingRecords, + HoodiePairData keyAndExistingLocations, + boolean mayContainDuplicateLookup, + boolean shouldUpdatePartitionPath, + HoodieWriteConfig config, + HoodieTable table) { +final HoodieRecordMerger merger = config.getRecordMerger(); + +HoodiePairData> keyAndIncomingRecords = +incomingRecords.mapToPair(record -> Pair.of(record.getRecordKey(), record)); + +// Pair of incoming record and the global location if meant for merged lookup in later stage +HoodieData, Option>> incomingRecordsAndLocations += keyAndIncomingRecords.leftOuterJoin(keyAndExistingLocations).values() +.map(v -> { + final HoodieRecord incomingRecord = v.getLeft(); + Option currentLocOpt = Option.ofNullable(v.getRight().orElse(null)); + if (currentLocOpt.isPresent()) { +HoodieRecordGlobalLocation currentLoc = currentLocOpt.get(); +boolean shouldPerformMergedLookUp = mayContainDuplicateLookup +|| !Objects.equals(incomingRecord.getPartitionPath(), currentLoc.getPartitionPath()); +if (shouldUpdatePartitionPath && shouldPerformMergedLookUp) { + return Pair.of(incomingRecord, currentLocOpt); +} else { + // - When update partition path is set to false, + // the incoming record will be tagged to the existing record's partition regardless of being equal or not. + // - When update partition path is set to true, + // the incoming record will be tagged to the existing record's partition + // when partition is not updated and the look-up won't have duplicates (e.g. COW, or using RLI). + return Pair.of((HoodieRecord) getTaggedRecord( Review Comment: this was having Option.empty() as right of the pair and it won't be merged-lookup candidates -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codope commented on a diff in pull request #9041: [HUDI-6431] Support update partition path in record-level index
codope commented on code in PR #9041: URL: https://github.com/apache/hudi/pull/9041#discussion_r1246655286 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java: ## @@ -310,6 +312,56 @@ public static HoodieData> mergeForPartitionUpdates( return Arrays.asList(deleteRecord, getTaggedRecord(merged, Option.empty())).iterator(); } }); -return taggedUpdatingRecords.union(newRecords); +return taggedUpdatingRecords.union(taggedNewRecords); + } + + public static HoodieData> tagGlobalLocationBackToRecords( + HoodieData> incomingRecords, + HoodiePairData keyAndExistingLocations, + boolean mayContainDuplicateLookup, + boolean shouldUpdatePartitionPath, + HoodieWriteConfig config, + HoodieTable table) { +final HoodieRecordMerger merger = config.getRecordMerger(); + +HoodiePairData> keyAndIncomingRecords = +incomingRecords.mapToPair(record -> Pair.of(record.getRecordKey(), record)); + +// Pair of incoming record and the global location if meant for merged lookup in later stage +HoodieData, Option>> incomingRecordsAndLocations += keyAndIncomingRecords.leftOuterJoin(keyAndExistingLocations).values() +.map(v -> { + final HoodieRecord incomingRecord = v.getLeft(); + Option currentLocOpt = Option.ofNullable(v.getRight().orElse(null)); + if (currentLocOpt.isPresent()) { +HoodieRecordGlobalLocation currentLoc = currentLocOpt.get(); +boolean shouldPerformMergedLookUp = mayContainDuplicateLookup +|| !Objects.equals(incomingRecord.getPartitionPath(), currentLoc.getPartitionPath()); +if (shouldUpdatePartitionPath && shouldPerformMergedLookUp) { + return Pair.of(incomingRecord, currentLocOpt); +} else { + // - When update partition path is set to false, + // the incoming record will be tagged to the existing record's partition regardless of being equal or not. + // - When update partition path is set to true, + // the incoming record will be tagged to the existing record's partition + // when partition is not updated and the look-up won't have duplicates (e.g. COW, or using RLI). + return Pair.of((HoodieRecord) getTaggedRecord( + createNewHoodieRecord(incomingRecord, currentLoc, merger), Option.of(currentLoc)), + Option.empty()); +} + } else { +return Pair.of(getTaggedRecord(incomingRecord, Option.empty()), Option.empty()); + } +}); +return shouldUpdatePartitionPath +? mergeForPartitionUpdatesIfNeeded(incomingRecordsAndLocations, config, table) Review Comment: yeah we need to consider duplicates, otherwise we'll have to special-case for RLI. ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ## @@ -459,11 +459,6 @@ private Pair> initializeRecordIndexPartition() final HoodieMetadataFileSystemView fsView = new HoodieMetadataFileSystemView(dataMetaClient, dataMetaClient.getActiveTimeline(), metadata); -// MOR tables are not supported -if (!dataMetaClient.getTableType().equals(HoodieTableType.COPY_ON_WRITE)) { - throw new HoodieMetadataException("Only COW tables are supported with record index"); -} - Review Comment: Would prefer to land it in a separate commit. I guess #9017 will land earlier anyway. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on a diff in pull request #9041: [HUDI-6431] Support update partition path in record-level index
xushiyan commented on code in PR #9041: URL: https://github.com/apache/hudi/pull/9041#discussion_r1241059669 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java: ## @@ -310,6 +312,56 @@ public static HoodieData> mergeForPartitionUpdates( return Arrays.asList(deleteRecord, getTaggedRecord(merged, Option.empty())).iterator(); } }); -return taggedUpdatingRecords.union(newRecords); +return taggedUpdatingRecords.union(taggedNewRecords); + } + + public static HoodieData> tagGlobalLocationBackToRecords( + HoodieData> incomingRecords, + HoodiePairData keyAndExistingLocations, + boolean mayContainDuplicateLookup, + boolean shouldUpdatePartitionPath, + HoodieWriteConfig config, + HoodieTable table) { +final HoodieRecordMerger merger = config.getRecordMerger(); + +HoodiePairData> keyAndIncomingRecords = +incomingRecords.mapToPair(record -> Pair.of(record.getRecordKey(), record)); + +// Pair of incoming record and the global location if meant for merged lookup in later stage +HoodieData, Option>> incomingRecordsAndLocations += keyAndIncomingRecords.leftOuterJoin(keyAndExistingLocations).values() +.map(v -> { + final HoodieRecord incomingRecord = v.getLeft(); + Option currentLocOpt = Option.ofNullable(v.getRight().orElse(null)); + if (currentLocOpt.isPresent()) { +HoodieRecordGlobalLocation currentLoc = currentLocOpt.get(); +boolean shouldPerformMergedLookUp = mayContainDuplicateLookup +|| !Objects.equals(incomingRecord.getPartitionPath(), currentLoc.getPartitionPath()); +if (shouldUpdatePartitionPath && shouldPerformMergedLookUp) { + return Pair.of(incomingRecord, currentLocOpt); +} else { + // - When update partition path is set to false, + // the incoming record will be tagged to the existing record's partition regardless of being equal or not. + // - When update partition path is set to true, + // the incoming record will be tagged to the existing record's partition + // when partition is not updated and the look-up won't have duplicates (e.g. COW, or using RLI). + return Pair.of((HoodieRecord) getTaggedRecord( Review Comment: new record creation needs optimization; i have not finished it yet. 
## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java: ## @@ -310,6 +312,56 @@ public static HoodieData> mergeForPartitionUpdates( return Arrays.asList(deleteRecord, getTaggedRecord(merged, Option.empty())).iterator(); } }); -return taggedUpdatingRecords.union(newRecords); +return taggedUpdatingRecords.union(taggedNewRecords); + } + + public static HoodieData> tagGlobalLocationBackToRecords( + HoodieData> incomingRecords, + HoodiePairData keyAndExistingLocations, + boolean mayContainDuplicateLookup, + boolean shouldUpdatePartitionPath, + HoodieWriteConfig config, + HoodieTable table) { +final HoodieRecordMerger merger = config.getRecordMerger(); + +HoodiePairData> keyAndIncomingRecords = +incomingRecords.mapToPair(record -> Pair.of(record.getRecordKey(), record)); + +// Pair of incoming record and the global location if meant for merged lookup in later stage +HoodieData, Option>> incomingRecordsAndLocations += keyAndIncomingRecords.leftOuterJoin(keyAndExistingLocations).values() +.map(v -> { + final HoodieRecord incomingRecord = v.getLeft(); + Option currentLocOpt = Option.ofNullable(v.getRight().orElse(null)); + if (currentLocOpt.isPresent()) { +HoodieRecordGlobalLocation currentLoc = currentLocOpt.get(); +boolean shouldPerformMergedLookUp = mayContainDuplicateLookup +|| !Objects.equals(incomingRecord.getPartitionPath(), currentLoc.getPartitionPath()); +if (shouldUpdatePartitionPath && shouldPerformMergedLookUp) { + return Pair.of(incomingRecord, currentLocOpt); +} else { + // - When update partition path is set to false, + // the incoming record will be tagged to the existing record's partition regardless of being equal or not. + // - When update partition path is set to true, + // the incoming record will be tagged to the existing record's partition + // when partition is not updated and the look-up won't have duplicates (e.g. COW, or using RLI). + return Pair.of((HoodieRecord) getTaggedRecord( Review Comment: refactored -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific commen
[GitHub] [hudi] xushiyan commented on a diff in pull request #9041: [HUDI-6431] Support update partition path in record-level index
xushiyan commented on code in PR #9041: URL: https://github.com/apache/hudi/pull/9041#discussion_r1241059710 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java: ## @@ -310,6 +312,56 @@ public static HoodieData> mergeForPartitionUpdates( return Arrays.asList(deleteRecord, getTaggedRecord(merged, Option.empty())).iterator(); } }); -return taggedUpdatingRecords.union(newRecords); +return taggedUpdatingRecords.union(taggedNewRecords); + } + + public static HoodieData> tagGlobalLocationBackToRecords( + HoodieData> incomingRecords, + HoodiePairData keyAndExistingLocations, + boolean mayContainDuplicateLookup, + boolean shouldUpdatePartitionPath, + HoodieWriteConfig config, + HoodieTable table) { +final HoodieRecordMerger merger = config.getRecordMerger(); + +HoodiePairData> keyAndIncomingRecords = +incomingRecords.mapToPair(record -> Pair.of(record.getRecordKey(), record)); + +// Pair of incoming record and the global location if meant for merged lookup in later stage +HoodieData, Option>> incomingRecordsAndLocations += keyAndIncomingRecords.leftOuterJoin(keyAndExistingLocations).values() +.map(v -> { + final HoodieRecord incomingRecord = v.getLeft(); + Option currentLocOpt = Option.ofNullable(v.getRight().orElse(null)); + if (currentLocOpt.isPresent()) { +HoodieRecordGlobalLocation currentLoc = currentLocOpt.get(); +boolean shouldPerformMergedLookUp = mayContainDuplicateLookup +|| !Objects.equals(incomingRecord.getPartitionPath(), currentLoc.getPartitionPath()); +if (shouldUpdatePartitionPath && shouldPerformMergedLookUp) { + return Pair.of(incomingRecord, currentLocOpt); +} else { + // - When update partition path is set to false, + // the incoming record will be tagged to the existing record's partition regardless of being equal or not. + // - When update partition path is set to true, + // the incoming record will be tagged to the existing record's partition + // when partition is not updated and the look-up won't have duplicates (e.g. COW, or using RLI). + return Pair.of((HoodieRecord) getTaggedRecord( + createNewHoodieRecord(incomingRecord, currentLoc, merger), Option.of(currentLoc)), + Option.empty()); +} + } else { +return Pair.of(getTaggedRecord(incomingRecord, Option.empty()), Option.empty()); + } +}); +return shouldUpdatePartitionPath +? mergeForPartitionUpdatesIfNeeded(incomingRecordsAndLocations, config, table) +: incomingRecordsAndLocations.map(Pair::getLeft); + } + + public static HoodieRecord createNewHoodieRecord(HoodieRecord oldRecord, HoodieRecordGlobalLocation location, HoodieRecordMerger merger) { +HoodieKey recordKey = new HoodieKey(oldRecord.getRecordKey(), location.getPartitionPath()); +return merger.getRecordType() == HoodieRecordType.AVRO Review Comment: new record creation needs optimization; i have not finished it yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codope commented on a diff in pull request #9041: [HUDI-6431] Support update partition path in record-level index
codope commented on code in PR #9041: URL: https://github.com/apache/hudi/pull/9041#discussion_r1246648546 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/simple/HoodieGlobalSimpleIndex.java: ## @@ -72,85 +68,37 @@ public HoodieData> tagLocation( protected HoodieData> tagLocationInternal( HoodieData> inputRecords, HoodieEngineContext context, HoodieTable hoodieTable) { - -HoodiePairData> keyedInputRecords = -inputRecords.mapToPair(entry -> new ImmutablePair<>(entry.getRecordKey(), entry)); -HoodiePairData allRecordLocationsInTable = -fetchAllRecordLocations(context, hoodieTable, config.getGlobalSimpleIndexParallelism()); -return getTaggedRecords(keyedInputRecords, allRecordLocationsInTable, hoodieTable); +List> latestBaseFiles = getAllBaseFilesInTable(context, hoodieTable); Review Comment: +1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9063: [HUDI-6448] Improve upgrade/downgrade for table ver. 6
hudi-bot commented on PR #9063: URL: https://github.com/apache/hudi/pull/9063#issuecomment-1613184716 ## CI report: * 4775dce07f2f3237b32f22b360f3423b1eafce85 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18191) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9041: [HUDI-6431] Support update partition path in record-level index
hudi-bot commented on PR #9041: URL: https://github.com/apache/hudi/pull/9041#issuecomment-1613184534 ## CI report: * b681df04a7ad0febbcd9235622c2ee7f98759cf9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18082) * 2f139383c54f93669342539af77dca9b3a352be3 UNKNOWN * a1458e17e5749a89948be8f60387eeecd4c0f87c UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6459) Add Rollback test for Record Level Index
[ https://issues.apache.org/jira/browse/HUDI-6459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lokesh Jain updated HUDI-6459: -- Summary: Add Rollback test for Record Level Index (was: Add Rollback validation for Record Level Index) > Add Rollback test for Record Level Index > > > Key: HUDI-6459 > URL: https://issues.apache.org/jira/browse/HUDI-6459 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > > The Jira aims to add validation for rollback with record level index. The > validation is added in TestRecordLevelIndex test. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6459) Add Rollback validation for Record Level Index
Lokesh Jain created HUDI-6459: - Summary: Add Rollback validation for Record Level Index Key: HUDI-6459 URL: https://issues.apache.org/jira/browse/HUDI-6459 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain Assignee: Lokesh Jain The Jira aims to add validation for rollback with record level index. The validation is added in TestRecordLevelIndex test. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9064: [HUDI-6450] Fix null strings handling in convertRowToJsonString
hudi-bot commented on PR #9064: URL: https://github.com/apache/hudi/pull/9064#issuecomment-1613172562 ## CI report: * 2b572a55998c0e1c4eca7970e8f63ed79254161c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18127) * b8418b74febf4551c0f79c7ebe71cf24916124e6 UNKNOWN * 9c6d2bf222b7247bc926302045123bad69157d39 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18198) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9063: [HUDI-6448] Improve upgrade/downgrade for table ver. 6
hudi-bot commented on PR #9063: URL: https://github.com/apache/hudi/pull/9063#issuecomment-1613172489 ## CI report: * 4775dce07f2f3237b32f22b360f3423b1eafce85 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9041: [HUDI-6431] Support update partition path in record-level index
hudi-bot commented on PR #9041: URL: https://github.com/apache/hudi/pull/9041#issuecomment-1613172359 ## CI report: * b681df04a7ad0febbcd9235622c2ee7f98759cf9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18082) * 2f139383c54f93669342539af77dca9b3a352be3 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9092: [MINOR] Enable log compaction by default for MDT
hudi-bot commented on PR #9092: URL: https://github.com/apache/hudi/pull/9092#issuecomment-1613159780 ## CI report: * 408e9f946e0a0647b0fc9f8e220d55ad2fbde62d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18199) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9089: [MINOR] Increase timeout for Azure CI: UT spark-datasource to 240 minutes
hudi-bot commented on PR #9089: URL: https://github.com/apache/hudi/pull/9089#issuecomment-1613159726 ## CI report: * 4d2e8926188ce5aa2342054aeb99bf1d31eaf0e3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18190) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9064: [HUDI-6450] Fix null strings handling in convertRowToJsonString
hudi-bot commented on PR #9064: URL: https://github.com/apache/hudi/pull/9064#issuecomment-1613159516 ## CI report: * 2b572a55998c0e1c4eca7970e8f63ed79254161c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18127) * b8418b74febf4551c0f79c7ebe71cf24916124e6 UNKNOWN * 9c6d2bf222b7247bc926302045123bad69157d39 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9063: [HUDI-6448] Improve upgrade/downgrade for table ver. 6
hudi-bot commented on PR #9063: URL: https://github.com/apache/hudi/pull/9063#issuecomment-1613159448 ## CI report: * 4775dce07f2f3237b32f22b360f3423b1eafce85 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18191) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codope commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
codope commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1246592430 ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -209,9 +211,10 @@ public class HoodieMetadataPayload implements HoodieRecordPayload orderingVal) { -this(Option.of(record)); + public HoodieMetadataPayload(@Nullable GenericRecord record, Comparable orderingVal) { +this(Option.ofNullable(record)); Review Comment: Where is this constructor used? ## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java: ## @@ -283,6 +285,8 @@ public HoodieMetadataPayload(Option recordOpt) { Integer.parseInt(recordIndexRecord.get(RECORD_INDEX_FIELD_FILE_INDEX).toString()), Long.parseLong(recordIndexRecord.get(RECORD_INDEX_FIELD_INSTANT_TIME).toString())); } +} else { + this.isDeletedRecord = true; Review Comment: I would favor `isDeleted` field in `HoodieRecordIndexInfo` in the schema. 1. It keeps the schema consistent wrt deletes for different MDT index types. Let's say some index types have `isDeleted` and some don't, then it's an added mental burden for developers and also not easy to maintain as we add more indexes. 2. It gives enough flexibility to have separate delete handling logic for different index types. 3. Let's consider the semantics of the if-else in the `HoodieMetadataPayload` constructor. It is based on different index types. By setting `this.isDeletedRecord = true` in the last else-block we're saying that for all index types other than the ones above, consider the record to be deleted. It does not make much sense from the pov of adding more index types in the future. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
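A rough sketch of the trade-off discussed in the review above, assuming hypothetical types (this is not the real `HoodieMetadataPayload` or `HoodieRecordIndexInfo` schema): the PR infers a delete from an absent payload in the constructor's final else-block, while the reviewer suggests an explicit per-payload flag so each metadata-table index type keeps a consistent schema and its own delete semantics.

```java
import java.util.Optional;

public class DeleteMarkerSketch {

  // Hypothetical stand-in for a record-index payload carrying an explicit delete flag.
  static class RecordIndexInfo {
    final String partitionPath;
    final String fileId;
    final boolean isDeleted;

    RecordIndexInfo(String partitionPath, String fileId, boolean isDeleted) {
      this.partitionPath = partitionPath;
      this.fileId = fileId;
      this.isDeleted = isDeleted;
    }
  }

  // Approach in the PR under review: any index type without a payload is treated as deleted.
  static boolean deletedByInference(Optional<RecordIndexInfo> payload) {
    return !payload.isPresent();
  }

  // Approach suggested in the review: the payload itself marks whether it is a tombstone.
  static boolean deletedByExplicitFlag(RecordIndexInfo payload) {
    return payload.isDeleted;
  }

  public static void main(String[] args) {
    RecordIndexInfo live = new RecordIndexInfo("2023/06/29", "file-1", false);
    RecordIndexInfo tombstone = new RecordIndexInfo("2023/06/29", "file-1", true);
    System.out.println(deletedByInference(Optional.empty()));   // true
    System.out.println(deletedByExplicitFlag(live));            // false
    System.out.println(deletedByExplicitFlag(tombstone));       // true
  }
}
```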
[GitHub] [hudi] xushiyan commented on a diff in pull request #9041: [HUDI-6431] Support update partition path in record-level index
xushiyan commented on code in PR #9041: URL: https://github.com/apache/hudi/pull/9041#discussion_r1246579974 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java: ## @@ -310,6 +312,56 @@ public static HoodieData> mergeForPartitionUpdates( return Arrays.asList(deleteRecord, getTaggedRecord(merged, Option.empty())).iterator(); } }); -return taggedUpdatingRecords.union(newRecords); +return taggedUpdatingRecords.union(taggedNewRecords); + } + + public static HoodieData> tagGlobalLocationBackToRecords( + HoodieData> incomingRecords, + HoodiePairData keyAndExistingLocations, + boolean mayContainDuplicateLookup, + boolean shouldUpdatePartitionPath, + HoodieWriteConfig config, + HoodieTable table) { +final HoodieRecordMerger merger = config.getRecordMerger(); + +HoodiePairData> keyAndIncomingRecords = +incomingRecords.mapToPair(record -> Pair.of(record.getRecordKey(), record)); + +// Pair of incoming record and the global location if meant for merged lookup in later stage +HoodieData, Option>> incomingRecordsAndLocations += keyAndIncomingRecords.leftOuterJoin(keyAndExistingLocations).values() +.map(v -> { + final HoodieRecord incomingRecord = v.getLeft(); + Option currentLocOpt = Option.ofNullable(v.getRight().orElse(null)); + if (currentLocOpt.isPresent()) { +HoodieRecordGlobalLocation currentLoc = currentLocOpt.get(); +boolean shouldPerformMergedLookUp = mayContainDuplicateLookup +|| !Objects.equals(incomingRecord.getPartitionPath(), currentLoc.getPartitionPath()); +if (shouldUpdatePartitionPath && shouldPerformMergedLookUp) { + return Pair.of(incomingRecord, currentLocOpt); +} else { + // - When update partition path is set to false, + // the incoming record will be tagged to the existing record's partition regardless of being equal or not. + // - When update partition path is set to true, + // the incoming record will be tagged to the existing record's partition + // when partition is not updated and the look-up won't have duplicates (e.g. COW, or using RLI). + return Pair.of((HoodieRecord) getTaggedRecord( + createNewHoodieRecord(incomingRecord, currentLoc, merger), Option.of(currentLoc)), + Option.empty()); +} + } else { +return Pair.of(getTaggedRecord(incomingRecord, Option.empty()), Option.empty()); Review Comment: refactored relevant helper methods -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on a diff in pull request #9041: [HUDI-6431] Support update partition path in record-level index
xushiyan commented on code in PR #9041: URL: https://github.com/apache/hudi/pull/9041#discussion_r1246579590 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java: ## @@ -310,6 +312,56 @@ public static HoodieData> mergeForPartitionUpdates( return Arrays.asList(deleteRecord, getTaggedRecord(merged, Option.empty())).iterator(); } }); -return taggedUpdatingRecords.union(newRecords); +return taggedUpdatingRecords.union(taggedNewRecords); + } + + public static HoodieData> tagGlobalLocationBackToRecords( + HoodieData> incomingRecords, + HoodiePairData keyAndExistingLocations, + boolean mayContainDuplicateLookup, + boolean shouldUpdatePartitionPath, + HoodieWriteConfig config, + HoodieTable table) { +final HoodieRecordMerger merger = config.getRecordMerger(); + +HoodiePairData> keyAndIncomingRecords = +incomingRecords.mapToPair(record -> Pair.of(record.getRecordKey(), record)); + +// Pair of incoming record and the global location if meant for merged lookup in later stage +HoodieData, Option>> incomingRecordsAndLocations += keyAndIncomingRecords.leftOuterJoin(keyAndExistingLocations).values() +.map(v -> { + final HoodieRecord incomingRecord = v.getLeft(); + Option currentLocOpt = Option.ofNullable(v.getRight().orElse(null)); + if (currentLocOpt.isPresent()) { +HoodieRecordGlobalLocation currentLoc = currentLocOpt.get(); +boolean shouldPerformMergedLookUp = mayContainDuplicateLookup +|| !Objects.equals(incomingRecord.getPartitionPath(), currentLoc.getPartitionPath()); +if (shouldUpdatePartitionPath && shouldPerformMergedLookUp) { + return Pair.of(incomingRecord, currentLocOpt); +} else { + // - When update partition path is set to false, + // the incoming record will be tagged to the existing record's partition regardless of being equal or not. + // - When update partition path is set to true, + // the incoming record will be tagged to the existing record's partition + // when partition is not updated and the look-up won't have duplicates (e.g. COW, or using RLI). + return Pair.of((HoodieRecord) getTaggedRecord( + createNewHoodieRecord(incomingRecord, currentLoc, merger), Option.of(currentLoc)), Review Comment: fixed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [MINOR] Improve CollectionUtils helper methods (#9088)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 8def3e68ae5 [MINOR] Improve CollectionUtils helper methods (#9088) 8def3e68ae5 is described below commit 8def3e68ae5a0b72eefe26db49b6d33226f7b4c0 Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com> AuthorDate: Thu Jun 29 05:35:19 2023 -0700 [MINOR] Improve CollectionUtils helper methods (#9088) --- .../action/clean/CleanPlanActionExecutor.java | 4 +-- .../action/commit/TestSchemaEvolutionClient.java | 3 +- .../table/action/rollback/TestRollbackUtils.java | 3 +- .../table/functional/TestCleanPlanExecutor.java| 2 +- .../apache/hudi/common/util/CollectionUtils.java | 35 +++--- .../hudi/common/table/TestTimelineUtils.java | 2 +- .../table/view/TestIncrementalFSViewSync.java | 3 +- .../hudi/common/testutils/HoodieTestTable.java | 8 ++--- 8 files changed, 23 insertions(+), 37 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java index 043db1acbf9..ba7c71b1356 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java @@ -29,7 +29,6 @@ import org.apache.hudi.common.table.timeline.HoodieInstant; import org.apache.hudi.common.table.timeline.HoodieTimeline; import org.apache.hudi.common.table.timeline.TimelineMetadataUtils; import org.apache.hudi.common.util.CleanerUtils; -import org.apache.hudi.common.util.CollectionUtils; import org.apache.hudi.common.util.Option; import org.apache.hudi.common.util.collection.Pair; import org.apache.hudi.config.HoodieWriteConfig; @@ -42,6 +41,7 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; import java.io.IOException; +import java.util.Collections; import java.util.List; import java.util.Map; import java.util.stream.Collectors; @@ -132,7 +132,7 @@ public class CleanPlanActionExecutor extends BaseActionExecutor new HoodieActionInstant(x.getTimestamp(), x.getAction(), x.getState().name())).orElse(null), planner.getLastCompletedCommitTimestamp(), - config.getCleanerPolicy().name(), CollectionUtils.createImmutableMap(), + config.getCleanerPolicy().name(), Collections.emptyMap(), CleanPlanner.LATEST_CLEAN_PLAN_VERSION, cleanOps, partitionsToDelete); } catch (IOException e) { throw new HoodieIOException("Failed to schedule clean operation", e); diff --git a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestSchemaEvolutionClient.java b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestSchemaEvolutionClient.java index bf825df570f..dc45a80754b 100644 --- a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestSchemaEvolutionClient.java +++ b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestSchemaEvolutionClient.java @@ -24,7 +24,6 @@ import org.apache.hudi.common.model.HoodieAvroRecord; import org.apache.hudi.common.model.HoodieKey; import org.apache.hudi.common.table.TableSchemaResolver; import org.apache.hudi.common.testutils.RawTripTestPayload; -import org.apache.hudi.common.util.CollectionUtils; import 
org.apache.hudi.config.HoodieWriteConfig; import org.apache.hudi.internal.schema.Types; import org.apache.hudi.testutils.HoodieJavaClientTestHarness; @@ -72,7 +71,7 @@ public class TestSchemaEvolutionClient extends HoodieJavaClientTestHarness { .withEngineType(EngineType.JAVA) .withPath(basePath) .withSchema(SCHEMA.toString()) - .withProps(CollectionUtils.createImmutableMap(HoodieWriteConfig.TBL_NAME.key(), "hoodie_test_table")) +.withProps(Collections.singletonMap(HoodieWriteConfig.TBL_NAME.key(), "hoodie_test_table")) .build(); return new HoodieJavaWriteClient<>(context, config); } diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/rollback/TestRollbackUtils.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/rollback/TestRollbackUtils.java index f03d9f3967d..c22a2aef424 100644 --- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/rollback/TestRollbackUtils.java +++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/rollback/TestRollbackUtils.java @@ -30,6 +30,7 @@ import org.apache.hadoop.fs.Path; import org.apache.hadoop.fs.permission.FsPermission; import org.junit.jupiter.api.Test; +import jav
[GitHub] [hudi] xushiyan merged pull request #9088: [MINOR] Improve CollectionUtils helper methods
xushiyan merged PR #9088: URL: https://github.com/apache/hudi/pull/9088 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9092: [MINOR] Enable log compaction by default for MDT
hudi-bot commented on PR #9092: URL: https://github.com/apache/hudi/pull/9092#issuecomment-1613076306 ## CI report: * 408e9f946e0a0647b0fc9f8e220d55ad2fbde62d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9064: [HUDI-6450] Fix null strings handling in convertRowToJsonString
hudi-bot commented on PR #9064: URL: https://github.com/apache/hudi/pull/9064#issuecomment-1613075951 ## CI report: * 2b572a55998c0e1c4eca7970e8f63ed79254161c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18127) * b8418b74febf4551c0f79c7ebe71cf24916124e6 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8609: [HUDI-6154] Introduced retry while reading hoodie.properties to deal with parallel updates.
hudi-bot commented on PR #8609: URL: https://github.com/apache/hudi/pull/8609#issuecomment-1613056925 ## CI report: * e14bd41edf6cc961d77087eea67f755f23590834 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17992) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18115) * a64034d612fa64c99dd8d319ac00680924773f53 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18197) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6458) Scheduling jobs should not fail when there is no completed commits
kwang created HUDI-6458: --- Summary: Scheduling jobs should not fail when there is no completed commits Key: HUDI-6458 URL: https://issues.apache.org/jira/browse/HUDI-6458 Project: Apache Hudi Issue Type: Improvement Reporter: kwang Fix For: 0.14.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] zaza commented on a diff in pull request #9064: [HUDI-6450] Fix null strings handling in convertRowToJsonString
zaza commented on code in PR #9064: URL: https://github.com/apache/hudi/pull/9064#discussion_r1246538265 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/HoodieCDCRDD.scala: ## @@ -561,7 +561,7 @@ class HoodieCDCRDD( originTableSchema.structTypeSchema.zipWithIndex.foreach { case (field, idx) => if (field.dataType.isInstanceOf[StringType]) { -map(field.name) = record.getString(idx) +map(field.name) = Option(record.getUTF8String(idx)).map(_.toString).orNull } else { Review Comment: This is what I have based on my limited knowledge of Hudi: https://github.com/apache/hudi/pull/9064/commits/c88aee0f26afa779594a9981d86aeb3d06727d4b I'm more than happy to make further adjustments when needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
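A minimal, self-contained illustration of why the change under discussion matters, written against Spark's `InternalRow` API rather than the Hudi test harness: `getString(i)` dereferences the underlying `UTF8String`, so a null string column throws, whereas `getUTF8String(i)` can be null-checked first (the Java equivalent of the Scala fix `Option(record.getUTF8String(idx)).map(_.toString).orNull`). This is a sketch only, not the unit test requested in the thread.

```java
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.unsafe.types.UTF8String;

public class NullStringHandlingSketch {
  public static void main(String[] args) {
    // A row with one populated string column and one null string column.
    InternalRow row = new GenericInternalRow(new Object[] {
        UTF8String.fromString("some-value"), null });

    // Null-safe pattern used by the fix above.
    String col0 = row.getUTF8String(0) == null ? null : row.getUTF8String(0).toString();
    String col1 = row.getUTF8String(1) == null ? null : row.getUTF8String(1).toString();
    System.out.println(col0); // some-value
    System.out.println(col1); // null

    try {
      row.getString(1); // the old code path: throws on a null string column
    } catch (NullPointerException e) {
      System.out.println("getString(1) throws on null string columns");
    }
  }
}
```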
[GitHub] [hudi] codope opened a new pull request, #9092: [MINOR] Enable log compaction by default for MDT
codope opened a new pull request, #9092: URL: https://github.com/apache/hudi/pull/9092 ### Change Logs Enable log compaction on metadata table by default. ### Impact Will compact log blocks to produce another log file every 5 log blocks. ### Risk level (write none, low medium or high below) medium ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9088: [MINOR] Improve CollectionUtils helper methods
hudi-bot commented on PR #9088: URL: https://github.com/apache/hudi/pull/9088#issuecomment-1613041272 ## CI report: * fb282b7602962846c4f561cd101033fca41e43d6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18182) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8609: [HUDI-6154] Introduced retry while reading hoodie.properties to deal with parallel updates.
hudi-bot commented on PR #8609: URL: https://github.com/apache/hudi/pull/8609#issuecomment-1613038827 ## CI report: * e14bd41edf6cc961d77087eea67f755f23590834 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=17992) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18115) * a64034d612fa64c99dd8d319ac00680924773f53 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6457) Keep JavaSizeBasedClusteringPlanStrategy and SparkSizeBasedClusteringPlanStrategy aligned
kwang created HUDI-6457: --- Summary: Keep JavaSizeBasedClusteringPlanStrategy and SparkSizeBasedClusteringPlanStrategy aligned Key: HUDI-6457 URL: https://issues.apache.org/jira/browse/HUDI-6457 Project: Apache Hudi Issue Type: Improvement Reporter: kwang Fix For: 0.14.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] zaza commented on a diff in pull request #9064: [HUDI-6450] Fix null strings handling in convertRowToJsonString
zaza commented on code in PR #9064: URL: https://github.com/apache/hudi/pull/9064#discussion_r1246504222 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/HoodieCDCRDD.scala: ## @@ -561,7 +561,7 @@ class HoodieCDCRDD( originTableSchema.structTypeSchema.zipWithIndex.foreach { case (field, idx) => if (field.dataType.isInstanceOf[StringType]) { -map(field.name) = record.getString(idx) +map(field.name) = Option(record.getUTF8String(idx)).map(_.toString).orNull } else { Review Comment: Absolutely, the only problem is that I don't see any unit tests for the cdc package so it's hard to follow existing examples. I tried implementing a test that extends `HoodieClientTestBase` but that was getting me far from the requested "unit test". What would be the best way to start with tests for this particular issue? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] lipusheng opened a new issue, #9091: [SUPPORT]
lipusheng opened a new issue, #9091: URL: https://github.com/apache/hudi/issues/9091 **_Tips before filing an issue_** - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? - Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org. - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly. **Describe the problem you faced** When I use Spark to sync Hive table data into a Hudi table, I set `KeyGeneratorOptions.RECORDKEY_FIELD_NAME` to "id,user_id", set the key generator class to `NonpartitionedKeyGenerator`, and set `hoodie.datasource.write.operation` to `WriteOperationType.BULK_INSERT`. In this case the `_hoodie_record_key` written is "125230088,6941". When I later ingest Kafka data, I only change `hoodie.datasource.write.operation` to `WriteOperationType.UPSERT`, but the `_hoodie_record_key` format changes to "user_id:125230088,id:6941", and duplicate data shows up in queries. ![image](https://github.com/apache/hudi/assets/57984409/f45c37a8-b38c-4457-9677-2fcbe3bac178) **To Reproduce** Steps to reproduce the behavior: 1. 2. 3. 4. **Expected behavior** A clear and concise description of what you expected to happen. **Environment Description** * Hudi version : 0.12.0 * Spark version : 3.3.1 * Hive version : 3.1.3 * Hadoop version : 3.2.1 * Storage (HDFS/S3/GCS..) : OSS * Running on Docker? (yes/no) : no **Additional context** Add any other context about the problem here. **Stacktrace** ```Add the stacktrace of the error.``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
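The two record-key encodings reported in this issue can be reproduced with a small illustration (values taken from the report; the rendering logic below is a simplified stand-in, not Hudi's key-generator code): the bulk_insert row-writer path apparently wrote bare values, while the upsert path wrote `field:value` pairs, so the same business key yields two distinct `_hoodie_record_key` values and therefore duplicates on read.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class RecordKeyFormats {
  public static void main(String[] args) {
    Map<String, String> keyFields = new LinkedHashMap<>();
    keyFields.put("user_id", "125230088");
    keyFields.put("id", "6941");

    // Format observed from the bulk_insert row-writer path: values only.
    String bareKey = String.join(",", keyFields.values());          // 125230088,6941

    // Format observed from the later upsert: "field:value" pairs.
    String namedKey = keyFields.entrySet().stream()
        .map(e -> e.getKey() + ":" + e.getValue())
        .collect(Collectors.joining(","));                          // user_id:125230088,id:6941

    // The two encodings never match, which is why the table ends up with duplicates.
    System.out.println(bareKey.equals(namedKey));                   // false
  }
}
```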
[GitHub] [hudi] codope commented on a diff in pull request #8609: [HUDI-6154] Introduced retry while reading hoodie.properties to deal with parallel updates.
codope commented on code in PR #8609: URL: https://github.com/apache/hudi/pull/8609#discussion_r1246489239 ## hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java: ## @@ -334,22 +337,43 @@ public HoodieTableConfig() { super(); } - private void fetchConfigs(FileSystem fs, String metaPath) throws IOException { + private static TypedProperties fetchConfigs(FileSystem fs, String metaPath) throws IOException { Path cfgPath = new Path(metaPath, HOODIE_PROPERTIES_FILE); -try (FSDataInputStream is = fs.open(cfgPath)) { - props.load(is); -} catch (IOException ioe) { - if (!fs.exists(cfgPath)) { -LOG.warn("Run `table recover-configs` if config update/delete failed midway. Falling back to backed up configs."); -// try the backup. this way no query ever fails if update fails midway. -Path backupCfgPath = new Path(metaPath, HOODIE_PROPERTIES_FILE_BACKUP); -try (FSDataInputStream is = fs.open(backupCfgPath)) { +Path backupCfgPath = new Path(metaPath, HOODIE_PROPERTIES_FILE_BACKUP); +int readRetryCount = 0; +boolean found = false; + +TypedProperties props = new TypedProperties(); +while (readRetryCount++ < MAX_READ_RETRIES) { + for (Path path : Arrays.asList(cfgPath, backupCfgPath)) { +// Read the properties and validate that it is a valid file +try (FSDataInputStream is = fs.open(path)) { + props.clear(); props.load(is); + found = true; + ValidationUtils.checkArgument(validateChecksum(props)); + return props; +} catch (IOException e) { + LOG.warn(String.format("Could not read properties from %s: %s", path, e)); +} catch (IllegalArgumentException e) { + LOG.warn(String.format("Invalid properties file %s: %s", path, props)); } - } else { -throw ioe; + } + + // Failed to read all files so wait before retrying. This can happen in cases of parallel updates to the properties. + try { +Thread.sleep(READ_RETRY_DELAY_MSEC); + } catch (InterruptedException e) { +LOG.warn("Interrupted while waiting"); } } + +// If we are here then after all retries either no hoodie.properties was found or only an invalid file was found. +if (found) { + throw new IllegalArgumentException("hoodie.properties file seems invalid. Please check for left over `.updated` files if any, manually copy it to hoodie.properties and retry"); +} else { + throw new HoodieIOException("Could not load Hoodie properties from " + cfgPath); Review Comment: Fixed the deltastreamer tests by modifying the exception message here, as deltastreamer depends on the specific message. Pitfalls of depending on the exception message as business logic! We should try to avoid that as much as possible. https://github.com/apache/hudi/blob/b95248e011931f4748a7a9fbb8298cbbb71bda88/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L695-L697 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
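The review comment above warns against matching on exception message text. A rough sketch of the alternative it hints at, using hypothetical class and method names (these are not Hudi APIs): throw and catch a dedicated exception type so callers stay robust to message wording changes.

```java
public class TypedExceptionSketch {

  // Hypothetical dedicated exception type for "table properties could not be loaded".
  static class TablePropertiesNotFoundException extends RuntimeException {
    TablePropertiesNotFoundException(String path) {
      super("Could not load Hoodie properties from " + path);
    }
  }

  // Stand-in for a config loader that fails after exhausting its retries.
  static void loadTableConfig(String metaPath) {
    throw new TablePropertiesNotFoundException(metaPath + "/hoodie.properties");
  }

  public static void main(String[] args) {
    try {
      loadTableConfig("/tmp/hudi_table/.hoodie");
    } catch (TablePropertiesNotFoundException e) {
      // Branching on the exception type, not on e.getMessage() string contents.
      System.out.println("Table not initialized yet: " + e.getMessage());
    }
  }
}
```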
[GitHub] [hudi] hudi-bot commented on pull request #9082: [HUDI-6445] Distribute spark ds func tests
hudi-bot commented on PR #9082: URL: https://github.com/apache/hudi/pull/9082#issuecomment-1612915683 ## CI report: * c529c624afdca331514a2bdfb78cc6e18ab9f57a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18185) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9007: [HUDI-6405] Fix incremental file sync for clustering and logcompaction
hudi-bot commented on PR #9007: URL: https://github.com/apache/hudi/pull/9007#issuecomment-1612915077 ## CI report: * 3b6d13a83efdae5e46eebe9ae168ba7e0d8e9f34 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18189) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] LINGQ1991 commented on issue #8903: [SUPPORT] aws spark3.2.1 & hudi 0.13.1 with java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile
LINGQ1991 commented on issue #8903: URL: https://github.com/apache/hudi/issues/8903#issuecomment-1612912367 > @ad1happy2go I use emr-6.5.0. It's error with " java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile". > > But i have package with oss spark and hudi bundle. Work ok now. > > ```java > > org.apache.maven.plugins > maven-shade-plugin > 3.2.1 > > hudi-${spark.version}-plugin > false > > > > package > > shade > > > > > org.apache.spark.sql.execution.datasources.PartitionedFile > org.local.spark.sql.execution.datasources.PartitionedFile > > > org.apache.curator > org.local.curator > > > > implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer"/> > > > > *:* > > module-info.class > org/apache/spark/unused/** > > > > *:* > > META-INF/*.SF > META-INF/*.DSA > META-INF/*.RSA > > > > > > > > ``` I have package with hudi bundle. But the following error occurred `Caused by: java.lang.ClassCastException: org.apache.hudi.spark.org.apache.spark.sql.execution.datasources.PartitionedFile cannot be cast to org.apache.spark.sql.execution.datasources.PartitionedFile at org.apache.hudi.HoodieMergeOnReadRDD.read(HoodieMergeOnReadRDD.scala:113) at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] flashJd commented on pull request #9048: [HUDI-6434] Fix illegalArgumentException when do read_optimized read in Flink
flashJd commented on PR #9048: URL: https://github.com/apache/hudi/pull/9048#issuecomment-1612907539 > The `DeltaCommitWriteHandleFactory` can be tweaked for the purpose, I'm wondering what's the engine conflicts you are talking about? Sorry for the late reply. ## Engine conflicts: On v0.12.2, when Spark insert-overwrites a partition after Flink has written only log files for a bucket in that partition, https://github.com/apache/hudi/blob/b95248e011931f4748a7a9fbb8298cbbb71bda88/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java#L160 throws, but I found this is already fixed on master. ## Other considerations: If we align the logic for creating the first base file, a lot of code can be simplified, for example: https://github.com/apache/hudi/blob/b95248e011931f4748a7a9fbb8298cbbb71bda88/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L362 https://github.com/apache/hudi/blob/b95248e011931f4748a7a9fbb8298cbbb71bda88/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/CompactionExecutionHelper.java#L63 https://github.com/apache/hudi/blob/b95248e011931f4748a7a9fbb8298cbbb71bda88/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java#L200 etc. What's your opinion? Looking forward to your reply. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] flashJd commented on pull request #9048: [HUDI-6434] Fix illegalArgumentException when do read_optimized read in Flink
flashJd commented on PR #9048: URL: https://github.com/apache/hudi/pull/9048#issuecomment-1612904526 > sry to reply late ## engine conflicts: v0.12.2 when spark insert overwrite a partition after flink write the log files only bucket in this partition, https://github.com/apache/hudi/blob/b95248e011931f4748a7a9fbb8298cbbb71bda88/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java#L160 throws, but I found it was fixed in the master ## other consideration: If align the first create base file logic, many codes can be simplified, like: https://github.com/apache/hudi/blob/b95248e011931f4748a7a9fbb8298cbbb71bda88/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L362 https://github.com/apache/hudi/blob/b95248e011931f4748a7a9fbb8298cbbb71bda88/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/CompactionExecutionHelper.java#L63 https://github.com/apache/hudi/blob/b95248e011931f4748a7a9fbb8298cbbb71bda88/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java#L200 etc. what's your opinion, looking forward to your reply -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] beyond1920 opened a new issue, #9090: [SUPPORT]
beyond1920 opened a new issue, #9090: URL: https://github.com/apache/hudi/issues/9090 I cherry-picked [HUDI-1517](https://issues.apache.org/jira/browse/HUDI-1517) into our internal Hudi version and hit a FileNotFoundException while reading the latest snapshot of a MOR table. ![1688033363329](https://github.com/apache/hudi/assets/1525333/9330203d-866e-4c3d-96a8-922960afc152) The exception can happen when Spark speculative execution is enabled and there are concurrent writers and readers. For example: 1. Job1 is writing to a MOR table and has not finished yet; it has Spark speculative execution enabled. 2. Job2 is reading the latest snapshot of the MOR table; when it calls getLatestMergedFileSlicesBeforeOrOn, it may list log files written by speculative attempt tasks of Job1. 3. Job1 finishes and deletes the log files written by the slow speculative tasks. 4. Job2 throws the FileNotFoundException when it reads a log file that was already deleted in step 3. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] beyond1920 commented on pull request #4913: [HUDI-1517] create marker file for every log file
beyond1920 commented on PR #4913: URL: https://github.com/apache/hudi/pull/4913#issuecomment-1612808066 I cherry-picked this PR into our internal Hudi and hit a `FileNotFoundException` while reading the latest snapshot of a MOR table. ![1688033363329](https://github.com/apache/hudi/assets/1525333/99459239-1dbf-4067-8020-d4e20bae0bd1) The exception can happen when Spark speculative execution is enabled, under the following case: 1. Job1 is writing to a MOR table and has not finished yet; it has Spark speculative execution enabled. 2. Job2 is reading the latest snapshot of the MOR table; when it calls `getLatestMergedFileSlicesBeforeOrOn`, it may list log files written by speculative attempt tasks of Job1. 3. Job1 finishes and deletes the log files written by the slow speculative tasks. 4. Job2 throws the `FileNotFoundException` when it reads a log file that was already deleted in step 3. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9066: [HUDI-6452] Add MOR snapshot reader to integrate with query engines without using Hadoop APIs
hudi-bot commented on PR #9066: URL: https://github.com/apache/hudi/pull/9066#issuecomment-1612807150 ## CI report: * 8662958e8ccb7203d320dc33445f9f2dbc28fb0c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18159) * 60c1b8c5885fdda28e07f3ba79290f01dc60a9c4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18196) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8933: [HUDI-5329] Spark reads table error when Flink creates table without record key and primary key
hudi-bot commented on PR #8933: URL: https://github.com/apache/hudi/pull/8933#issuecomment-1612806333 ## CI report: * d1564f421664fd2dee15dfdbdae4dec07baedf92 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18186) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9066: [HUDI-6452] Add MOR snapshot reader to integrate with query engines without using Hadoop APIs
hudi-bot commented on PR #9066: URL: https://github.com/apache/hudi/pull/9066#issuecomment-1612791679 ## CI report: * 8662958e8ccb7203d320dc33445f9f2dbc28fb0c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18159) * 60c1b8c5885fdda28e07f3ba79290f01dc60a9c4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
hudi-bot commented on PR #9058: URL: https://github.com/apache/hudi/pull/9058#issuecomment-1612791490 ## CI report: * 345482ba6529fc3bf0ac9f50ce0c1d79a3accd37 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18163) * 1697d1bfa095ca16a9361e3728a77331d3a28037 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18195) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
hudi-bot commented on PR #9058: URL: https://github.com/apache/hudi/pull/9058#issuecomment-1612774450 ## CI report: * 345482ba6529fc3bf0ac9f50ce0c1d79a3accd37 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18163) * 1697d1bfa095ca16a9361e3728a77331d3a28037 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9017: [HUDI-6393] Add functional tests for RecordLevelIndex
hudi-bot commented on PR #9017: URL: https://github.com/apache/hudi/pull/9017#issuecomment-1612701307 ## CI report: * a3c1d99e2266ec68d9082fe4c76c4bf62070f5a9 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18184) * ceffe7d8146f48e1c6c083613646463c1404a77f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18194) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on a diff in pull request #9058: [HUDI-6376] Support for deletes in HUDI Indexes including metadata table record index.
xushiyan commented on code in PR #9058: URL: https://github.com/apache/hudi/pull/9058#discussion_r1246371700 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieIndex.java: ## @@ -749,6 +749,67 @@ public void testRecordIndexTagLocationAndUpdate(boolean populateMetaFields) thro assertEquals(newInsertsCount, recordLocations.filter(entry -> newPartitionPath.equalsIgnoreCase(entry._1.getPartitionPath())).count()); } + @ParameterizedTest + @ValueSource(strings = "INMEMORY") Review Comment: fixed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9088: [MINOR] Improve CollectionUtils helper methods
hudi-bot commented on PR #9088: URL: https://github.com/apache/hudi/pull/9088#issuecomment-1612690821 ## CI report: * fb282b7602962846c4f561cd101033fca41e43d6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18182) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant
hudi-bot commented on PR #9038: URL: https://github.com/apache/hudi/pull/9038#issuecomment-1612690558 ## CI report: * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN * 34f8823f48712c57058bc37c8936a276c1457557 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18187) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18193) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18188) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9017: [HUDI-6393] Add functional tests for RecordLevelIndex
hudi-bot commented on PR #9017: URL: https://github.com/apache/hudi/pull/9017#issuecomment-1612690440 ## CI report: * d0b2f2457cf648b1b631c75bd64cc1320af69393 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18030) * a3c1d99e2266ec68d9082fe4c76c4bf62070f5a9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18184) * ceffe7d8146f48e1c6c083613646463c1404a77f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9088: [MINOR] Improve CollectionUtils helper methods
hudi-bot commented on PR #9088: URL: https://github.com/apache/hudi/pull/9088#issuecomment-1612678874 ## CI report: * fb282b7602962846c4f561cd101033fca41e43d6 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9063: [HUDI-6448] Improve upgrade/downgrade for table ver. 6
hudi-bot commented on PR #9063: URL: https://github.com/apache/hudi/pull/9063#issuecomment-1612678677 ## CI report: * 69b2bb853be0f79845efd56f68b934b9f69ae22a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18160) * 4775dce07f2f3237b32f22b360f3423b1eafce85 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18191) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9038: [HUDI-6423] Incremental cleaning should consider inflight compaction instant
hudi-bot commented on PR #9038: URL: https://github.com/apache/hudi/pull/9038#issuecomment-1612678539 ## CI report: * a65a29c0cf1c8feb9f39e168ba80c99ebcae1c5d UNKNOWN * 34f8823f48712c57058bc37c8936a276c1457557 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18188) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18187) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18193) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-6151) Rollback previously applied commits to MDT when operations are retried.
[ https://issues.apache.org/jira/browse/HUDI-6151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen closed HUDI-6151.
----------------------------
    Resolution: Fixed

Fixed via master branch: b95248e011931f4748a7a9fbb8298cbbb71bda88

> Rollback previously applied commits to MDT when operations are retried.
> ------------------------------------------------------------------------
>
>                 Key: HUDI-6151
>                 URL: https://issues.apache.org/jira/browse/HUDI-6151
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.14.0
>
> Operations like Clean, Compaction are retried after failures with the same instant time. If the previous run of the operation successfully committed to the MDT but failed to commit to the dataset, then the operation will be retried later with the same instantTime causing duplicate updates applied to MDT.
> Currently, we simply delete the completed deltacommit without rolling back the deltacommit.
> To handle this, we detect a replay of operation and rollback any changes from that operation in MDT.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[hudi] branch master updated: [HUDI-6151] Rollback previously applied commits to MDT when operations are retried (#8604)
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new b95248e0119 [HUDI-6151] Rollback previously applied commits to MDT when operations are retried (#8604)
b95248e0119 is described below

commit b95248e011931f4748a7a9fbb8298cbbb71bda88
Author: Prashant Wason
AuthorDate: Thu Jun 29 01:59:08 2023 -0700

    [HUDI-6151] Rollback previously applied commits to MDT when operations are retried (#8604)

    Operations like Clean, Compaction are retried after failures with the same instant time. If the previous run of the operation successfully committed to the MDT but failed to commit to the dataset, then the operation will be retried later with the same instantTime causing duplicate updates applied to MDT.

    Currently, we simply delete the completed deltacommit without rolling back the deltacommit.

    To handle this, we detect a replay of operation and rollback any changes from that operation in MDT.

    Co-authored-by: Sagar Sumit
---
 .../FlinkHoodieBackedTableMetadataWriter.java | 50
 .../SparkHoodieBackedTableMetadataWriter.java | 38 ++--
 .../functional/TestHoodieBackedMetadata.java  | 68 +-
 3 files changed, 113 insertions(+), 43 deletions(-)

diff --git a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/metadata/FlinkHoodieBackedTableMetadataWriter.java b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/metadata/FlinkHoodieBackedTableMetadataWriter.java
index 7dd32e2916e..6edeac05a74 100644
--- a/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/metadata/FlinkHoodieBackedTableMetadataWriter.java
+++ b/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/metadata/FlinkHoodieBackedTableMetadataWriter.java
@@ -32,9 +32,13 @@ import org.apache.hudi.common.table.timeline.HoodieInstant;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.ValidationUtils;
 import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieMetadataException;
 import org.apache.hudi.exception.HoodieNotSupportedException;

 import org.apache.hadoop.conf.Configuration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.List;
@@ -46,7 +50,7 @@ import static org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy.EAGE
  * Flink hoodie backed table metadata writer.
  */
 public class FlinkHoodieBackedTableMetadataWriter extends HoodieBackedTableMetadataWriter {
-
+  private static final Logger LOG = LoggerFactory.getLogger(FlinkHoodieBackedTableMetadataWriter.class);
   private transient BaseHoodieWriteClient writeClient;

   public static HoodieTableMetadataWriter create(Configuration conf, HoodieWriteConfig writeConfig,
@@ -118,33 +122,31 @@ public class FlinkHoodieBackedTableMetadataWriter extends HoodieBackedTableMetad
     if (!metadataMetaClient.getActiveTimeline().containsInstant(instantTime)) {
       // if this is a new commit being applied to metadata for the first time
-      writeClient.startCommitWithTime(instantTime);
-      metadataMetaClient.getActiveTimeline().transitionRequestedToInflight(HoodieActiveTimeline.DELTA_COMMIT_ACTION, instantTime);
+      LOG.info("New commit at " + instantTime + " being applied to MDT.");
     } else {
-      Option alreadyCompletedInstant = metadataMetaClient.getActiveTimeline().filterCompletedInstants().filter(entry -> entry.getTimestamp().equals(instantTime)).lastInstant();
-      if (alreadyCompletedInstant.isPresent()) {
-        // this code path refers to a re-attempted commit that got committed to metadata table, but failed in datatable.
-        // for eg, lets say compaction c1 on 1st attempt succeeded in metadata table and failed before committing to datatable.
-        // when retried again, data table will first rollback pending compaction. these will be applied to metadata table, but all changes
-        // are upserts to metadata table and so only a new delta commit will be created.
-        // once rollback is complete, compaction will be retried again, which will eventually hit this code block where the respective commit is
-        // already part of completed commit. So, we have to manually remove the completed instant and proceed.
-        // and it is for the same reason we enabled withAllowMultiWriteOnSameInstant for metadata table.
-        HoodieActiveTimeline.deleteInstantFile(metadataMetaClient.getFs(), metadataMetaClient.getMetaPath(), alreadyCompletedInstant.get());
-        metadataMetaClient.reloadActiveTimeline();
+      // this code path refers to a re-attempted commit that:
+      // 1. got committed to metadat
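As a rough, self-contained illustration of the replay handling described in the commit message, here is a toy model with hypothetical names; it does not use Hudi's actual timeline or writer APIs. The idea: a retried clean/compaction reuses its instant time, so an instant that is already completed on the metadata table signals a replay, and its earlier changes are rolled back before the retry is applied.

```
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of the metadata-table (MDT) commit path; hypothetical types only.
class MetadataTableModel {
  private final Set<String> completedInstants = new LinkedHashSet<>();
  private final List<String> appliedChanges = new ArrayList<>();

  // A retried operation reuses the same instant time, so an instant that is
  // already completed here marks a replay: roll back its earlier changes
  // instead of only deleting the completed deltacommit.
  void commit(String instantTime, List<String> records) {
    if (completedInstants.contains(instantTime)) {
      rollback(instantTime);
    }
    for (String record : records) {
      appliedChanges.add(instantTime + ":" + record);
    }
    completedInstants.add(instantTime);
  }

  private void rollback(String instantTime) {
    appliedChanges.removeIf(change -> change.startsWith(instantTime + ":"));
    completedInstants.remove(instantTime);
  }

  List<String> appliedChanges() {
    return appliedChanges;
  }
}
```

With this model, calling commit("001", records) a second time after a simulated data-table failure leaves one copy of each record in appliedChanges rather than duplicates.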
[GitHub] [hudi] danny0405 merged pull request #8604: [HUDI-6151] Rollback previously applied commits to MDT when operations are retried.
danny0405 merged PR #8604: URL: https://github.com/apache/hudi/pull/8604 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] lokeshj1703 commented on a diff in pull request #9017: [HUDI-6393] Add functional tests for RecordLevelIndex
lokeshj1703 commented on code in PR #9017: URL: https://github.com/apache/hudi/pull/9017#discussion_r1246314270

## pom.xml: ##

@@ -175,7 +175,7 @@
     2.12.10
     ${scala12.version}
     2.8.1
-    2.12
+    2.11

Review Comment:
   Sorry! Forgot to remove this change. This was only for fixing the issues.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on a diff in pull request #9017: [HUDI-6393] Add functional tests for RecordLevelIndex
xushiyan commented on code in PR #9017: URL: https://github.com/apache/hudi/pull/9017#discussion_r1246304418

## pom.xml: ##

@@ -175,7 +175,7 @@
     2.12.10
     ${scala12.version}
     2.8.1
-    2.12
+    2.11

Review Comment:
   this is the default value which should be 2.12 because spark 3 is default now. If this is causing a problem, it means the test setup with spark 2.4 profile has some gap, which we need to only fix for that profile/setup

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9063: [HUDI-6448] Improve upgrade/downgrade for table ver. 6
hudi-bot commented on PR #9063: URL: https://github.com/apache/hudi/pull/9063#issuecomment-1612621031 ## CI report: * 69b2bb853be0f79845efd56f68b934b9f69ae22a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18160) * 4775dce07f2f3237b32f22b360f3423b1eafce85 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-5608) Support decimals w/ precision > 30 in Column Stats
[ https://issues.apache.org/jira/browse/HUDI-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738431#comment-17738431 ]

赵富午 commented on HUDI-5608:
---------------------------

Is there any new progress?

> Support decimals w/ precision > 30 in Column Stats
> ---------------------------------------------------
>
>                 Key: HUDI-5608
>                 URL: https://issues.apache.org/jira/browse/HUDI-5608
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 0.12.2
>            Reporter: Alexey Kudinkin
>            Priority: Critical
>             Fix For: 0.14.0
>
> As reported in: [https://github.com/apache/hudi/issues/7732]
>
> Currently we've limited precision of the supported decimals at 30 assuming that this number is reasonably high to cover 99% of use-cases, but it seems like there's still a demand for even larger Decimals.
> The challenge is however to balance the need to support longer Decimals vs storage space we have to provision for each one of them.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
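To make the precision-vs-storage trade-off in the issue above concrete, here is a small worked example. It is an illustration, not Hudi's column-stats code, and it assumes the unscaled value is stored as a signed two's-complement fixed-width field, as Avro/Parquet-style decimal encodings do: precision 30 needs 13 bytes per value, while precision 38 already needs 16.

```
public class DecimalWidth {

  // Smallest byte count n such that 10^precision - 1 fits in a signed
  // two's-complement field of n bytes, i.e. ceil((precision * log2(10) + 1) / 8).
  static int minBytesForPrecision(int precision) {
    double log2of10 = Math.log(10) / Math.log(2);
    return (int) Math.ceil((precision * log2of10 + 1) / 8.0);
  }

  public static void main(String[] args) {
    for (int p : new int[] {10, 20, 30, 38, 50}) {
      System.out.println("precision " + p + " -> " + minBytesForPrecision(p) + " bytes");
    }
  }
}
```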
[GitHub] [hudi] hudi-bot commented on pull request #8604: [HUDI-6151] Rollback previously applied commits to MDT when operations are retried.
hudi-bot commented on PR #8604: URL: https://github.com/apache/hudi/pull/8604#issuecomment-1612619567 ## CI report: * eb39bc7559945e199e43a2a3d51e1ab15b4e3e2f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18183) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9087: [HUDI-6329] Write pipelines for table with consistent bucket index would detect whether clustering service occurs and automatically adjust the
hudi-bot commented on PR #9087: URL: https://github.com/apache/hudi/pull/9087#issuecomment-1612610932 ## CI report: * 1bc4ea70966fd2c2cbd7cea126f4fd6b5c875077 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18181) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9088: [MINOR] Improve CollectionUtils helper methods
hudi-bot commented on PR #9088: URL: https://github.com/apache/hudi/pull/9088#issuecomment-1612610988 ## CI report: * fb282b7602962846c4f561cd101033fca41e43d6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18182) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org