Re: [PR] [HUDI-6898] Medatawriter closing in tests, update logging [hudi]
yihua commented on code in PR #9768: URL: https://github.com/apache/hudi/pull/9768#discussion_r1369690946 ## hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java: ## @@ -3545,6 +3546,7 @@ private List getAllFiles(HoodieTableMetadata metadata) throws Exception { return allfiles; } + // TODO Review Comment: nit: should be removed? ## pom.xml: ## @@ -115,7 +115,7 @@ 2.17.2 1.7.36 2.9.9 -2.10.1 +2.10.2 Review Comment: Avoid version upgrade in this PR? ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieMergeOnReadSnapshotReader.java: ## @@ -137,7 +137,8 @@ public HoodieMergeOnReadSnapshotReader(String tableBasePath, } } } -LOG.debug("Time taken to merge base file and log file records: {}", timer.endTimer()); +long executionTime = timer.endTimer(); +LOG.debug("Time taken to merge base file and log file records: {}", executionTime); Review Comment: nit: no need to change? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
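The logging nit above touches a common Java point: with SLF4J-style parameterized logging, an argument such as `timer.endTimer()` is evaluated eagerly regardless of the log level, so extracting it into a local variable changes readability only, not behavior. A minimal sketch (plain Java stand-ins, not Hudi's or SLF4J's actual classes) illustrating the eager evaluation:

```java
// Sketch: method arguments are evaluated before the call, so a
// side-effecting argument (like stopping a timer) runs even when
// the "log level" would suppress the message.
public class EagerArgDemo {
    static int timerStops = 0;

    // hypothetical stand-in for timer.endTimer()
    static long endTimer() {
        timerStops++;
        return 42L;
    }

    // hypothetical stand-in for LOG.debug(String, Object)
    static void debug(String fmt, long arg) {
        boolean debugEnabled = false; // pretend DEBUG is disabled
        if (debugEnabled) {
            System.out.println(fmt.replace("{}", Long.toString(arg)));
        }
    }

    public static void main(String[] args) {
        debug("Time taken: {}", endTimer()); // argument still evaluated
        long executionTime = endTimer();     // equivalent: explicit local
        debug("Time taken: {}", executionTime);
        if (timerStops != 2) {
            throw new AssertionError("expected 2 stops, got " + timerStops);
        }
    }
}
```

Either way the timer stops exactly once per log call, which is why the reviewer considers the refactor unnecessary.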
Re: [PR] [HUDI-6932] Updated batch size for delete partitions for Glue sync tool [hudi]
hudi-bot commented on PR #9842: URL: https://github.com/apache/hudi/pull/9842#issuecomment-1776625853 ## CI report: * 10d1cad3a2625c7276c6d8d04c4c258f732e9af8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20270) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6932] Updated batch size for delete partitions for Glue sync tool [hudi]
hudi-bot commented on PR #9842: URL: https://github.com/apache/hudi/pull/9842#issuecomment-1776617382 ## CI report: * 10d1cad3a2625c7276c6d8d04c4c258f732e9af8 UNKNOWN
Re: [PR] [HUDI-6896] HoodieAvroHFileReader.RecordIterator iteration never terminates [hudi]
yihua commented on code in PR #9789: URL: https://github.com/apache/hudi/pull/9789#discussion_r1369683789 ## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java: ## @@ -684,6 +685,10 @@ private static class RecordIterator implements ClosableIterator { public boolean hasNext() { try { // NOTE: This is required for idempotency +if (eof) { + return false; +} Review Comment: Under what condition does the infinite iteration happen? How can it be reproduced in a test?
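The fix under review adds an `eof` flag so that `hasNext()` stays idempotent once the underlying scanner is exhausted. A simplified, hypothetical sketch of the pattern (an int array stands in for the HFile scanner; this is not Hudi's actual `HoodieAvroHFileReader.RecordIterator`):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch of an idempotent hasNext(): once the underlying source reports
// end-of-data, remember it, so repeated hasNext() calls after exhaustion
// return false instead of re-driving the stateful source.
public class IdempotentIterator implements Iterator<Integer> {
    private final int[] source;   // stand-in for the HFile scanner
    private int pos = 0;
    private Integer next = null;  // buffered record, if any
    private boolean eof = false;

    public IdempotentIterator(int[] source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        if (eof) {
            return false;  // the guard under review: short-circuit after EOF
        }
        if (next != null) {
            return true;   // required for idempotency: don't advance twice
        }
        if (pos >= source.length) {
            eof = true;
            return false;
        }
        next = source[pos++];
        return true;
    }

    @Override
    public Integer next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Integer result = next;
        next = null;
        return result;
    }
}
```

Calling `hasNext()` twice in a row must not consume a record; the buffered `next` plus the `eof` flag together give that guarantee.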
Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]
stream2000 commented on PR #9887: URL: https://github.com/apache/hudi/pull/9887#issuecomment-1776608491 > so in such case, files are always created already? @boneanxs We are still checking the Spark source code to confirm the mechanism. However, in my local test we found that new files were written after the rollback was scheduled. You can add a breakpoint at the `abort` method and run the test to reproduce it locally.
Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]
stream2000 commented on PR #9887: URL: https://github.com/apache/hudi/pull/9887#issuecomment-1776604419 > in the test I don't see explicit failure injection. How is the abort called and is it deterministically triggered in the test? ```java // We can only upsert to existing consistent hashing bucket index table checkExceptionContain(insertStatement)("Consistent Hashing bulk_insert only support write to new file group") ``` @yihua We don't allow bulk inserting into a consistent hashing index table that already has parquet files, because bulk insert v2 only supports writing parquet for now. So bulk inserting into such a table throws an exception, and it is deterministic.
Re: [PR] [HUDI-6959] Bulk insert V2 do not rollback failed instant on abort [hudi]
yihua commented on PR #9887: URL: https://github.com/apache/hudi/pull/9887#issuecomment-1776599636 @stream2000 in the test I don't see explicit failure injection. How is the `abort` called and is it deterministically triggered in the test?
Re: [PR] [HUDI-6482] Supports new compaction strategy DayBasedAndBoundedIOCompactionStrategy [hudi]
yihua commented on PR #9126: URL: https://github.com/apache/hudi/pull/9126#issuecomment-1776582776 @ksmou could you try reopening the PR on your side? I'm not able to reopen it.
Re: [PR] [HUDI-6960] Support read partition values from path when schema evolution enabled [hudi]
yihua commented on code in PR #9889: URL: https://github.com/apache/hudi/pull/9889#discussion_r1369661651 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala: ## @@ -149,27 +152,10 @@ case class BaseFileOnlyRelation(override val sqlContext: SQLContext, val enableFileIndex = HoodieSparkConfUtils.getConfigValue(optParams, sparkSession.sessionState.conf, ENABLE_HOODIE_FILE_INDEX.key, ENABLE_HOODIE_FILE_INDEX.defaultValue.toString).toBoolean if (enableFileIndex && globPaths.isEmpty) { - // NOTE: There are currently 2 ways partition values could be fetched: - // - Source columns (producing the values used for physical partitioning) will be read - // from the data file - // - Values parsed from the actual partition path would be appended to the final dataset - // - //In the former case, we don't need to provide the partition-schema to the relation, - //therefore we simply stub it w/ empty schema and use full table-schema as the one being - //read from the data file. Review Comment: @wecharyu `shouldExtractPartitionValuesFromPartitionPath` can still return `false` based on super class? 
## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestGetPartitionValuesFromPath.scala: ## @@ -90,4 +90,37 @@ class TestGetPartitionValuesFromPath extends HoodieSparkSqlTestBase { } } } + + test("Test get partition values from path when schema evolution applied") { +withTable(generateTableName) { tableName => + spark.sql( +s""" + |create table $tableName ( + | id int, + | name string, + | ts bigint, + | region string, + | dt date + |) using hudi + |tblproperties ( + | primaryKey = 'id', + | type = 'cow', + | preCombineField = 'ts', + | hoodie.datasource.write.drop.partition.columns = 'true' + |) + |partitioned by (region, dt)""".stripMargin) + + spark.sql(s"insert into $tableName partition (region='reg1', dt='2023-10-01') select 1, 'name1', 1000") + checkAnswer(s"select id, name, ts, region, cast(dt as string) from $tableName")( +Seq(1, "name1", 1000, "reg1", "2023-10-01") + ) Review Comment: When writing the table, `hoodie.schema.on.read.enable=true` should also be set to enable schema evolution on read.
Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]
yihua commented on code in PR #9895: URL: https://github.com/apache/hudi/pull/9895#discussion_r1369644417 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/command/index/TestIndexSyntax.scala: ## @@ -28,59 +29,61 @@ import org.apache.spark.sql.hudi.command.{CreateIndexCommand, DropIndexCommand, class TestIndexSyntax extends HoodieSparkSqlTestBase { test("Test Create/Drop/Show/Refresh Index") { -withTempDir { tmp => - Seq("cow", "mor").foreach { tableType => -val databaseName = "default" -val tableName = generateTableName -val basePath = s"${tmp.getCanonicalPath}/$tableName" -spark.sql( - s""" - |create table $tableName ( - | id int, - | name string, - | price double, - | ts long - |) using hudi - | options ( - | primaryKey ='id', - | type = '$tableType', - | preCombineField = 'ts' - | ) - | partitioned by(ts) - | location '$basePath' +if (HoodieSparkUtils.gteqSpark3_2) { Review Comment: Looks like `TestSecondaryIndex` should also have a precondition on the spark version. 
## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/IndexCommands.scala: ## @@ -32,23 +31,21 @@ import org.apache.spark.sql.hudi.HoodieSqlCommonUtils.getTableLocation import org.apache.spark.sql.{Row, SparkSession} import java.util - import scala.collection.JavaConverters.{collectionAsScalaIterableConverter, mapAsJavaMapConverter} case class CreateIndexCommand(table: CatalogTable, indexName: String, indexType: String, ignoreIfExists: Boolean, - columns: Seq[(Attribute, Map[String, String])], - options: Map[String, String], - override val output: Seq[Attribute]) extends IndexBaseCommand { + columns: Seq[(Seq[String], Map[String, String])], + options: Map[String, String]) extends IndexBaseCommand { override def run(sparkSession: SparkSession): Seq[Row] = { val tableId = table.identifier val metaClient = createHoodieTableMetaClient(tableId, sparkSession) val columnsMap: java.util.LinkedHashMap[String, java.util.Map[String, String]] = new util.LinkedHashMap[String, java.util.Map[String, String]]() -columns.map(c => columnsMap.put(c._1.name, c._2.asJava)) +columns.map(c => columnsMap.put(c._1.mkString("."), c._2.asJava)) Review Comment: Why change this? for nested fields? ## hudi-spark-datasource/hudi-spark3.3.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark3_3ExtendedSqlAstBuilder.scala: ## @@ -3327,6 +3327,145 @@ class HoodieSpark3_3ExtendedSqlAstBuilder(conf: SQLConf, delegate: ParserInterfa position = Option(ctx.colPosition).map(pos => UnresolvedFieldPosition(typedVisit[ColumnPosition](pos } + + /** Review Comment: I assume the SQL parsing of INDEX SQL statement should not be different across Spark versions. 
## hudi-spark-datasource/hudi-spark3.2.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark3_2ExtendedSqlAstBuilder.scala: ## @@ -3317,6 +3317,145 @@ class HoodieSpark3_2ExtendedSqlAstBuilder(conf: SQLConf, delegate: ParserInterfa position = Option(ctx.colPosition).map(pos => UnresolvedFieldPosition(typedVisit[ColumnPosition](pos } + + /** Review Comment: Got it. So at least CreateIndex is still supported in Spark 3.2. ## hudi-spark-datasource/hudi-spark/src/main/antlr4/org/apache/hudi/spark/sql/parser/HoodieSqlCommon.g4: ## @@ -135,51 +120,13 @@ nonReserved : CALL | COMPACTION - | CREATE - | DROP - | EXISTS - | FROM - | IN - | INDEX - | INDEXES - | IF Review Comment: Do we still need some of these tokens for other SQL statements? ## hudi-spark-datasource/hudi-spark3.3.x/src/main/antlr4/org/apache/hudi/spark/sql/parser/HoodieSqlBase.g4: ## @@ -29,5 +29,12 @@ statement | createTableHeader ('(' colTypeList ')')? tableProvider? createTableClauses (AS? query)? #createTable +| CREATE INDEX (IF NOT EXISTS)? identifier ON TABLE? Review Comment: Could we still maintain the grammar in a single place for all Spark versions, but fail the logical plan of INDEX SQL statement in Spark 3.1 and below, so the grammar can be easily maintained? ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/parser/HoodieSqlCommonAstBuilder.scala: ## @@ -149,144 +149,4 @@ class HoodieSqlCommonAstBuilder(session: SparkSession, delegate: ParserI
Re: [PR] [HUDI-6963] Fix class conflict of CreateIndex from Spark3.3 [hudi]
yihua commented on PR #9895: URL: https://github.com/apache/hudi/pull/9895#issuecomment-1776548151 cc @codope
Re: [I] [SUPPORT] Facing java.util.NoSuchElementException on EMR 6.12 (Hudi 0.13) with inline compaction and cleaning on MoR tables [hudi]
ad1happy2go commented on issue #9861: URL: https://github.com/apache/hudi/issues/9861#issuecomment-1776508692 @arunvasudevan Are you on the Hudi Slack? If yes, could you message me there so we can have a call to understand the issue better? Thanks.
Re: [I] [SUPPORT] merge into hudi table with ArrayIndexOutOfBoundsException error [hudi]
ad1happy2go commented on issue #9865: URL: https://github.com/apache/hudi/issues/9865#issuecomment-1776506908 @zyclove can you give more details on which MERGE INTO you are trying, along with your table configuration? I can then check whether what you are facing is a known issue or not.
Re: [I] [SUPPORT] HoodieCompaction with schema parse NullPointerException [hudi]
ad1happy2go commented on issue #9902: URL: https://github.com/apache/hudi/issues/9902#issuecomment-1776503543 Yes, the table version is upgraded automatically when you write using the new release; 0.14.0 uses table version 6, so that behaviour is expected. I'm not sure why it failed, though. I will also create a table using 0.12.3, try the upgrade, and see if I hit any issues. Do you use Slack? If yes, you can join the Hudi community Slack and we can sync up there.
Re: [I] [SUPPORT] HoodieCompaction with schema parse NullPointerException [hudi]
zyclove commented on issue #9902: URL: https://github.com/apache/hudi/issues/9902#issuecomment-1776497904 Is there a WeChat group or another communication channel where we can talk to each other? The community group I joined before felt very inactive, and no one discussed the issues. @ad1happy2go
Re: [I] [SUPPORT] HoodieCompaction with schema parse NullPointerException [hudi]
zyclove commented on issue #9902: URL: https://github.com/apache/hudi/issues/9902#issuecomment-1776496036 @ad1happy2go This issue is the same as https://github.com/apache/hudi/issues/9016. It was caused by the upgrade to version 0.14: after the upgrade, the problem suddenly appeared after running for a few days. After working on it all morning yesterday there was really nothing I could do, so I cleaned up the historical data and ran it again, and it became normal afterwards. Is it still caused by a version compatibility issue? The table was on 0.12.3 before. After directly upgrading to the 0.14 bundle package, I found that the version in the table's hoodie.properties file changed from 5 to 6. Does this mean the version was upgraded normally? There was no manual upgrade-table operation through commands.
Re: [I] [BUG]hudi cli command with Wrong FS error [hudi]
zyclove commented on issue #9903: URL: https://github.com/apache/hudi/issues/9903#issuecomment-1776479980 @ad1happy2go connect --path s:// is ok. compactions show all works well.
Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]
beyond1920 commented on code in PR #9896: URL: https://github.com/apache/hudi/pull/9896#discussion_r1369499290 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -2616,6 +2617,22 @@ public Integer getWritesFileIdEncoding() { return props.getInteger(WRITES_FILEID_ENCODING, HoodieMetadataPayload.RECORD_INDEX_FIELD_FILEID_ENCODING_UUID); } + public boolean needResolveWriteConflict(Option operationType) { +if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) { + // Skip to resolve conflict for non bulk_insert operation if using non-blocking concurrency control + // TODO: skip resolve conflict if the option is empty or the inner operation type is UNKNOWN ? + return !isNonBlockingConcurrencyControl() || mayBeBulkInsert(operationType); Review Comment: Just to confirm the following two cases: 1. Do we want to skip conflict resolution if the operationType is `Option.empty` or the inner operation type is `UNKNOWN`? 2. If the operationType is `Option.empty`, `operationType.get()` would throw `NoSuchElementException`. Is that what we want? Or should we use `BULK_INSERT.equals(operationType.orElse(null))`?
Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]
hudi-bot commented on PR #9894: URL: https://github.com/apache/hudi/pull/9894#issuecomment-1776409405 ## CI report: * 75e98fe81be61e02f30d41d798ea86b733a26e2a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20448)
Re: [PR] [HUDI-6971] OOM caused by configuring read.start_commit as earliest in stream reading [hudi]
danny0405 commented on code in PR #9906: URL: https://github.com/apache/hudi/pull/9906#discussion_r1369461753 ## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java: ## @@ -407,21 +408,23 @@ private Result getHollowInputSplits( } @Nullable - private InstantRange getInstantRange(String issuedInstant, String instantToIssue, boolean nullableBoundary) { + private InstantRange getInstantRange(String issuedInstant, String startInstant, String instantToIssue, boolean nullableBoundary) { if (issuedInstant != null) { // the streaming reader may record the last issued instant, if the issued instant is present, // the instant range should be: (issued instant, the latest instant]. return InstantRange.builder().startInstant(issuedInstant).endInstant(instantToIssue) .nullableBoundary(nullableBoundary).rangeType(InstantRange.RangeType.OPEN_CLOSE).build(); -} else if (this.conf.getOptional(FlinkOptions.READ_START_COMMIT).isPresent()) { - // first time consume and has a start commit +} else if (this.conf.getOptional(FlinkOptions.READ_START_COMMIT).isPresent() +&& !this.conf.getString(FlinkOptions.READ_START_COMMIT).equalsIgnoreCase(FlinkOptions.START_COMMIT_LATEST)) { + // first time consume , consumes form earliest commit or consumes from a start commit. final String startCommit = this.conf.getString(FlinkOptions.READ_START_COMMIT); return startCommit.equalsIgnoreCase(FlinkOptions.START_COMMIT_EARLIEST) - ? null + ? InstantRange.builder().startInstant(startInstant).endInstant(instantToIssue) Review Comment: Reading from the latest commit is the default behavior.
Re: [PR] [HUDI-6798] Add record merging mode and implement event-time ordering in the new file group reader [hudi]
hudi-bot commented on PR #9894: URL: https://github.com/apache/hudi/pull/9894#issuecomment-1776362033 ## CI report: * 74dab9f4a045822aef5565ff24cb8bbf15ef0f65 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20410) * 75e98fe81be61e02f30d41d798ea86b733a26e2a UNKNOWN
Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]
danny0405 commented on code in PR #9896: URL: https://github.com/apache/hudi/pull/9896#discussion_r1369460543 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -2616,6 +2617,22 @@ public Integer getWritesFileIdEncoding() { return props.getInteger(WRITES_FILEID_ENCODING, HoodieMetadataPayload.RECORD_INDEX_FIELD_FILEID_ENCODING_UUID); } + public boolean needResolveWriteConflict(Option operationType) { +if (getWriteConcurrencyMode().supportsOptimisticConcurrencyControl()) { + // Skip to resolve conflict for non bulk_insert operation if using non-blocking concurrency control + // TODO: skip resolve conflict if the option is empty or the inner operation type is UNKNOWN ? + return !isNonBlockingConcurrencyControl() || mayBeBulkInsert(operationType); Review Comment: Remove the Option from the param and change the logic to: ```java return BULK_INSERT.equals(operationType.get()) || !isNonBlockingConcurrencyControl() ```
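Both suggestions above hinge on `Option`/`Optional` semantics: `get()` on an empty container throws `NoSuchElementException`, while `orElse(null)` degrades to a null-safe comparison. A small sketch using `java.util.Optional` (Hudi's own `Option` class behaves analogously, but this is not its API):

```java
import java.util.NoSuchElementException;
import java.util.Optional;

public class OptionSemanticsDemo {
    // Null-safe check in the style of BULK_INSERT.equals(operationType.orElse(null))
    static boolean isBulkInsert(Optional<String> operationType) {
        return "BULK_INSERT".equals(operationType.orElse(null));
    }

    // get()-based check: throws on an empty Optional
    static boolean isBulkInsertUnsafe(Optional<String> operationType) {
        return "BULK_INSERT".equals(operationType.get());
    }

    public static void main(String[] args) {
        // Present value: both variants agree.
        if (!isBulkInsert(Optional.of("BULK_INSERT"))) throw new AssertionError();
        if (isBulkInsert(Optional.of("UPSERT"))) throw new AssertionError();
        // Empty value: orElse(null) yields false, get() throws.
        if (isBulkInsert(Optional.empty())) throw new AssertionError();
        try {
            isBulkInsertUnsafe(Optional.empty());
            throw new AssertionError("expected NoSuchElementException");
        } catch (NoSuchElementException expected) {
            // this is the failure mode the review comment warns about
        }
    }
}
```

This is why removing the `Option` wrapper from the parameter (or using `orElse(null)`) matters: it makes the empty case explicit instead of a runtime exception.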
Re: [PR] [HUDI-6482] Supports new compaction strategy DayBasedAndBoundedIOCompactionStrategy [hudi]
ksmou commented on PR #9126: URL: https://github.com/apache/hudi/pull/9126#issuecomment-1776358774 @yihua please reopen this; I deleted it by mistake.
[jira] [Updated] (HUDI-6798) Implement event-time-based merging mode in FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6798: Status: Patch Available (was: In Progress) > Implement event-time-based merging mode in FileGroupReader > -- > > Key: HUDI-6798 > URL: https://issues.apache.org/jira/browse/HUDI-6798 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1369448954 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecord.java: ## @@ -195,6 +206,10 @@ public HoodieKey getKey() { return key; } + public boolean isPartial() { +return isPartial; Review Comment: -1, does not make sense
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1369448429 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java: ## @@ -126,12 +128,13 @@ protected Option doProcessNextDataRecord(T record, // Merge and store the combined record // Note that the incoming `record` is from an older commit, so it should be put as // the `older` in the merge API + HoodieRecord combinedRecord = (HoodieRecord) recordMerger.merge( - readerContext.constructHoodieRecord(Option.of(record), metadata, readerSchema), - readerSchema, + readerContext.constructHoodieRecord(Option.of(record), metadata), + (Schema) metadata.get(INTERNAL_META_SCHEMA), readerContext.constructHoodieRecord( - existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight(), readerSchema), - readerSchema, + existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight()), + (Schema) existingRecordMetadataPair.getRight().get(INTERNAL_META_SCHEMA), payloadProps).get().getLeft(); Review Comment: But it is specific per-file at least, right? Then we can initialize it each time the reader prepares to read a new file.
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
danny0405 commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1369448429 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java: ## @@ -126,12 +128,13 @@ protected Option doProcessNextDataRecord(T record, // Merge and store the combined record // Note that the incoming `record` is from an older commit, so it should be put as // the `older` in the merge API + HoodieRecord combinedRecord = (HoodieRecord) recordMerger.merge( - readerContext.constructHoodieRecord(Option.of(record), metadata, readerSchema), - readerSchema, + readerContext.constructHoodieRecord(Option.of(record), metadata), + (Schema) metadata.get(INTERNAL_META_SCHEMA), readerContext.constructHoodieRecord( - existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight(), readerSchema), - readerSchema, + existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight()), + (Schema) existingRecordMetadataPair.getRight().get(INTERNAL_META_SCHEMA), payloadProps).get().getLeft(); Review Comment: But it is specific per-file at least, right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
danny0405 commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1369447936 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala: ## @@ -411,10 +414,14 @@ object ExpressionPayload { parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA)) } - private def getWriterSchema(props: Properties): Schema = { - ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key), - s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property") -parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key)) + private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): Schema = { +if (isPartialUpdate) { + parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key)) Review Comment: Generally we may have 3 modes for fields that are not updated in a partial update: 1. keep it as it is; 2. force update it to null (which I think should never happen in a real case); 3. overwrite with the default (if the default is defined in the schema). I think 1 is the most natural handling, but in any case, the reader should always use its own reader schema for merging, not the writer schema. Another question is when to evolve the table schema: does it happen before or after the commit succeeds? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
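The three modes discussed above can be illustrated with a small sketch. This is not Hudi code — the class and method names below are illustrative assumptions, and records are modeled as plain maps — it only demonstrates mode 1 ("keep it as it is"), where fields absent from the partial update retain their existing values:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not a Hudi API): mode 1 of partial-update merging,
// where any field not present in the partial update keeps the value from
// the existing record.
public class PartialUpdateSketch {

  // Start from the existing record, then overwrite only the fields that
  // appear in the partial update.
  public static Map<String, Object> merge(Map<String, Object> existing,
                                          Map<String, Object> partialUpdate) {
    Map<String, Object> merged = new HashMap<>(existing);
    merged.putAll(partialUpdate);
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Object> existing = new HashMap<>();
    existing.put("id", 1);
    existing.put("name", "a");
    existing.put("price", 10);

    Map<String, Object> update = new HashMap<>();
    update.put("price", 12); // the partial update touches only "price"

    Map<String, Object> merged = merge(existing, update);
    // "name" is kept as-is, "price" is overwritten
    System.out.println(merged.get("name") + " " + merged.get("price"));
  }
}
```

Mode 2 would instead start from an empty map pre-filled with nulls for all schema fields, and mode 3 from a map pre-filled with schema defaults; the reader-schema point in the comment is about which field set drives that pre-fill.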
Re: [PR] [HUDI-6539] New LSM tree style archived timeline [hudi]
danny0405 commented on PR #9209: URL: https://github.com/apache/hudi/pull/9209#issuecomment-1776308765 > Hello, does the master branch now support lsm format merge? @danny0405 No, only the archived timeline uses LSM layout for instants access. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6973] Instantiate HoodieFileGroupRecordBuffer inside new file group reader [hudi]
hudi-bot commented on PR #9910: URL: https://github.com/apache/hudi/pull/9910#issuecomment-1776284518 ## CI report: * f158692bc1611582566b3bbd76e49d07a290e802 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20447) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6973] Instantiate HoodieFileGroupRecordBuffer inside new file group reader [hudi]
hudi-bot commented on PR #9910: URL: https://github.com/apache/hudi/pull/9910#issuecomment-1776271532 ## CI report: * f158692bc1611582566b3bbd76e49d07a290e802 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20447) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-6974) Cleanup config descriptions for consistent language and clarity
Bhavani Sudha created HUDI-6974: --- Summary: Cleanup config descriptions for consistent language and clarity Key: HUDI-6974 URL: https://issues.apache.org/jira/browse/HUDI-6974 Project: Apache Hudi Issue Type: Task Reporter: Bhavani Sudha Assignee: Bhavani Sudha -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6972) Fix redirection to individual config links
[ https://issues.apache.org/jira/browse/HUDI-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavani Sudha closed HUDI-6972. --- Resolution: Fixed > Fix redirection to individual config links > -- > > Key: HUDI-6972 > URL: https://issues.apache.org/jira/browse/HUDI-6972 > Project: Apache Hudi > Issue Type: Task >Reporter: Bhavani Sudha >Assignee: Bhavani Sudha >Priority: Minor > Labels: docs, pull-request-available > > Currently, the links for configs are not working as expected. The top of the > page is rendered instead of the actual config section. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6973] Instantiate HoodieFileGroupRecordBuffer inside new file group reader [hudi]
hudi-bot commented on PR #9910: URL: https://github.com/apache/hudi/pull/9910#issuecomment-1776231883 ## CI report: * f158692bc1611582566b3bbd76e49d07a290e802 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6973) Instantiate HoodieFileGroupRecordBuffer inside new file group reader
[ https://issues.apache.org/jira/browse/HUDI-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6973: Reviewers: Danny Chen, Lin Liu > Instantiate HoodieFileGroupRecordBuffer inside new file group reader > > > Key: HUDI-6973 > URL: https://issues.apache.org/jira/browse/HUDI-6973 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6928) Support position based merging in HoodieFileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-6928. --- Resolution: Fixed > Support position based merging in HoodieFileGroupReader > --- > > Key: HUDI-6928 > URL: https://issues.apache.org/jira/browse/HUDI-6928 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Lin Liu >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6973) Instantiate HoodieFileGroupRecordBuffer inside new file group reader
[ https://issues.apache.org/jira/browse/HUDI-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6973: Status: Patch Available (was: In Progress) > Instantiate HoodieFileGroupRecordBuffer inside new file group reader > > > Key: HUDI-6973 > URL: https://issues.apache.org/jira/browse/HUDI-6973 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6973) Instantiate HoodieFileGroupRecordBuffer inside new file group reader
[ https://issues.apache.org/jira/browse/HUDI-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6973: Status: In Progress (was: Open) > Instantiate HoodieFileGroupRecordBuffer inside new file group reader > > > Key: HUDI-6973 > URL: https://issues.apache.org/jira/browse/HUDI-6973 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6973) Instantiate HoodieFileGroupRecordBuffer inside new file group reader
[ https://issues.apache.org/jira/browse/HUDI-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6973: Fix Version/s: 1.0.0 > Instantiate HoodieFileGroupRecordBuffer inside new file group reader > > > Key: HUDI-6973 > URL: https://issues.apache.org/jira/browse/HUDI-6973 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6973) Instantiate HoodieFileGroupRecordBuffer inside new file group reader
[ https://issues.apache.org/jira/browse/HUDI-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6973: - Labels: pull-request-available (was: ) > Instantiate HoodieFileGroupRecordBuffer inside new file group reader > > > Key: HUDI-6973 > URL: https://issues.apache.org/jira/browse/HUDI-6973 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-6973] Instantiate HoodieFileGroupRecordBuffer inside new file group reader [hudi]
yihua opened a new pull request, #9910: URL: https://github.com/apache/hudi/pull/9910 ### Change Logs This PR refactors the new file group reader (`HoodieFileGroupReader`) to instantiate `HoodieFileGroupRecordBuffer` inside the file group reader's constructors, instead of being passed in from outside. ### Impact Simplifies the instantiation of the new file group reader. ### Risk level none ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
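The refactor pattern in the change log above — moving dependency construction from the call site into the constructor — can be sketched as follows. The class names are simplified stand-ins, not the actual `HoodieFileGroupReader`/`HoodieFileGroupRecordBuffer` signatures:

```java
// Illustrative sketch of the refactor: the reader derives its record buffer
// from its own configuration instead of requiring every call site to build
// and pass one in. Names are simplified stand-ins for the Hudi classes.
public class FileGroupReaderSketch {

  static class RecordBuffer {
    final boolean positionBasedMerging;

    RecordBuffer(boolean positionBasedMerging) {
      this.positionBasedMerging = positionBasedMerging;
    }
  }

  private final RecordBuffer recordBuffer;

  // Before: FileGroupReaderSketch(RecordBuffer buffer) forced each caller to
  // know how to construct the buffer. After: the reader builds it internally.
  public FileGroupReaderSketch(boolean positionBasedMerging) {
    this.recordBuffer = new RecordBuffer(positionBasedMerging);
  }

  public RecordBuffer getRecordBuffer() {
    return recordBuffer;
  }

  public static void main(String[] args) {
    FileGroupReaderSketch reader = new FileGroupReaderSketch(true);
    System.out.println(reader.getRecordBuffer().positionBasedMerging);
  }
}
```

The benefit, as the PR description states, is simpler instantiation: callers only supply configuration, and the buffer choice stays an internal detail of the reader.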
[jira] [Updated] (HUDI-6973) Instantiate HoodieFileGroupRecordBuffer inside new file group reader
[ https://issues.apache.org/jira/browse/HUDI-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6973: Epic Link: HUDI-6243 Story Points: 2 > Instantiate HoodieFileGroupRecordBuffer inside new file group reader > > > Key: HUDI-6973 > URL: https://issues.apache.org/jira/browse/HUDI-6973 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6973) Instantiate HoodieFileGroupRecordBuffer inside new file group reader
[ https://issues.apache.org/jira/browse/HUDI-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6973: Priority: Blocker (was: Major) > Instantiate HoodieFileGroupRecordBuffer inside new file group reader > > > Key: HUDI-6973 > URL: https://issues.apache.org/jira/browse/HUDI-6973 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6973) Instantiate HoodieFileGroupRecordBuffer inside new file group reader
Ethan Guo created HUDI-6973: --- Summary: Instantiate HoodieFileGroupRecordBuffer inside new file group reader Key: HUDI-6973 URL: https://issues.apache.org/jira/browse/HUDI-6973 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6973) Instantiate HoodieFileGroupRecordBuffer inside new file group reader
[ https://issues.apache.org/jira/browse/HUDI-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6973: --- Assignee: Ethan Guo > Instantiate HoodieFileGroupRecordBuffer inside new file group reader > > > Key: HUDI-6973 > URL: https://issues.apache.org/jira/browse/HUDI-6973 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [I] [SUPPORT] EMR 6.13.0 Hudi cleaning throws method not found for SIMS cache [hudi]
subash-metica commented on issue #9909: URL: https://github.com/apache/hudi/issues/9909#issuecomment-1776116310 Upon looking at the error, it is triggered for a MOR table — in my example only the metadata table is MOR, since the base table is COW. Looks like the error is not in cleaning but while performing compaction of the metadata table, which is MOR. Any leads on how to fix this issue? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6961] Fix deletes with custom delete field in DefaultHoodieRecordPayload [hudi]
yihua commented on PR #9892: URL: https://github.com/apache/hudi/pull/9892#issuecomment-1776087014 @danny0405 I also changed the payload creation logic for Flink. Could you also review the relevant changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [SUPPORT] EMR 6.13.0 Hudi cleaning throws method not found for SIMS cache [hudi]
subash-metica opened a new issue, #9909: URL: https://github.com/apache/hudi/issues/9909 Caused by: java.lang.IllegalStateException: com.github.benmanes.caffeine.cache.SIMS at com.github.benmanes.caffeine.cache.LocalCacheFactory.loadFactory(LocalCacheFactory.java:90) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache.LocalCacheFactory.newBoundedLocalCache(LocalCacheFactory.java:40) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalManualCache.(BoundedLocalCache.java:3947) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalManualCache.(BoundedLocalCache.java:3943) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache.Caffeine.build(Caffeine.java:1051) ~[__app__.jar:?] at org.apache.hudi.common.util.InternalSchemaCache.(InternalSchemaCache.java:72) ~[hudi-utilities-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1] ... 79 more Caused by: java.lang.NoSuchMethodException: no such constructor: com.github.benmanes.caffeine.cache.SIMS.(Caffeine,AsyncCacheLoader,boolean)void/newInvokeSpecial at java.lang.invoke.MemberName.makeAccessException(MemberName.java:974) ~[?:?] at java.lang.invoke.MemberName$Factory.resolveOrFail(MemberName.java:1117) ~[?:?] at java.lang.invoke.MethodHandles$Lookup.resolveOrFail(MethodHandles.java:3649) ~[?:?] at java.lang.invoke.MethodHandles$Lookup.findConstructor(MethodHandles.java:2750) ~[?:?] at com.github.benmanes.caffeine.cache.LocalCacheFactory.loadFactory(LocalCacheFactory.java:85) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache.LocalCacheFactory.newBoundedLocalCache(LocalCacheFactory.java:40) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalManualCache.(BoundedLocalCache.java:3947) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalManualCache.(BoundedLocalCache.java:3943) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache.Caffeine.build(Caffeine.java:1051) ~[__app__.jar:?] 
at org.apache.hudi.common.util.InternalSchemaCache.(InternalSchemaCache.java:72) ~[hudi-utilities-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1] ... 79 more Caused by: java.lang.NoSuchMethodError: com.github.benmanes.caffeine.cache.SIMS: method 'void (com.github.benmanes.caffeine.cache.Caffeine, com.github.benmanes.caffeine.cache.AsyncCacheLoader, boolean)' not found at java.lang.invoke.MethodHandleNatives.resolve(Native Method) ~[?:?] at java.lang.invoke.MemberName$Factory.resolve(MemberName.java:1085) ~[?:?] **To Reproduce** Steps to reproduce the behavior: 1. Create a COW Hudi table with 10 commits, and then perform delete. The cleaning kicks off but fails with error. **Expected behavior** Successful clean operation **Environment Description** * EMR Version: 6.13.0 * Hudi version : 0.13.1-amz * Spark version : 3.3.2 * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : no **Additional context** Add any other context about the problem here. **Stacktrace** Caused by: java.lang.IllegalStateException: com.github.benmanes.caffeine.cache.SIMS at com.github.benmanes.caffeine.cache.LocalCacheFactory.loadFactory(LocalCacheFactory.java:90) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache.LocalCacheFactory.newBoundedLocalCache(LocalCacheFactory.java:40) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalManualCache.(BoundedLocalCache.java:3947) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalManualCache.(BoundedLocalCache.java:3943) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache.Caffeine.build(Caffeine.java:1051) ~[__app__.jar:?] at org.apache.hudi.common.util.InternalSchemaCache.(InternalSchemaCache.java:72) ~[hudi-utilities-bundle_2.12-0.13.1-amzn-1.jar:0.13.1-amzn-1] ... 
79 more Caused by: java.lang.NoSuchMethodException: no such constructor: com.github.benmanes.caffeine.cache.SIMS.(Caffeine,AsyncCacheLoader,boolean)void/newInvokeSpecial at java.lang.invoke.MemberName.makeAccessException(MemberName.java:974) ~[?:?] at java.lang.invoke.MemberName$Factory.resolveOrFail(MemberName.java:1117) ~[?:?] at java.lang.invoke.MethodHandles$Lookup.resolveOrFail(MethodHandles.java:3649) ~[?:?] at java.lang.invoke.MethodHandles$Lookup.findConstructor(MethodHandles.java:2750) ~[?:?] at com.github.benmanes.caffeine.cache.LocalCacheFactory.loadFactory(LocalCacheFactory.java:85) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache.LocalCacheFactory.newBoundedLocalCache(LocalCacheFactory.java:40) ~[__app__.jar:?] at com.github.benmanes.caffeine.cache
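Errors like the `NoSuchMethodError` on `com.github.benmanes.caffeine.cache.SIMS` above typically point to two incompatible copies of a library on the classpath (for example, a class loaded from the application jar while its collaborator is loaded from a bundle jar). One hedged way to diagnose this — assuming nothing about the Hudi setup beyond standard Java classloading — is to print which code source supplied the class in question:

```java
import java.security.CodeSource;

// Diagnostic sketch: report which jar (code source) supplied a given class.
// Useful when two copies of a library (e.g. Caffeine shaded into one jar and
// unshaded in another) collide on the classpath.
public class WhichJar {

  // Returns the jar/path that supplied the class, or a marker for classes
  // loaded by the bootstrap loader (which have no code source).
  static String locate(String className) throws ClassNotFoundException {
    Class<?> clazz = Class.forName(className);
    CodeSource source = clazz.getProtectionDomain().getCodeSource();
    return source == null ? "JDK built-in" : source.getLocation().toString();
  }

  public static void main(String[] args) throws Exception {
    // On a Hudi classpath one would pass e.g.
    // "com.github.benmanes.caffeine.cache.Caffeine" as the argument.
    String name = args.length > 0 ? args[0] : "java.lang.String";
    System.out.println(name + " -> " + locate(name));
  }
}
```

If the Caffeine classes resolve to a different jar than expected (e.g. `__app__.jar` versus the `hudi-utilities-bundle` jar, as the mixed jar names in the trace suggest), aligning or shading the dependency is the usual fix.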
[jira] [Commented] (HUDI-6910) Handle schema evolution across base and log files in HoodieFileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778832#comment-17778832 ] Ethan Guo commented on HUDI-6910: - Part of the changes in HUDI-6801 should fix this. > Handle schema evolution across base and log files in HoodieFileGroupReader > -- > > Key: HUDI-6910 > URL: https://issues.apache.org/jira/browse/HUDI-6910 > Project: Apache Hudi > Issue Type: Task >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > > Goal: When the schema evolves from base to log files, the new > HoodieFileGroupReader should handle the schema evolution within the file > group properly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6910) Handle schema evolution across base and log files in HoodieFileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6910: Status: Patch Available (was: In Progress) > Handle schema evolution across base and log files in HoodieFileGroupReader > -- > > Key: HUDI-6910 > URL: https://issues.apache.org/jira/browse/HUDI-6910 > Project: Apache Hudi > Issue Type: Task >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > > Goal: When the schema evolves from base to log files, the new > HoodieFileGroupReader should handle the schema evolution within the file > group properly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6801) Implement merging of partial updates in FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6801: Reviewers: Danny Chen > Implement merging of partial updates in FileGroupReader > --- > > Key: HUDI-6801 > URL: https://issues.apache.org/jira/browse/HUDI-6801 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6801) Implement merging of partial updates in FileGroupReader
[ https://issues.apache.org/jira/browse/HUDI-6801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6801: Status: Patch Available (was: In Progress) > Implement merging of partial updates in FileGroupReader > --- > > Key: HUDI-6801 > URL: https://issues.apache.org/jira/browse/HUDI-6801 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6956) Fix CI failure on master
[ https://issues.apache.org/jira/browse/HUDI-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo closed HUDI-6956. --- Resolution: Fixed > Fix CI failure on master > > > Key: HUDI-6956 > URL: https://issues.apache.org/jira/browse/HUDI-6956 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > > CI failure in GH action running on Spark 2.4 > {code:java} > 2023-10-18T08:25:11.0927081Z - Test multiple partition fields pruning *** > FAILED *** > 2023-10-18T08:25:11.0928903Z Ā > org.apache.spark.sql.catalyst.parser.ParseException: extraneous input ';' > expecting (line 2, pos 53) > 2023-10-18T08:25:11.0930214ZĀ > 2023-10-18T08:25:11.0930814Z == SQL == > 2023-10-18T08:25:11.0931092ZĀ > 2023-10-18T08:25:11.0931565Z select * from h171 where day='2023-10-12' and > hour=11; > 2023-10-18T08:25:11.0932258Z > -^^^ > 2023-10-18T08:25:11.0933281Z Ā at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241) > 2023-10-18T08:25:11.0934664Z Ā at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117) > 2023-10-18T08:25:11.0935909Z Ā at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) > 2023-10-18T08:25:11.0937200Z Ā at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69) > 2023-10-18T08:25:11.0938893Z Ā at > org.apache.spark.sql.hudi.parser.HoodieSpark2ExtendedSqlParser$$anonfun$parsePlan$1.apply(HoodieSpark2ExtendedSqlParser.scala:45) > 2023-10-18T08:25:11.0940866Z Ā at > org.apache.spark.sql.hudi.parser.HoodieSpark2ExtendedSqlParser$$anonfun$parsePlan$1.apply(HoodieSpark2ExtendedSqlParser.scala:42) > 2023-10-18T08:25:11.0942715Z Ā at > org.apache.spark.sql.hudi.parser.HoodieSpark2ExtendedSqlParser.parse(HoodieSpark2ExtendedSqlParser.scala:80) > 2023-10-18T08:25:11.0944508Z Ā at > 
org.apache.spark.sql.hudi.parser.HoodieSpark2ExtendedSqlParser.parsePlan(HoodieSpark2ExtendedSqlParser.scala:42) > 2023-10-18T08:25:11.0946437Z Ā at > org.apache.spark.sql.parser.HoodieCommonSqlParser$$anonfun$parsePlan$1.apply(HoodieCommonSqlParser.scala:43) > 2023-10-18T08:25:11.0948031Z Ā at > org.apache.spark.sql.parser.HoodieCommonSqlParser$$anonfun$parsePlan$1.apply(HoodieCommonSqlParser.scala:40) > 2023-10-18T08:25:11.0949087Z Ā ... > 2023-10-18T08:25:31.8632763Z - Test single partiton field pruning *** FAILED > *** > 2023-10-18T08:25:31.8634653Z Ā > org.apache.spark.sql.catalyst.parser.ParseException: extraneous input ';' > expecting (line 2, pos 53) > 2023-10-18T08:25:31.8635951ZĀ > 2023-10-18T08:25:31.8636595Z == SQL == > 2023-10-18T08:25:31.8636881ZĀ > 2023-10-18T08:25:31.8637365Z select * from h172 where day='2023-10-12' and > hour=11; > 2023-10-18T08:25:31.8638064Z > -^^^ > 2023-10-18T08:25:31.8639056Z Ā at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241) > 2023-10-18T08:25:31.8640426Z Ā at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117) > 2023-10-18T08:25:31.8641945Z Ā at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) > 2023-10-18T08:25:31.8643243Z Ā at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69) > 2023-10-18T08:25:31.8644939Z Ā at > org.apache.spark.sql.hudi.parser.HoodieSpark2ExtendedSqlParser$$anonfun$parsePlan$1.apply(HoodieSpark2ExtendedSqlParser.scala:45) > 2023-10-18T08:25:31.8646914Z Ā at > org.apache.spark.sql.hudi.parser.HoodieSpark2ExtendedSqlParser$$anonfun$parsePlan$1.apply(HoodieSpark2ExtendedSqlParser.scala:42) > 2023-10-18T08:25:31.8648770Z Ā at > org.apache.spark.sql.hudi.parser.HoodieSpark2ExtendedSqlParser.parse(HoodieSpark2ExtendedSqlParser.scala:80) > 2023-10-18T08:25:31.8650554Z Ā at > 
org.apache.spark.sql.hudi.parser.HoodieSpark2ExtendedSqlParser.parsePlan(HoodieSpark2ExtendedSqlParser.scala:42) > 2023-10-18T08:25:31.8652258Z Ā at > org.apache.spark.sql.parser.HoodieCommonSqlParser$$anonfun$parsePlan$1.apply(HoodieCommonSqlParser.scala:43) > 2023-10-18T08:25:31.8653871Z Ā at > org.apache.spark.sql.parser.HoodieCommonSqlParser$$anonfun$parsePlan$1.apply(HoodieCommonSqlParser.scala:40) > 2023-10-18T08:25:31.8654880Z Ā ... {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6482] Supports new compaction strategy DayBasedAndBoundedIOCompactionStrategy [hudi]
yihua commented on PR #9126: URL: https://github.com/apache/hudi/pull/9126#issuecomment-1775900919 @ksmou do you still plan to revise this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Resolved] (HUDI-6972) Fix redirection to individual config links
[ https://issues.apache.org/jira/browse/HUDI-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavani Sudha resolved HUDI-6972. - > Fix redirection to individual config links > -- > > Key: HUDI-6972 > URL: https://issues.apache.org/jira/browse/HUDI-6972 > Project: Apache Hudi > Issue Type: Task >Reporter: Bhavani Sudha >Assignee: Bhavani Sudha >Priority: Minor > Labels: docs, pull-request-available > > Currently, the links for configs are not working as expected. The top of the > page is rendered instead of the actual config section. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch asf-site updated: [HUDI-6972][DOCS] Fix config link redirection (#9908)
This is an automated email from the ASF dual-hosted git repository. bhavanisudha pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 729dac981de [HUDI-6972][DOCS] Fix config link redirection (#9908) 729dac981de is described below commit 729dac981deaca25e0c4fcce98eab18c0f6ac5d7 Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com> AuthorDate: Mon Oct 23 12:32:37 2023 -0700 [HUDI-6972][DOCS] Fix config link redirection (#9908) --- website/src/theme/DocPage/index.js | 23 ++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/website/src/theme/DocPage/index.js b/website/src/theme/DocPage/index.js index 552adcfa357..a8b5bf2ea36 100644 --- a/website/src/theme/DocPage/index.js +++ b/website/src/theme/DocPage/index.js @@ -4,7 +4,7 @@ * This source code is licensed under the MIT license found in the * LICENSE file in the root directory of this source tree. */ -import React, {useState, useCallback} from 'react'; +import React, {useState, useCallback, useEffect} from 'react'; import {MDXProvider} from '@mdx-js/react'; import renderRoutes from '@docusaurus/renderRoutes'; import Layout from '@theme/Layout'; @@ -44,6 +44,27 @@ function DocPageContent({ setHiddenSidebarContainer((value) => !value); }, [hiddenSidebar]); + if(typeof window !== 'undefined') { + useEffect(() => { + const timeout = setTimeout(() => { +const [_, hashValue] = window.location.href.split('#'); + +const element = document.querySelectorAll(`[href="#${hashValue}"]`)?.[0]; +if(element) { + const headerOffset = 90; + const elementPosition = element.getBoundingClientRect().top; + const offsetPosition = elementPosition + window.pageYOffset - headerOffset; + window.scrollTo({ +top: offsetPosition + }); +} + }, 100); + + return () => { +clearTimeout(timeout); + } + }, [window.location.href]); + } return (
Re: [PR] [HUDI-6972][DOCS] Fix config link redirection [hudi]
bhasudha merged PR #9908: URL: https://github.com/apache/hudi/pull/9908 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6967] Add clearJobStatus api in HoodieEngineContext [hudi]
yihua commented on code in PR #9899: URL: https://github.com/apache/hudi/pull/9899#discussion_r1369141606 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java: ## @@ -215,6 +219,7 @@ protected List> loadColumnRangesFromMetaIndex( String keyField = hoodieTable.getMetaClient().getTableConfig().getRecordKeyFieldProp(); List> baseFilesForAllPartitions = HoodieIndexUtils.getLatestBaseFilesForAllPartitions(partitions, context, hoodieTable); +context.clearJobStatus(); Review Comment: This shouldn't be added. Key range loading has not finished here. ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java: ## @@ -758,7 +762,6 @@ protected void reconcileAgainstMarkers(HoodieEngineContext context, } // Now delete partially written files -context.setJobStatus(this.getClass().getSimpleName(), "Delete all partially written files: " + config.getTableName()); Review Comment: Why delete this? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseWriteHelper.java: ## @@ -61,6 +61,7 @@ public HoodieWriteMetadata write(String instantTime, // perform index loop up to get existing location of records context.setJobStatus(this.getClass().getSimpleName(), "Tagging: " + table.getConfig().getTableName()); taggedRecords = tag(dedupedRecords, context, table); +context.clearJobStatus(); Review Comment: If lazy execution happens afterwards, the job status may not be properly populated. Have you verified all places where this could happen?
## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java: ## @@ -111,44 +111,48 @@ public BaseSparkCommitActionExecutor(HoodieEngineContext context, private HoodieData> clusteringHandleUpdate(HoodieData> inputRecords) { context.setJobStatus(this.getClass().getSimpleName(), "Handling updates which are under clustering: " + config.getTableName()); -Set fileGroupsInPendingClustering = - table.getFileSystemView().getFileGroupsInPendingClustering().map(Pair::getKey).collect(Collectors.toSet()); -// Skip processing if there is no inflight clustering -if (fileGroupsInPendingClustering.isEmpty()) { - return inputRecords; -} +try { + Set fileGroupsInPendingClustering = + table.getFileSystemView().getFileGroupsInPendingClustering().map(Pair::getKey).collect(Collectors.toSet()); + // Skip processing if there is no inflight clustering + if (fileGroupsInPendingClustering.isEmpty()) { +return inputRecords; + } -UpdateStrategy>> updateStrategy = (UpdateStrategy>>) ReflectionUtils -.loadClass(config.getClusteringUpdatesStrategyClass(), new Class[] {HoodieEngineContext.class, HoodieTable.class, Set.class}, -this.context, table, fileGroupsInPendingClustering); -// For SparkAllowUpdateStrategy with rollback pending clustering as false, need not handle -// the file group intersection between current ingestion and pending clustering file groups. -// This will be handled at the conflict resolution strategy. 
-if (updateStrategy instanceof SparkAllowUpdateStrategy && !config.isRollbackPendingClustering()) { - return inputRecords; -} -Pair>, Set> recordsAndPendingClusteringFileGroups = -updateStrategy.handleUpdate(inputRecords); + UpdateStrategy>> updateStrategy = (UpdateStrategy>>) ReflectionUtils + .loadClass(config.getClusteringUpdatesStrategyClass(), new Class[] {HoodieEngineContext.class, HoodieTable.class, Set.class}, + this.context, table, fileGroupsInPendingClustering); + // For SparkAllowUpdateStrategy with rollback pending clustering as false, need not handle + // the file group intersection between current ingestion and pending clustering file groups. + // This will be handled at the conflict resolution strategy. + if (updateStrategy instanceof SparkAllowUpdateStrategy && !config.isRollbackPendingClustering()) { +return inputRecords; + } + Pair>, Set> recordsAndPendingClusteringFileGroups = + updateStrategy.handleUpdate(inputRecords); -Set fileGroupsWithUpdatesAndPendingClustering = recordsAndPendingClusteringFileGroups.getRight(); -if (fileGroupsWithUpdatesAndPendingClustering.isEmpty()) { + Set fileGroupsWithUpdatesAndPendingClustering = recordsAndPendingClusteringFileGroups.getRight(); + if (fileGroupsWithUpdatesAndPendingClustering.isEmpty()) { +return recordsAndPendingClusteringFileGroups.getLeft(); + } + // there are file groups pending clustering and receiving updates, so rollback the pending clustering instants + // there could be race condition, for example, if the clustering completes aft
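The lazy-execution concern raised in this review can be sketched outside of Spark with a tiny stand-in (illustrative only; `LazyStatusSketch` and the plain `AtomicReference` status holder are hypothetical, not the `HoodieEngineContext` API): because a lazy transformation only runs when the action fires, a `clearJobStatus()` issued right after `tag(...)` returns can wipe the status before the tagged records are actually computed.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Illustrative sketch (not Hudi code) of the lazy-execution hazard: the
// "transformation" below is deferred, so clearing the status too early means
// the action observes an empty status when it finally executes.
public class LazyStatusSketch {
    public static String statusSeenByAction(AtomicReference<String> status) {
        status.set("Tagging: my_table");          // setJobStatus(...) analogue
        Supplier<String> lazyAction = status::get; // lazy: nothing runs yet
        status.set("");                            // clearJobStatus() called too early
        return lazyAction.get();                   // action executes now -> sees ""
    }
}
```

This is why each call site needs verification that no lazy evaluation happens after the status is cleared.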
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1369126255 ## hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieBaseFileGroupRecordBuffer.java: ## @@ -126,12 +128,13 @@ protected Option doProcessNextDataRecord(T record, // Merge and store the combined record // Note that the incoming `record` is from an older commit, so it should be put as // the `older` in the merge API + HoodieRecord combinedRecord = (HoodieRecord) recordMerger.merge( - readerContext.constructHoodieRecord(Option.of(record), metadata, readerSchema), - readerSchema, + readerContext.constructHoodieRecord(Option.of(record), metadata), + (Schema) metadata.get(INTERNAL_META_SCHEMA), readerContext.constructHoodieRecord( - existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight(), readerSchema), - readerSchema, + existingRecordMetadataPair.getLeft(), existingRecordMetadataPair.getRight()), + (Schema) existingRecordMetadataPair.getRight().get(INTERNAL_META_SCHEMA), payloadProps).get().getLeft(); Review Comment: When there are more log files, partial updates, and schema evolution, `(Schema) metadata.get(INTERNAL_META_SCHEMA)` can be different across record keys. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
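The per-key schema point above can be illustrated with a minimal stand-in (plain maps and string "schemas"; `mergeSchemaFor` is hypothetical, not the reader API): each record resolves its merge schema from its own metadata map, so two record keys touched by different log files can legitimately carry different schemas.

```java
import java.util.Map;

// Illustrative sketch: the merge schema is looked up per record from its
// metadata map (the INTERNAL_META_SCHEMA analogue here is the "schema" key),
// rather than resolved once for the whole reader.
public class PerRecordSchemaSketch {
    public static String mergeSchemaFor(Map<String, Object> metadata, String fallback) {
        Object schema = metadata.get("schema"); // schema attached to this record, if any
        return schema != null ? (String) schema : fallback;
    }
}
```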
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1369119288 ## hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/BaseSparkInternalRowReaderContext.java: ## @@ -94,17 +94,18 @@ public Comparable getOrderingValue(Option rowOption, @Override public HoodieRecord constructHoodieRecord(Option rowOption, - Map metadataMap, - Schema schema) { + Map metadataMap) { if (!rowOption.isPresent()) { return new HoodieEmptyRecord<>( new HoodieKey((String) metadataMap.get(INTERNAL_META_RECORD_KEY), (String) metadataMap.get(INTERNAL_META_PARTITION_PATH)), HoodieRecord.HoodieRecordType.SPARK); } +Schema schema = (Schema) metadataMap.get(INTERNAL_META_SCHEMA); InternalRow row = rowOption.get(); -return new HoodieSparkRecord(row, HoodieInternalRowUtils.getCachedSchema(schema)); +boolean isPartial = (boolean) metadataMap.getOrDefault(INTERNAL_META_IS_PARTIAL, false); +return new HoodieSparkRecord(row, HoodieInternalRowUtils.getCachedSchema(schema), isPartial); Review Comment: Reason mentioned above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6801] Implement merging partial updates from log files for MOR tables [hudi]
yihua commented on code in PR #9883: URL: https://github.com/apache/hudi/pull/9883#discussion_r1369118091 ## hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecord.java: ## @@ -195,6 +206,10 @@ public HoodieKey getKey() { return key; } + public boolean isPartial() { +return isPartial; Review Comment: `isPartial` is determined at the commit or write batch level, but for record merging to work in the current implementation and maintain the layering, it's better to have the flag at the record level. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1369060242 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala: ## @@ -411,10 +414,14 @@ object ExpressionPayload { parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA)) } - private def getWriterSchema(props: Properties): Schema = { - ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key), - s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property") -parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key)) + private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): Schema = { +if (isPartialUpdate) { + parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key)) Review Comment: In this PR, for updates in MOR tables, after processing the Spark SQL MERGE INTO statement, the writer gets the updates with partial schema and pass them to the `HoodieAppendHandle`. Regardless, the original intent to include `FULL_SCHEMA` is for merging partial updates at the reader side. If we assume that values for a non-updated column should be either existing value (column in the existing schema) or null (new column in the evolved schema) in merging partial updates, the `FULL_SCHEMA` may not be stored in the log block header. 
See the following examples: ``` Example 1: base file: schema (col1, col2) (full schema at this instant: (col1, col2)) log 1: partial, schema (col2, col3) (full schema at this instant: (col1, col2, col3)) after log merging: schema (col1, col2, col3) (col1 values from base file, col2, col3 values from log1 for overwrite with latest) Example 2: base file: schema (col1, col2) (full schema at this instant: (col1, col2)) log 1: partial, schema (col2, col3) (full schema at this instant: (col1, col2, col3, col4)) after log merging: schema (col1, col2, col3) project to full schema: (col1, col2, col3) -> (col1, col2, col3, col4), with nulls in col4 (col1 values from base file, col2, col3 values from log1 for overwrite with latest, col4 has nulls) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
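Example 1 above can be sketched with plain maps under the stated assumption (a partial update overwrites only the columns it carries, overwrite-with-latest per column); `PartialMergeSketch` is illustrative, not the Hudi merger API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of merging a partial log record into an existing record:
// columns present in the partial update win; untouched columns keep the
// existing value from the base file.
public class PartialMergeSketch {
    public static Map<String, Object> merge(Map<String, Object> older,
                                            Map<String, Object> partialNewer) {
        Map<String, Object> merged = new LinkedHashMap<>(older);
        merged.putAll(partialNewer); // overwrite-with-latest, column by column
        return merged;
    }
}
```

Projecting the merged result to a wider full schema (Example 2) would then only add null-valued columns such as col4.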
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1369039030 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java: ## @@ -652,6 +660,16 @@ private static Map getUpdatedHeader(Map
Re: [PR] [HUDI-6972][DOCS] Fix config link redirection [hudi]
bhasudha commented on PR #9908: URL: https://github.com/apache/hudi/pull/9908#issuecomment-1775287986 Tested two things locally: 1. Within the configs page, clicking any config link renders it properly (screenshot below, taken after clicking). 2. Redirection to specific configs from other pages. Cannot show that test here since it would need a video screen capture. Screenshot for test 1: https://github.com/apache/hudi/assets/2179254/ab1d1fad-110a-4316-8452-5c125c80
[jira] [Updated] (HUDI-6972) Fix redirection to individual config links
[ https://issues.apache.org/jira/browse/HUDI-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6972: - Labels: docs pull-request-available (was: docs) > Fix redirection to individual config links > -- > > Key: HUDI-6972 > URL: https://issues.apache.org/jira/browse/HUDI-6972 > Project: Apache Hudi > Issue Type: Task >Reporter: Bhavani Sudha >Assignee: Bhavani Sudha >Priority: Minor > Labels: docs, pull-request-available > > Currently, the links for configs are not working as expected. The top of the > page is rendered instead of the actual config section. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-6972][DOCS] Fix config link redirection [hudi]
bhasudha opened a new pull request, #9908: URL: https://github.com/apache/hudi/pull/9908 ### Change Logs website fixes to ensure config links are working as expected. ### Impact website changes ### Risk level (write none, low medium or high below) Low ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-6972) Fix redirection to individual config links
[ https://issues.apache.org/jira/browse/HUDI-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavani Sudha updated HUDI-6972: Status: In Progress (was: Open) > Fix redirection to individual config links > -- > > Key: HUDI-6972 > URL: https://issues.apache.org/jira/browse/HUDI-6972 > Project: Apache Hudi > Issue Type: Task >Reporter: Bhavani Sudha >Assignee: Bhavani Sudha >Priority: Minor > Labels: docs > > Currently, the links for configs are not working as expected. The top of the > page is rendered instead of the actual config section.
[jira] [Created] (HUDI-6972) Fix redirection to individual config links
Bhavani Sudha created HUDI-6972: --- Summary: Fix redirection to individual config links Key: HUDI-6972 URL: https://issues.apache.org/jira/browse/HUDI-6972 Project: Apache Hudi Issue Type: Task Reporter: Bhavani Sudha Currently, the links for configs are not working as expected. The top of the page is rendered instead of the actual config section.
[jira] [Updated] (HUDI-6972) Fix redirection to individual config links
[ https://issues.apache.org/jira/browse/HUDI-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavani Sudha updated HUDI-6972: Priority: Minor (was: Major) > Fix redirection to individual config links > -- > > Key: HUDI-6972 > URL: https://issues.apache.org/jira/browse/HUDI-6972 > Project: Apache Hudi > Issue Type: Task >Reporter: Bhavani Sudha >Priority: Minor > Labels: docs > > Currently, the links for configs are not working as expected. The top of the > page is rendered instead of the actual config section.
[jira] [Assigned] (HUDI-6972) Fix redirection to individual config links
[ https://issues.apache.org/jira/browse/HUDI-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavani Sudha reassigned HUDI-6972: --- Assignee: Bhavani Sudha > Fix redirection to individual config links > -- > > Key: HUDI-6972 > URL: https://issues.apache.org/jira/browse/HUDI-6972 > Project: Apache Hudi > Issue Type: Task >Reporter: Bhavani Sudha >Assignee: Bhavani Sudha >Priority: Minor > Labels: docs > > Currently, the links for configs are not working as expected. The top of the > page is rendered instead of the actual config section.
[jira] [Resolved] (HUDI-6112) Improve Doc generation to generate config tables for basic and advanced configs
[ https://issues.apache.org/jira/browse/HUDI-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhavani Sudha resolved HUDI-6112. - > Improve Doc generation to generate config tables for basic and advanced > configs > > > Key: HUDI-6112 > URL: https://issues.apache.org/jira/browse/HUDI-6112 > Project: Apache Hudi > Issue Type: Task >Reporter: Bhavani Sudha >Assignee: Bhavani Sudha >Priority: Major > Labels: pull-request-available > Fix For: 0.14.1 > > > The HoodieConfigDocGenerator will need to be modified such that: > * Each config group has two sections: basic configs and advanced configs > * Basic configs and Advanced configs are laid out in a table instead of > serially like today. > * Among each of these tables the required configs are bubbled up to the top > of the table and highlighted. > Add UI fixes to support a table layout
Re: [PR] [HUDI-6970] Stream read allows skipping archived commits [hudi]
hudi-bot commented on PR #9905: URL: https://github.com/apache/hudi/pull/9905#issuecomment-1775266175 ## CI report: * 31be10290de4f6bbc9ecd385202ee9c1d655eac2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20444) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6969] Add speed limit for stream read [hudi]
hudi-bot commented on PR #9904: URL: https://github.com/apache/hudi/pull/9904#issuecomment-1775266079 ## CI report: * 23af1b3753a523ffd717b7fb56a87501f3327adf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20443) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6971] OOM caused by configuring read.start_commit as earliest in stream reading [hudi]
hudi-bot commented on PR #9906: URL: https://github.com/apache/hudi/pull/9906#issuecomment-1775218968 ## CI report: * 28cd284a93f70e853ae3d9373fd01df3aa5c12cf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20445) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6866] When invalidate the table in the spark sql query cache, verify if the… [hudi]
zhangyue19921010 merged PR #9425: URL: https://github.com/apache/hudi/pull/9425
[hudi] branch master updated (bb8fc3e9f63 -> fe010bb1855)
This is an automated email from the ASF dual-hosted git repository. zhangyue19921010 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from bb8fc3e9f63 [HUDI-6929] Lazy loading dynamically for CompletionTimeQueryView (#9898) add fe010bb1855 When invalidate the table in the spark sql query cache, verify if the hive-async database exists (#9425) No new revisions were added by this update. Summary of changes: .../src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala| 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
Re: [I] [SUPPORT] AWS Athena query fail when compaction is scheduled for MOR table [hudi]
ad1happy2go commented on issue #9907: URL: https://github.com/apache/hudi/issues/9907#issuecomment-1775119269 @brightwon Interesting. Thanks for raising this. Looks like a regression. Can you provide the full stack trace?
Re: [I] [SUPPORT] AWS Athena query fail when compaction is scheduled for MOR table [hudi]
brightwon commented on issue #9907: URL: https://github.com/apache/hudi/issues/9907#issuecomment-1775080933 Now, I downgraded my Hudi version to 0.13.1 and the error no longer occurs.
[I] [SUPPORT] AWS Athena query fail when compaction is scheduled for MOR table [hudi]
brightwon opened a new issue, #9907: URL: https://github.com/apache/hudi/issues/9907 I'm using hudi 0.14.0 with flink 1.16.1 to store data from kafka to s3. but Athena(Engine 3) query to MOR table is not working because of this error. ``` Error running query: HIVE_UNKNOWN_ERROR: io.trino.plugin.hive.s3.TrinoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: ***; S3 Extended Request ID: ***; Proxy: null), S3 Extended Request ID: *** (Bucket: mybucket, Key: mytable/.hoodie/.aux/20231014095517882.compaction.requested) ``` This error occurs if compaction is scheduled. After compaction is complete, query is working. Here's flink hudi option (Java) ``` flinkHudiOptions.put(FlinkOptions.PATH.key(), basePath); flinkHudiOptions.put(FlinkOptions.TABLE_TYPE.key(), HoodieTableType.MERGE_ON_READ.name()); flinkHudiOptions.put(FlinkOptions.OPERATION.key(), WriteOperationType.UPSERT.name()); flinkHudiOptions.put(FlinkOptions.PRECOMBINE_FIELD.key(), "event_time"); flinkHudiOptions.put(FlinkOptions.KEYGEN_CLASS_NAME.key(), "org.apache.hudi.keygen.ComplexKeyGenerator"); flinkHudiOptions.put(FlinkOptions.COMPACTION_ASYNC_ENABLED.key(), "true"); flinkHudiOptions.put(FlinkOptions.COMPACTION_TRIGGER_STRATEGY.key(), FlinkOptions.NUM_COMMITS); flinkHudiOptions.put(FlinkOptions.COMPACTION_DELTA_COMMITS.key(), "5"); flinkHudiOptions.put(FlinkOptions.COMPACTION_MAX_MEMORY.key(), "1024"); flinkHudiOptions.put(FlinkOptions.METADATA_ENABLED.key(), "true"); flinkHudiOptions.put(HoodieMetadataConfig.ASYNC_INDEX_ENABLE.key(), "true"); flinkHudiOptions.put(HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key(), "true"); flinkHudiOptions.put(HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key(), WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL.name()); flinkHudiOptions.put(HoodieLockConfig.LOCK_PROVIDER_CLASS_NAME.key(), 
"org.apache.hudi.client.transaction.lock.InProcessLockProvider"); flinkHudiOptions.put(FlinkOptions.CLEAN_ASYNC_ENABLED.key(), "true"); flinkHudiOptions.put(FlinkOptions.CLEAN_POLICY.key(), HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS.name()); flinkHudiOptions.put(FlinkOptions.CLEAN_RETAIN_HOURS.key(), "24"); ``` My flink application works on flink-operator's FlinkDeployment (on AWS EKS). I ran the hive-sync command once in EMR 6.10.0 (Hudi 0.12.2-amzn-0 version) for easy use of Glue MetaStore. **To Reproduce** Steps to reproduce the behavior: 1. run the flink application with the above options 2. run hive-sync once on EMR to enable hive-sync 3. run an Athena query when compaction is scheduled **Expected behavior** A clear and concise description of what you expected to happen. **Environment Description** * Hudi version : 0.14.0 * Flink version : 1.16.1 * Hive version : * Hadoop version : * Storage (HDFS/S3/GCS..) : S3 * Running on Docker? (yes/no) : no
Re: [I] [SUPPORT] HoodieCompaction with schema parse NullPointerException [hudi]
ad1happy2go commented on issue #9902: URL: https://github.com/apache/hudi/issues/9902#issuecomment-1775072273 @zyclove Thanks for raising this. Looks like compaction is throwing out this Exception with those schema configurations. I will try to triage this. Can you share some sample data or a sample script that can help us reproduce this issue? I tried to reproduce using the code below and see compaction happening fine - ``` SET hoodie.schema.on.read.enable=true; SET hoodie.datasource.write.reconcile.schema=true; SET hoodie.avro.schema.validate=true; SET hoodie.datasource.write.new.columns.nullable=true; CREATE TABLE hudi_table ( ts BIGINT, uuid STRING, rider STRING, driver STRING, fare DECIMAL(10,4), city STRING ) USING HUDI tblproperties ( type = 'mor', primaryKey = 'uuid', preCombineField = 'ts' ,hoodie.datasource.write.new.columns.nullable = 'true' ,hoodie.avro.schema.validate = 'true' ,hoodie.schema.on.read.enable = 'true' ,hoodie.datasource.write.reconcile.schema = 'true' ) PARTITIONED BY (city); -- Tried multiple insert commands with multiple values and confirmed compaction is happening fine. INSERT INTO hudi_table VALUES (1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',11.0001,'san_francisco'), (1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',11.0001 ,'san_francisco'); ```
Re: [I] [BUG]hudi cli command with Wrong FS error [hudi]
ad1happy2go commented on issue #9903: URL: https://github.com/apache/hudi/issues/9903#issuecomment-1775045166 @zyclove Are you able to run other CLI commands fine, just to check whether the S3 connection works from the CLI?
Re: [I] [SUPPORT] ERROR BaseSparkCommitActionExecutor: Error upserting bucketType UPDATE for partition :13 [hudi]
zyclove commented on issue #9119: URL: https://github.com/apache/hudi/issues/9119#issuecomment-1775028059 @danny0405 This problem still exists in version 0.14 too. How can it be solved?
Re: [I] [SUPPORT]spark-sql MOR query error with org.apache.avro.SchemaParseException: Cannot parse schema [hudi]
zyclove commented on issue #9016: URL: https://github.com/apache/hudi/issues/9016#issuecomment-1775022886 @ad1happy2go This problem still exists in version 0.14. How can it be solved?
Re: [PR] [HUDI-6971] OOM caused by configuring read.start_commit as earliest in stream reading [hudi]
hudi-bot commented on PR #9906: URL: https://github.com/apache/hudi/pull/9906#issuecomment-1775007181 ## CI report: * 28cd284a93f70e853ae3d9373fd01df3aa5c12cf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20445) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6971] OOM caused by configuring read.start_commit as earliest in stream reading [hudi]
hudi-bot commented on PR #9906: URL: https://github.com/apache/hudi/pull/9906#issuecomment-1774940343 ## CI report: * 28cd284a93f70e853ae3d9373fd01df3aa5c12cf UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6962] Fix the conflicts resolution for bulk insert under NB-CC [hudi]
hudi-bot commented on PR #9896: URL: https://github.com/apache/hudi/pull/9896#issuecomment-1774940114 ## CI report: * 9ab01f405b75097cb3d1c610d7e47c0eed92b10d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20442) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6969] Add speed limit for stream read [hudi]
hudi-bot commented on PR #9904: URL: https://github.com/apache/hudi/pull/9904#issuecomment-1774940196 ## CI report: * 23af1b3753a523ffd717b7fb56a87501f3327adf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20443) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6970] Stream read allows skipping archived commits [hudi]
hudi-bot commented on PR #9905: URL: https://github.com/apache/hudi/pull/9905#issuecomment-1774940259 ## CI report: * 31be10290de4f6bbc9ecd385202ee9c1d655eac2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20444) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6969] Add speed limit for stream read [hudi]
hudi-bot commented on PR #9904: URL: https://github.com/apache/hudi/pull/9904#issuecomment-1774925985 ## CI report: * 23af1b3753a523ffd717b7fb56a87501f3327adf UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6970] Stream read allows skipping archived commits [hudi]
hudi-bot commented on PR #9905: URL: https://github.com/apache/hudi/pull/9905#issuecomment-1774926097 ## CI report: * 31be10290de4f6bbc9ecd385202ee9c1d655eac2 UNKNOWN
[PR] oom [hudi]
zhuanshenbsj1 opened a new pull request, #9906: URL: https://github.com/apache/hudi/pull/9906

### Change Logs

1. When the config read.start_commit is set to earliest, https://github.com/apache/hudi/blob/bb8fc3e9f632a1fc3647fda63d482849355df2b7/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java#L410-L428 the method getInstantRange returns null, https://github.com/apache/hudi/blob/bb8fc3e9f632a1fc3647fda63d482849355df2b7/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java#L289-L298 which then causes all partitions and files to be loaded; this is unreasonable.
2. Because developers are accustomed to consuming from Kafka, they often prefer to set the consumption starting point to earliest.

### Impact

_Describe any public API or user-facing feature change or any performance impact._

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
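The gist of the issue above can be sketched in plain Java. This is an illustrative stand-in only, not the actual Hudi code: the names `Range` and `resolveRange` are invented for the example. The idea is that instead of returning null for `earliest` (which downstream code treats as "load every partition and file"), the source could resolve `earliest` to the first instant on the active timeline and return an explicit bounded range:

```java
import java.util.Optional;

public class InstantRangeSketch {
    // Minimal stand-in for an instant range: a closed interval of commit
    // timestamps. Names here are illustrative, not the real Hudi API.
    static class Range {
        final String start;
        final String end;
        Range(String start, String end) { this.start = start; this.end = end; }
    }

    // Resolve "earliest" to the first instant on the active timeline so a
    // bounded range is always returned, never null.
    static Optional<Range> resolveRange(String startCommit, String earliestActive, String latestActive) {
        String start = "earliest".equals(startCommit) ? earliestActive : startCommit;
        return Optional.of(new Range(start, latestActive));
    }

    public static void main(String[] args) {
        Range r = resolveRange("earliest", "20231001000000", "20231020000000").get();
        System.out.println(r.start + " -> " + r.end); // 20231001000000 -> 20231020000000
    }
}
```

With a bounded range, the file-listing code can always prune by instant time rather than falling back to a full table scan.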
[PR] [HUDI-6970] Stream read allows skipping archived commits [hudi]
zhuanshenbsj1 opened a new pull request, #9905: URL: https://github.com/apache/hudi/pull/9905

### Change Logs

In the current code, commits that have already been archived are still read during stream read. In most scenarios cleaning runs before archiving (compaction being the exception), so reading archived metadata is generally unnecessary. Moreover, if the start commit is set too early, a large number of unneeded commits can be loaded, resulting in OOM.
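The skip logic described in this PR can be approximated with a small filter. This is a hedged sketch, not the PR's actual implementation; the method name `dropArchived` is invented. It relies on the fact that Hudi instant timestamps sort lexicographically, so "archived" here means any requested commit older than the earliest instant still on the active timeline:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SkipArchivedSketch {
    // Keep only the commits that are still on the active timeline; anything
    // older than earliestActiveInstant has been archived and is skipped.
    static List<String> dropArchived(List<String> requestedCommits, String earliestActiveInstant) {
        return requestedCommits.stream()
                .filter(ts -> ts.compareTo(earliestActiveInstant) >= 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> requested = Arrays.asList("001", "005", "009");
        // Commits before "005" were archived; the stream read skips them.
        System.out.println(dropArchived(requested, "005")); // [005, 009]
    }
}
```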
[PR] [HUDI-6969] Add speed limit for stream read [hudi]
zhuanshenbsj1 opened a new pull request, #9904: URL: https://github.com/apache/hudi/pull/9904

### Change Logs

Currently there is no speed limit for stream read: regardless of the instant ranges, everything is read at once, which can easily cause GC pressure in the monitor operator. This PR adds a configuration to limit the number of commits read per round in stream read mode.
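A per-round cap like the one this PR proposes can be sketched as follows. The method and config names are illustrative assumptions; the option actually added by the PR may be named differently:

```java
import java.util.Arrays;
import java.util.List;

public class ReadLimitSketch {
    // Cap the number of commit instants consumed per stream-read round.
    // A non-positive limit means "no limit", matching the old behavior.
    static List<String> capCommitsPerRound(List<String> pendingCommits, int maxCommitsPerRound) {
        if (maxCommitsPerRound <= 0 || pendingCommits.size() <= maxCommitsPerRound) {
            return pendingCommits; // no limit configured, or already within it
        }
        // Consume only the oldest maxCommitsPerRound commits this round;
        // the remainder is picked up by later monitor rounds.
        return pendingCommits.subList(0, maxCommitsPerRound);
    }

    public static void main(String[] args) {
        List<String> pending = Arrays.asList("001", "002", "003", "004", "005");
        System.out.println(capCommitsPerRound(pending, 2)); // [001, 002]
    }
}
```

Spreading the backlog over several rounds keeps each monitor cycle's working set small, which is what prevents the GC pressure mentioned above.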
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
codope commented on code in PR #9761: URL: https://github.com/apache/hudi/pull/9761#discussion_r1368419186 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMultiFileFormatRelation.scala: ## @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hudi.HoodieBaseRelation.projectReader +import org.apache.hudi.HoodieConversionUtils.toScalaOption +import org.apache.hudi.HoodieMultiFileFormatRelation.{createPartitionedFile, inferFileFormat} +import org.apache.hudi.common.fs.FSUtils +import org.apache.hudi.common.model.{FileSlice, HoodieFileFormat, HoodieLogFile} +import org.apache.hudi.common.table.HoodieTableMetaClient +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.SQLContext +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.Expression +import org.apache.spark.sql.execution.datasources.{FilePartition, PartitionedFile} +import org.apache.spark.sql.sources.Filter +import org.apache.spark.sql.types.StructType + +import scala.jdk.CollectionConverters.asScalaIteratorConverter + +/** + * Base split for all Hoodie multi-file format relations. + */ +case class HoodieMultiFileFormatSplit(baseFile: Option[PartitionedFile], + logFiles: List[HoodieLogFile]) extends HoodieFileSplit + +/** + * Base relation to handle table with multiple base file formats. + */ +abstract class BaseHoodieMultiFileFormatRelation(override val sqlContext: SQLContext, + override val metaClient: HoodieTableMetaClient, Review Comment: Discussed offline. We think that implementing a new `FileFormat` which works with multiple base file formats should be possible. So, i'm going to attempt that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
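The diff above imports an `inferFileFormat` helper. As a rough illustration of what such a helper could look like (this is a hypothetical sketch; the real method in the PR may behave differently and returns Hudi's `HoodieFileFormat` enum rather than a string), the base file format can be derived from the file extension:

```java
public class FileFormatSketch {
    // Hypothetical sketch: map a base file's extension to its format name.
    // The real Hudi helper may differ in signature and supported formats.
    static String inferFileFormat(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot >= 0 ? fileName.substring(dot + 1).toLowerCase() : "";
        switch (ext) {
            case "parquet": return "PARQUET";
            case "orc":     return "ORC";
            case "hfile":   return "HFILE";
            default: throw new IllegalArgumentException("Unsupported base file format: " + fileName);
        }
    }

    public static void main(String[] args) {
        System.out.println(inferFileFormat("file-slice-0001.parquet")); // PARQUET
        System.out.println(inferFileFormat("file-slice-0002.orc"));     // ORC
    }
}
```

A multi-format `FileFormat`, as discussed in the review comment, would dispatch to the matching per-format reader based on this kind of inference for each file slice.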
[jira] [Created] (HUDI-6971) OOM caused by configuring read.start_commit as earliest in stream reading
zhuanshenbsj1 created HUDI-6971: --- Summary: OOM caused by configuring read.start_commit as earliest in stream reading Key: HUDI-6971 URL: https://issues.apache.org/jira/browse/HUDI-6971 Project: Apache Hudi Issue Type: Improvement Components: reader-core Reporter: zhuanshenbsj1 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6970) Stream read allows skipping archived commits
zhuanshenbsj1 created HUDI-6970: --- Summary: Stream read allows skipping archived commits Key: HUDI-6970 URL: https://issues.apache.org/jira/browse/HUDI-6970 Project: Apache Hudi Issue Type: Improvement Components: reader-core Reporter: zhuanshenbsj1 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6969) Add speed limit for stream read
zhuanshenbsj1 created HUDI-6969: --- Summary: Add speed limit for stream read Key: HUDI-6969 URL: https://issues.apache.org/jira/browse/HUDI-6969 Project: Apache Hudi Issue Type: Improvement Components: reader-core Reporter: zhuanshenbsj1 -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6821] Support multiple base file formats in Hudi table [hudi]
hudi-bot commented on PR #9761: URL: https://github.com/apache/hudi/pull/9761#issuecomment-1774816678 ## CI report: * 89e72e0fdf9229f34d23ee7245676eaa9a323418 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20440)