[GitHub] [hudi] stream2000 commented on a diff in pull request #9350: [HUDI-2141] Support flink read metrics
stream2000 commented on code in PR #9350:
URL: https://github.com/apache/hudi/pull/9350#discussion_r1282707245

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadOperator.java:

@@ -168,6 +174,8 @@ private void processSplits() throws IOException {
       currentSplitState = SplitState.IDLE;
     }
+    readMetrics.setSplitLatestCommit(split.getLatestCommit());
+

Review Comment: The metrics will be updated for every new split.

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java:

@@ -262,9 +268,16 @@ public void snapshotState(FunctionSnapshotContext context) throws Exception {
     this.instantState.clear();
     if (this.issuedInstant != null) {
       this.instantState.add(this.issuedInstant);
+      this.readMetrics.setIssuedInstant(this.issuedInstant);
     }
     if (this.issuedOffset != null) {

Review Comment: The metrics will be updated for each checkpoint.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[GitHub] [hudi] PhantomHunt commented on issue #9344: [SUPPORT] Getting error when writing to different HUDI tables in different threads in same job
PhantomHunt commented on issue #9344: URL: https://github.com/apache/hudi/issues/9344#issuecomment-1663332820 We have a job running on an EC2 Ubuntu machine that upserts data into 2 Hudi tables in parallel, in 2 threads (using ThreadPoolExecutor from Python's concurrent library) at a time. There are 17 tables in total. When the upsert into any one of the tables finishes, ThreadPoolExecutor takes in another table to process in the available free thread. The job terminates when the upserts into all 17 tables finish. This job runs every 5 mins via cronjob.
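The scheduling pattern described above (a fixed pool of two workers draining a queue of 17 tables, the job ending when all finish) can be sketched with Java's `java.util.concurrent`; the issue uses Python's `ThreadPoolExecutor`, but the mechanics are the same. The table names and the upsert body below are placeholders, not Hudi API calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: 2 worker threads drain a list of 17 tables; whenever an
// upsert finishes, the freed worker picks up the next table, and the job ends
// once every table has been processed.
public class ParallelUpsertSketch {

    // Stand-in for the real Hudi upsert of one table.
    static void upsertTable(String table, AtomicInteger completed) {
        completed.incrementAndGet();
    }

    /** Runs all upserts on a fixed-size pool and returns how many completed. */
    public static int runAll(List<String> tables, int threads) throws InterruptedException {
        AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String table : tables) {
            pool.submit(() -> upsertTable(table, completed));
        }
        pool.shutdown();                            // no new tables accepted
        pool.awaitTermination(1, TimeUnit.MINUTES); // job terminates when all finish
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> tables = new ArrayList<>();
        for (int i = 1; i <= 17; i++) {
            tables.add("hudi_table_" + i);          // hypothetical table names
        }
        System.out.println(runAll(tables, 2));      // prints 17
    }
}
```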
[jira] [Created] (HUDI-6634) Add support for schemaProvider in CloudObjectsSelectorCommon
Harshal Patil created HUDI-6634:
---

Summary: Add support for schemaProvider in CloudObjectsSelectorCommon
Key: HUDI-6634
URL: https://issues.apache.org/jira/browse/HUDI-6634
Project: Apache Hudi
Issue Type: Improvement
Reporter: Harshal Patil

There should be a way to provide a schema while loading files from CloudObjects.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9347: Upgrade aws java sdk to v2
hudi-bot commented on PR #9347: URL: https://github.com/apache/hudi/pull/9347#issuecomment-1663324203 ## CI report: * d2360a5a7de655991202680013d20268ce325666 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19016) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9209: [HUDI-6539] New LSM tree style archived timeline
hudi-bot commented on PR #9209: URL: https://github.com/apache/hudi/pull/9209#issuecomment-1663323825 ## CI report: * 8f2dc4ec3e26f1908ae5d15f194bf70ca7dab27e UNKNOWN * 4ade37c10c908c0422915aaa489208e6ee62bb0d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18997) * 57c1b843608a9b63d143ead5dd5168613bb13969 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19027)
[GitHub] [hudi] danny0405 commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.
danny0405 commented on issue #8892: URL: https://github.com/apache/hudi/issues/8892#issuecomment-1663319058 @voonhous It would be great if you could give this issue higher priority; there are still 2 days until the 0.14.0 release code freeze.
[GitHub] [hudi] hudi-bot commented on pull request #9330: [HUDI-6622] Reuse the table config from HoodieTableMetaClient in the …
hudi-bot commented on PR #9330: URL: https://github.com/apache/hudi/pull/9330#issuecomment-1663317943 ## CI report: * 53e9bab71f8766ff092f7109abf6232098e0084c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18990) * 38aec912160b7531914cd4c07ea8317606f34616 UNKNOWN * d6d32a693c455830a31b883915e9940fa309c77f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19026)
[GitHub] [hudi] hudi-bot commented on pull request #9209: [HUDI-6539] New LSM tree style archived timeline
hudi-bot commented on PR #9209: URL: https://github.com/apache/hudi/pull/9209#issuecomment-1663317607 ## CI report: * 8f2dc4ec3e26f1908ae5d15f194bf70ca7dab27e UNKNOWN * 4ade37c10c908c0422915aaa489208e6ee62bb0d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18997) * 57c1b843608a9b63d143ead5dd5168613bb13969 UNKNOWN
[GitHub] [hudi] danny0405 commented on a diff in pull request #9330: [HUDI-6622] Reuse the table config from HoodieTableMetaClient in the …
danny0405 commented on code in PR #9330:
URL: https://github.com/apache/hudi/pull/9330#discussion_r1282663938

## hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:

@@ -690,7 +690,7 @@ private static HoodieTableMetaClient newMetaClient(Configuration conf, String ba
     ? (HoodieTableMetaClient) ReflectionUtils.loadClass("org.apache.hudi.common.table.HoodieTableMetaserverClient",
         new Class[]{Configuration.class, String.class, ConsistencyGuardConfig.class, String.class, FileSystemRetryConfig.class, String.class, String.class, HoodieMetaserverConfig.class},
         conf, basePath, consistencyGuardConfig, recordMergerStrategy, fileSystemRetryConfig,
-        metaserverConfig.getDatabaseName(), metaserverConfig.getTableName(), metaserverConfig)
+        Option.of(metaserverConfig.getDatabaseName()), Option.of(metaserverConfig.getTableName()), metaserverConfig)
     : new HoodieTableMetaClient(conf, basePath,

Review Comment: How could the option be empty? Maybe you should use `Option.ofNullable`
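The distinction behind this review comment can be shown with `java.util.Optional`, used here as a stand-in for Hudi's `Option` (an assumption of this sketch): `of` throws a `NullPointerException` when given null, while `ofNullable` yields an empty option, so a nullable database/table name from the metaserver config would make `Option.of` fail at client construction.

```java
import java.util.Optional;

// Minimal demo of of() vs ofNullable() semantics, using the JDK's Optional as
// a stand-in for Hudi's Option (assumed to behave the same way here).
public class OptionNullDemo {

    // Wraps a possibly-null config value without risking an NPE.
    static String wrap(String maybeNull) {
        return Optional.ofNullable(maybeNull).orElse("<absent>");
    }

    public static void main(String[] args) {
        System.out.println(wrap("db1"));      // db1
        System.out.println(wrap(null));       // <absent>
        try {
            Optional.of((String) null);       // of() rejects null outright
        } catch (NullPointerException e) {
            System.out.println("Optional.of(null) throws NPE");
        }
    }
}
```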
[GitHub] [hudi] hudi-bot commented on pull request #9330: [HUDI-6622] Reuse the table config from HoodieTableMetaClient in the …
hudi-bot commented on PR #9330: URL: https://github.com/apache/hudi/pull/9330#issuecomment-1663312118 ## CI report: * 53e9bab71f8766ff092f7109abf6232098e0084c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18990) * 38aec912160b7531914cd4c07ea8317606f34616 UNKNOWN * d6d32a693c455830a31b883915e9940fa309c77f UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.
hudi-bot commented on PR #9324: URL: https://github.com/apache/hudi/pull/9324#issuecomment-1663312046 ## CI report: * 98e49fad21b4c7b1151e96c7a72b18caf5014a7f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18933) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18949) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18965) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18983) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19014)
[GitHub] [hudi] hudi-bot commented on pull request #9330: [HUDI-6622] Reuse the table config from HoodieTableMetaClient in the …
hudi-bot commented on PR #9330: URL: https://github.com/apache/hudi/pull/9330#issuecomment-1663279883 ## CI report: * 53e9bab71f8766ff092f7109abf6232098e0084c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18990) * 38aec912160b7531914cd4c07ea8317606f34616 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9261: [HUDI-6579] Adding support for upsert and deletes with spark datasource for pk less table
hudi-bot commented on PR #9261: URL: https://github.com/apache/hudi/pull/9261#issuecomment-1663274558 ## CI report: * 5b6c8a9f7e241fb76bc7112881e0a9cbbeb07a12 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19012)
[GitHub] [hudi] big-doudou commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.
big-doudou commented on issue #8892: URL: https://github.com/apache/hudi/issues/8892#issuecomment-1663269690 > I think he means check why #finalizeWrite is not picking up the files to be deleted upon commit? It would be great if there were a lighter solution; otherwise my task still needs to be rolled back.
[GitHub] [hudi] big-doudou commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.
big-doudou commented on issue #8892: URL: https://github.com/apache/hudi/issues/8892#issuecomment-1663268905 > I think he means check why #finalizeWrite is not picking up the files to be deleted upon commit? Yes, because those files are not visible to #getLatestFileSlices.
[GitHub] [hudi] voonhous commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.
voonhous commented on issue #8892: URL: https://github.com/apache/hudi/issues/8892#issuecomment-1663266922 I think he means: check why #finalizeWrite is not picking up the files to be deleted upon commit?
[jira] [Commented] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750563#comment-17750563 ]

Sagar Sumit commented on HUDI-6596:
-----------------------------------

[~krishen] Overall, your proposed approach seems robust and thoughtful. A few considerations:

> Acquire the table lock

The table lock could become a bottleneck, potentially leading to performance issues as other operations might be blocked too. It might be useful to consider how frequently you expect concurrent rollbacks to occur and whether this might create a performance problem.

> check for an active heartbeat for the rollback instant time. If there is one, then abort the rollback as that means there is a concurrent job executing that rollback.

Worth considering edge cases where heartbeats could become stale or be missed (e.g., if a job crashes without properly closing its heartbeat). Handling these scenarios gracefully will help ensure that rollbacks can still proceed when needed. Can we ensure rollbacks are idempotent in case of repeated failures or retries?

> Propose rollback implementation changes to guard against concurrent jobs
> -------------------------------------------------------------------------
>
>                 Key: HUDI-6596
>                 URL: https://issues.apache.org/jira/browse/HUDI-6596
>             Project: Apache Hudi
>          Issue Type: Wish
>            Reporter: Krishen Bhan
>            Priority: Trivial
>
> h1. Issue
> The existing rollback API in 0.14
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877
> executes a rollback plan, either taking in an existing rollback plan provided by the caller for a previous rollback attempt, or scheduling a new rollback instant if none is provided.
> Currently it is not safe for two concurrent jobs to call this API (when skipLocking=false and the callers aren't already holding a lock), as this can lead to an issue where multiple rollback requested plans are created, or two jobs are executing the same rollback instant at the same time.
> h1. Proposed change
> One way to resolve this issue is to refactor this rollback function such that, if skipLocking=false, the following steps are followed:
> # Acquire the table lock.
> # Reload the active timeline.
> # Look at the active timeline to see if there is an inflight rollback instant from a previous rollback attempt; if it exists, then assign this as the rollback plan to execute. Also, check if a pending rollback plan was passed in by the caller. Then execute the following steps, depending on whether the caller passed a pending rollback instant plan.
> ## [a] If a pending inflight rollback plan was passed in by the caller, then check that there is a previous attempted rollback instant on the timeline (and that the instant times match) and continue to use this rollback plan. If that isn't the case, then raise a rollback exception, since this means another job has concurrently already executed this plan. Note that in a valid HUDI dataset there can be at most one rollback instant for a corresponding commit instant, which is why, if we no longer see a pending rollback in the timeline in this phase, we can safely assume that it had already been executed to completion.
> ## [b] If no pending inflight rollback plan was passed in by the caller and no pending rollback instant was found in the timeline earlier, then schedule a new rollback plan.
> # Now that a rollback plan and requested rollback instant time have been assigned, check for an active heartbeat for the rollback instant time. If there is one, then abort the rollback, as that means there is a concurrent job executing that rollback. If not, then start a heartbeat for that rollback instant time.
> # Release the table lock.
> # Execute the rollback plan and complete the rollback instant. Regardless of whether this succeeds or fails with an exception, close the heartbeat. This increases the chance that the next job that tries to call this rollback API will follow through with the rollback and not abort due to an active previous heartbeat.
>
> * These steps will only be enforced for skipLocking=false, since if skipLocking=true then that means the caller may already be explicitly holding a table lock. In this case, acquiring the lock again in step (1) will fail.
> * Acquiring a lock and reloading the timeline for (1-3) will guard against data race conditions where another job calls this rollback API at the same time and schedules its own rollback plan and instant. This is since, if no rollback has been attempted before for this instant, then before step (1) there is a window of time where another concurrent rollback job could have scheduled a rollback plan, failed execution, and cleaned up its heartbeat.
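The numbered steps in the proposal above can be sketched with JDK primitives only: a `ReentrantLock` standing in for the table lock and an in-memory map standing in for the heartbeat files. All names here are illustrative, not Hudi APIs, and the timeline/plan resolution of steps 2-3 is elided.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, stdlib-only sketch of the proposed guard: take the table lock,
// abort if a heartbeat for the rollback instant is active, otherwise start one,
// release the lock, then execute the plan and always close the heartbeat.
public class RollbackGuardSketch {
    private final ReentrantLock tableLock = new ReentrantLock();
    private final Map<String, Long> heartbeats = new ConcurrentHashMap<>();

    /** Returns true if this caller executed the rollback, false if it aborted. */
    public boolean tryRollback(String rollbackInstant, Runnable executePlan) {
        tableLock.lock();                          // step 1: acquire the table lock
        try {
            // steps 2-3 (reload timeline, resolve the rollback plan) elided
            if (heartbeats.containsKey(rollbackInstant)) {
                return false;                      // step 4: concurrent job active -> abort
            }
            heartbeats.put(rollbackInstant, System.currentTimeMillis());
        } finally {
            tableLock.unlock();                    // step 5: release the table lock
        }
        try {
            executePlan.run();                     // step 6: execute the rollback plan
            return true;
        } finally {
            heartbeats.remove(rollbackInstant);    // always close the heartbeat
        }
    }

    public static void main(String[] args) {
        RollbackGuardSketch guard = new RollbackGuardSketch();
        final boolean[] nested = {true};
        boolean ran = guard.tryRollback("20230803001", () -> {
            // a concurrent caller arriving mid-execution sees the heartbeat and aborts
            nested[0] = guard.tryRollback("20230803001", () -> {});
        });
        System.out.println(ran);        // true
        System.out.println(nested[0]);  // false
    }
}
```

Closing the heartbeat in a `finally` block mirrors the last bullet of the proposal: even a failed execution leaves no active heartbeat behind, so the next caller can retry instead of aborting.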
[GitHub] [hudi] big-doudou commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.
big-doudou commented on issue #8892: URL: https://github.com/apache/hudi/issues/8892#issuecomment-1663264488 > https://github.com/apache/hudi/pull/9182 You can read danny0405's reply. He said that there will be another bootloader for the rollback. I haven't had time to test the details. I will check this issue in detail next week.
[GitHub] [hudi] voonhous commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.
voonhous commented on issue #8892:
URL: https://github.com/apache/hudi/issues/8892#issuecomment-1663258600

Spent 2 more hours looking at this issue:

What happened was that I was testing this on 0.12.1 without this PR: https://github.com/apache/hudi/pull/7208

To reproduce this error, add the snippet into `org.apache.hudi.sink.StreamWriteFunction#flushRemaining`:

```java
if (taskID == 0) {
  // trigger a failure
  throw new HoodieException("Intentional failure on taskID 0 thrown to invoke partial failover?");
}
```

Prior to this enhancement, rollbacks would be created whenever a TM failed, to remove all the partially written files. However, after this enhancement, rollbacks will not be created unless the job is restarted or a global failover happens.
[hudi] branch master updated: [HUDI-6320] Fix partition parsing in Spark file index for custom keygen (#9273)
This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 2d779fb5aa1 [HUDI-6320] Fix partition parsing in Spark file index for custom keygen (#9273)

2d779fb5aa1 is described below

commit 2d779fb5aa1ebfd33676ebf29217f25c60e17d12
Author: Sagar Sumit
AuthorDate: Thu Aug 3 09:17:38 2023 +0530

    [HUDI-6320] Fix partition parsing in Spark file index for custom keygen (#9273)
---
 .../scala/org/apache/hudi/HoodieFileIndex.scala    | 14 -
 .../apache/hudi/SparkHoodieTableFileIndex.scala    | 13 ++--
 .../scala/org/apache/hudi/cdc/HoodieCDCRDD.scala   |  2 +-
 .../org/apache/hudi/TestHoodieFileIndex.scala      | 34 ---
 .../apache/hudi/functional/TestCOWDataSource.scala | 69 +-
 5 files changed, 99 insertions(+), 33 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
index 3767b65a8ce..a7e90b2fe50 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
@@ -79,7 +79,7 @@ case class HoodieFileIndex(spark: SparkSession,
     spark = spark,
     metaClient = metaClient,
     schemaSpec = schemaSpec,
-    configProperties = getConfigProperties(spark, options),
+    configProperties = getConfigProperties(spark, options, metaClient),
     queryPaths = HoodieFileIndex.getQueryPaths(options),
     specifiedQueryInstant = options.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key).map(HoodieSqlCommonUtils.formatQueryInstant),
     fileStatusCache = fileStatusCache
@@ -324,7 +324,7 @@ object HoodieFileIndex extends Logging {
     schema.fieldNames.filter { colName => refs.exists(r => resolver.apply(colName, r.name)) }
   }

-  def getConfigProperties(spark: SparkSession, options: Map[String, String]) = {
+  def getConfigProperties(spark: SparkSession, options: Map[String, String], metaClient: HoodieTableMetaClient) = {
     val sqlConf: SQLConf = spark.sessionState.conf
     val properties = TypedProperties.fromMap(options.filter(p => p._2 != null).asJava)
@@ -342,6 +342,16 @@
     if (listingModeOverride != null) {
       properties.setProperty(DataSourceReadOptions.FILE_INDEX_LISTING_MODE_OVERRIDE.key, listingModeOverride)
     }
+    val partitionColumns = metaClient.getTableConfig.getPartitionFields
+    if (partitionColumns.isPresent) {
+      // NOTE: Multiple partition fields could have non-encoded slashes in the partition value.
+      //       We might not be able to properly parse partition-values from the listed partition-paths.
+      //       Fallback to eager listing in this case.
+      if (partitionColumns.get().length > 1
+        && (listingModeOverride == null || DataSourceReadOptions.FILE_INDEX_LISTING_MODE_LAZY.equals(listingModeOverride))) {
+        properties.setProperty(DataSourceReadOptions.FILE_INDEX_LISTING_MODE_OVERRIDE.key, DataSourceReadOptions.FILE_INDEX_LISTING_MODE_EAGER)
+      }
+    }
     properties
   }

diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala
index 35ef3e9f066..b3d9e5659e8 100644
--- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala
+++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala
@@ -29,11 +29,9 @@ import org.apache.hudi.common.model.{FileSlice, HoodieTableQueryType}
 import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
 import org.apache.hudi.common.util.ValidationUtils.checkState
 import org.apache.hudi.config.HoodieBootstrapConfig.DATA_QUERIES_ONLY
-import org.apache.hudi.hadoop.CachingPath
-import org.apache.hudi.hadoop.CachingPath.createRelativePathUnsafe
 import org.apache.hudi.internal.schema.Types.RecordType
 import org.apache.hudi.internal.schema.utils.Conversions
-import org.apache.hudi.keygen.{StringPartitionPathFormatter, TimestampBasedAvroKeyGenerator, TimestampBasedKeyGenerator}
+import org.apache.hudi.keygen.{CustomAvroKeyGenerator, CustomKeyGenerator, StringPartitionPathFormatter, TimestampBasedAvroKeyGenerator, TimestampBasedKeyGenerator}
 import org.apache.hudi.util.JFunction
 import org.apache.spark.api.java.JavaSparkContext
 import org.apache.spark.internal.Logging
@@ -44,7 +42,6 @@ import org.apache.spark.sql.catalyst.{InternalRow, expressions}
 import org.apache.spark.sql.execution.datasources.{FileStatusCache, NoopCache}
 import org.ap
[GitHub] [hudi] codope merged pull request #9273: [HUDI-6320] Fix partition parsing in Spark file index for custom keygen
codope merged PR #9273: URL: https://github.com/apache/hudi/pull/9273
[GitHub] [hudi] hudi-bot commented on pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable
hudi-bot commented on PR #9327: URL: https://github.com/apache/hudi/pull/9327#issuecomment-1663249477 ## CI report: * d875b12ed9e6742f2ad1a2dcd8405d7ab74295a2 UNKNOWN * 06b31f2908be2285ad9e270195684f488cfff2bc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19003) * 9f6586fa89ccbb464f282c46df781f0280a14762 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19024)
[GitHub] [hudi] hudi-bot commented on pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable
hudi-bot commented on PR #9327: URL: https://github.com/apache/hudi/pull/9327#issuecomment-1663245042 ## CI report: * d875b12ed9e6742f2ad1a2dcd8405d7ab74295a2 UNKNOWN * 06b31f2908be2285ad9e270195684f488cfff2bc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19003) * 9f6586fa89ccbb464f282c46df781f0280a14762 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9273: [HUDI-6320] Fix partition parsing in Spark file index for custom keygen
hudi-bot commented on PR #9273: URL: https://github.com/apache/hudi/pull/9273#issuecomment-1663244910 ## CI report: * 3b54d26d8787cdb0cc1bccd86bcaa2e40b3d94a7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18981)
[GitHub] [hudi] hudi-bot commented on pull request #9350: [HUDI-2141] Support flink read metrics
hudi-bot commented on PR #9350: URL: https://github.com/apache/hudi/pull/9350#issuecomment-1663240644 ## CI report: * f36281ccc97ad7a566fd73ddc40543e573ce68b0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19022)
[GitHub] [hudi] hudi-bot commented on pull request #9273: [HUDI-6320] Fix partition parsing in Spark file index for custom keygen
hudi-bot commented on PR #9273: URL: https://github.com/apache/hudi/pull/9273#issuecomment-1663240436 ## CI report: * 3b54d26d8787cdb0cc1bccd86bcaa2e40b3d94a7 UNKNOWN
[GitHub] [hudi] zbbkeepgoing opened a new issue, #9351: [SUPPORT] The point query performance after clustering lags behind Delta Lake.
zbbkeepgoing opened a new issue, #9351: URL: https://github.com/apache/hudi/issues/9351 **Describe the problem you faced** - Our scenario We have 700 million records in our original offline table, distributed across 10 partitions. Each partition has a different data size, ranging from 10GB to 200GB. We plan to ingest this data into a data lake and test the point query performance after applying Clustering. - Point query scenario The original table has a column called "vin," which will be used as a filter along with the time partition column for point queries. - Hudi configuration ``` hoodie.clustering.plan.strategy.target.file.max.bytes is set to 1GB, consistent with Delta Lake's default value. hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy hoodie.clustering.plan.strategy.sort.columns=vin hoodie.clustering.rollback.pending.replacecommit.on.conflict=true hoodie.clustering.plan.strategy.daybased.lookback.partitions=10 hoodie.clustering.plan.partition.filter.mode=SELECTED_PARTITIONS hoodie.clustering.plan.strategy.cluster.begin.partition=part_dt=20230614 hoodie.clustering.plan.strategy.cluster.end.partition=part_dt=20230623 hoodie.clustering.plan.strategy.max.bytes.per.group=17179869184 hoodie.clustering.plan.strategy.max.num.groups=128 hoodie.layout.optimize.enable=true hoodie.layout.optimize.strategy=z-order ``` - Phenomena we observed 1. After Clustering, both Hudi and Delta Lake produce Parquet files of approximately 1GB, with an error margin of around 200MB. 2. With Clustering applied, when performing point queries, Hudi scans around 10 files in partitions with larger data, while Delta Lake typically scans only 1-2 files regardless of the partition. 3. We conducted performance tests with 10 concurrent and 1 concurrent queries. We ran hundreds of rounds of tests on both Hudi and Delta Lake, with different combinations of "vin" and time partition columns. 
The final conclusion was that Delta Lake performs three times better than Hudi. After examining Hudi's file-listing code, we found that Hudi primarily uses column statistics (min and max values) to retrieve candidate files. Therefore, we believe that the file-listing logic itself is unlikely to be the cause of the performance gap; it is highly likely that the issue lies in the clustering algorithm itself. Can you please analyze, from a professional perspective, what the reason behind this is? The answer determines which data lake technology we ultimately choose.

**Expected behavior**

Point query performance after clustering is comparable to Delta Lake.

**Environment Description**

* Hudi version : 0.13.1
* Spark version : 3.3
* Hive version : 2.3.9
* Hadoop version : 2.x
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
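The file-pruning behavior the reporter describes can be illustrated with a small, hypothetical sketch (a language-neutral stand-in, not Hudi's actual implementation): a file is a candidate for a point query only if the queried value falls inside that file's [min, max] range for the filter column, so overlapping ranges after clustering translate directly into extra files scanned.

```python
# Hypothetical sketch of min/max column-statistics pruning (not Hudi's real code).
# Each file carries per-column statistics: here, (min, max) of the "vin" column.
def prune_files(file_stats, value):
    """Return the files whose [min, max] range could contain `value`."""
    return [name for name, (lo, hi) in file_stats.items() if lo <= value <= hi]

# Well-separated ranges (tight clustering): a point query hits one file.
tight = {"f1": ("VIN0000", "VIN0399"), "f2": ("VIN0400", "VIN0799")}
print(prune_files(tight, "VIN0500"))   # ['f2']

# Overlapping ranges (looser layout): the same query hits several files,
# matching the "Hudi scans ~10 files" observation in large partitions.
loose = {"f1": ("VIN0000", "VIN0599"), "f2": ("VIN0300", "VIN0799")}
print(prune_files(loose, "VIN0500"))   # ['f1', 'f2']
```

Whether ranges overlap depends on how the clustering strategy orders records: a multi-column space-filling curve such as z-order trades single-column range tightness for multi-column locality, which is one plausible place where the two engines' layouts diverge for a single-column "vin" filter.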
[GitHub] [hudi] eric9204 commented on a diff in pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable
eric9204 commented on code in PR #9327: URL: https://github.com/apache/hudi/pull/9327#discussion_r1282589478

## hudi-common/src/test/java/org/apache/hudi/common/model/TestHoodieRecordDelegate.java:

```
@@ -70,4 +78,24 @@ public void testKryoSerializeDeserialize() {
     assertEquals(new HoodieRecordLocation("001", "file01"), hoodieRecordDelegate.getCurrentLocation().get());
     assertEquals(new HoodieRecordLocation("001", "file-01"), hoodieRecordDelegate.getNewLocation().get());
   }
+
+  public Kryo getKryoInstance() {
+    final Kryo kryo = new Kryo();
+    // This instance of Kryo should not require prior registration of classes
+    kryo.setRegistrationRequired(false);
+    kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new StdInstantiatorStrategy()));
+    // Handle cases where we may have an odd classloader setup like with libjars
+    // for hadoop
+    kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
+
+    // Register Hudi's classes
+    new HoodieCommonKryoRegistrar().registerClasses(kryo);
+
+    // Register serializers
+    kryo.register(Utf8.class, new SerializationUtils.AvroUtf8Serializer());
+    kryo.register(GenericData.Fixed.class, new GenericAvroSerializer<>());
```

Review Comment: No, the member variable types of `HoodieRecordDelegate` don't contain Avro types, so those serializers shouldn't be registered. This has been updated.
[GitHub] [hudi] danny0405 commented on a diff in pull request #9350: [HUDI-2141] Support flink read metrics
danny0405 commented on code in PR #9350: URL: https://github.com/apache/hudi/pull/9350#discussion_r1282587487

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadMonitoringFunction.java:

```
@@ -262,9 +268,16 @@ public void snapshotState(FunctionSnapshotContext context) throws Exception {
     this.instantState.clear();
     if (this.issuedInstant != null) {
       this.instantState.add(this.issuedInstant);
+      this.readMetrics.setIssuedInstant(this.issuedInstant);
     }
     if (this.issuedOffset != null) {
```

Review Comment: Do the metrics get updated for each read?

## hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/StreamReadOperator.java:

```
@@ -168,6 +174,8 @@ private void processSplits() throws IOException {
       currentSplitState = SplitState.IDLE;
     }
+    readMetrics.setSplitLatestCommit(split.getLatestCommit());
+
```

Review Comment: Do the metrics get updated for each read?
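Per the author's replies on this review, each metric has a natural update cadence: the split's latest commit is refreshed for every new split processed, and the issued instant for every checkpoint. A minimal, hypothetical sketch of that pattern (a plain Python stand-in, not the actual Flink metrics API, where these values would be exposed through gauges):

```python
# Hypothetical stand-in for the read-metrics holder discussed in the review.
class StreamReadMetricsSketch:
    def __init__(self):
        self.split_latest_commit = None  # refreshed once per processed split
        self.issued_instant = None       # refreshed once per checkpoint

    def set_split_latest_commit(self, commit):
        self.split_latest_commit = commit

    def set_issued_instant(self, instant):
        self.issued_instant = instant

metrics = StreamReadMetricsSketch()
for commit in ["20230801101010", "20230802101010"]:  # two splits arrive
    metrics.set_split_latest_commit(commit)          # per-split update
metrics.set_issued_instant("20230802101010")         # per-checkpoint update
print(metrics.split_latest_commit)                   # 20230802101010
```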
[hudi] branch master updated: [MINOR] Pass prepped boolean correctly in sql writer (#9320)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 62a9279d666 [MINOR] Pass prepped boolean correctly in sql writer (#9320) 62a9279d666 is described below commit 62a9279d46fd7abe1872857ea2f94fdedd46 Author: Sagar Sumit AuthorDate: Thu Aug 3 08:22:59 2023 +0530 [MINOR] Pass prepped boolean correctly in sql writer (#9320) --- .../scala/org/apache/hudi/HoodieSparkSqlWriter.scala | 3 +-- .../sql/hudi/command/MergeIntoHoodieTableCommand.scala | 16 .../hudi/TestMergeIntoTableWithNonRecordKeyField.scala | 3 --- 3 files changed, 9 insertions(+), 13 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala index fcee3fdab49..07b16e1e47d 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala @@ -404,8 +404,7 @@ object HoodieSparkSqlWriter { hoodieRecords } client.startCommitWithTime(instantTime, commitActionType) -val writeResult = DataSourceUtils.doWriteOperation(client, dedupedHoodieRecords, instantTime, operation, - isPrepped) +val writeResult = DataSourceUtils.doWriteOperation(client, dedupedHoodieRecords, instantTime, operation, isPrepped) (writeResult, client) } diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala index eba75c95452..f830c552bc8 100644 --- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala 
+++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala @@ -24,8 +24,8 @@ import org.apache.hudi.HoodieSparkSqlWriter.CANONICALIZE_NULLABLE import org.apache.hudi.avro.HoodieAvroUtils import org.apache.hudi.common.model.HoodieAvroRecordMerger import org.apache.hudi.common.util.StringUtils -import org.apache.hudi.config.HoodieWriteConfig.{AVRO_SCHEMA_VALIDATE_ENABLE, SCHEMA_ALLOW_AUTO_EVOLUTION_COLUMN_DROP, TBL_NAME} import org.apache.hudi.config.HoodieWriteConfig +import org.apache.hudi.config.HoodieWriteConfig.{AVRO_SCHEMA_VALIDATE_ENABLE, SCHEMA_ALLOW_AUTO_EVOLUTION_COLUMN_DROP, TBL_NAME} import org.apache.hudi.exception.HoodieException import org.apache.hudi.hive.HiveSyncConfigHolder import org.apache.hudi.sync.common.HoodieSyncConfig @@ -342,7 +342,9 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie val tableMetaCols = mergeInto.targetTable.output.filter(a => isMetaField(a.name)) val joinData = sparkAdapter.getCatalystPlanUtils.createMITJoin(mergeInto.sourceTable, mergeInto.targetTable, LeftOuter, Some(mergeInto.mergeCondition), "NONE") val incomingDataCols = joinData.output.filterNot(mergeInto.targetTable.outputSet.contains) -val projectedJoinPlan = if (sparkSession.sqlContext.conf.getConfString(SPARK_SQL_OPTIMIZED_WRITES.key(), SPARK_SQL_OPTIMIZED_WRITES.defaultValue()) == "true") { +// for pkless table, we need to project the meta columns +val hasPrimaryKey = hoodieCatalogTable.tableConfig.getRecordKeyFields.isPresent +val projectedJoinPlan = if (!hasPrimaryKey || sparkSession.sqlContext.conf.getConfString(SPARK_SQL_OPTIMIZED_WRITES.key(), "false") == "true") { Project(tableMetaCols ++ incomingDataCols, joinData) } else { Project(incomingDataCols, joinData) @@ -619,12 +621,10 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie // default value ("ts") // TODO(HUDI-3456) clean up val preCombineField = 
hoodieCatalogTable.preCombineKey.getOrElse("") - val hiveSyncConfig = buildHiveSyncConfig(sparkSession, hoodieCatalogTable, tableConfig) - -val enableOptimizedMerge = sparkSession.sqlContext.conf.getConfString(SPARK_SQL_OPTIMIZED_WRITES.key(), - SPARK_SQL_OPTIMIZED_WRITES.defaultValue()) - +// for pkless tables, we need to enable optimized merge +val hasPrimaryKey = tableConfig.getRecordKeyFields.isPresent +val enableOptimizedMerge = if (!hasPrimaryKey) "true" else sparkSession.sqlContext.conf.getConfString(SPARK_SQL_OPTIMIZED_WRITES.key(), "false") val keyGeneratorClassName = if (enableOptimizedMerge == "true") { classOf[MergeIntoKeyGenerator].getCanonicalName } else { @@ -653,7 +653,7 @@ case class MergeIntoHoodieTab
[GitHub] [hudi] codope merged pull request #9320: [MINOR] Infer prepped boolean correctly and disable prepped write for MergeInto
codope merged PR #9320: URL: https://github.com/apache/hudi/pull/9320
[hudi] branch master updated: [HUDI-6569] Fix write failure for Avro Enum type (#9237)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 4017b96fb1b [HUDI-6569] Fix write failure for Avro Enum type (#9237) 4017b96fb1b is described below commit 4017b96fb1bf47283d0d16deea28fb5dc806d8eb Author: Y Ethan Guo AuthorDate: Wed Aug 2 19:52:02 2023 -0700 [HUDI-6569] Fix write failure for Avro Enum type (#9237) - Fix a regression for Avro ENUM type. - Adds logic to handle ENUM type in `HoodieAvroUtils.rewriteRecordWithNewSchemaInternal` and `AvroDeserializer`. --- .../commit/TestJavaCopyOnWriteActionExecutor.java | 3 +- .../src/test/resources/testDataGeneratorSchema.txt | 132 .../commit/TestCopyOnWriteActionExecutor.java | 3 +- .../GenericRecordValidationTestUtils.java | 5 + .../src/test/resources/testDataGeneratorSchema.txt | 132 .../java/org/apache/hudi/avro/HoodieAvroUtils.java | 18 +- .../common/testutils/HoodieTestDataGenerator.java | 9 +- .../apache/hudi/common/util/TestAvroOrcUtils.java | 5 +- .../apache/spark/sql/avro/AvroDeserializer.scala | 1 + .../apache/spark/sql/avro/AvroDeserializer.scala | 1 + .../apache/spark/sql/avro/AvroDeserializer.scala | 1 + .../apache/spark/sql/avro/AvroDeserializer.scala | 15 +- .../apache/spark/sql/avro/AvroDeserializer.scala | 13 +- .../apache/spark/sql/avro/AvroDeserializer.scala | 1 + .../streamer-config/source-flattened.avsc | 101 + .../src/test/resources/streamer-config/source.avsc | 228 +++- .../resources/streamer-config/source_evolved.avsc | 4 + .../source_evolved_post_processed.avsc | 4 + .../streamer-config/sql-transformer.properties | 2 +- .../streamer-config/target-flattened.avsc | 108 ++ .../src/test/resources/streamer-config/target.avsc | 235 - 21 files changed, 451 insertions(+), 570 deletions(-) diff --git 
a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestJavaCopyOnWriteActionExecutor.java b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestJavaCopyOnWriteActionExecutor.java index a272585b360..f57b21d89be 100644 --- a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestJavaCopyOnWriteActionExecutor.java +++ b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestJavaCopyOnWriteActionExecutor.java @@ -408,11 +408,10 @@ public class TestJavaCopyOnWriteActionExecutor extends HoodieJavaClientTestHarne @Test public void testInsertUpsertWithHoodieAvroPayload() throws Exception { -Schema schema = getSchemaFromResource(TestJavaCopyOnWriteActionExecutor.class, "/testDataGeneratorSchema.txt"); HoodieWriteConfig config = HoodieWriteConfig.newBuilder() .withEngineType(EngineType.JAVA) .withPath(basePath) -.withSchema(schema.toString()) +.withSchema(TRIP_EXAMPLE_SCHEMA) .withStorageConfig(HoodieStorageConfig.newBuilder() .parquetMaxFileSize(1000 * 1024).hfileMaxFileSize(1000 * 1024).build()) .build(); diff --git a/hudi-client/hudi-java-client/src/test/resources/testDataGeneratorSchema.txt b/hudi-client/hudi-java-client/src/test/resources/testDataGeneratorSchema.txt deleted file mode 100644 index c80365b76ea..000 --- a/hudi-client/hudi-java-client/src/test/resources/testDataGeneratorSchema.txt +++ /dev/null @@ -1,132 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -{ - "type" : "record", - "name" : "triprec", - "fields" : [ - { -"name" : "timestamp", -"type" : "long" - }, { -"name" : "_row_key", -"type" : "string" - }, { - "name" : "partition_path", - "type" : ["null", "string"], - "default": null - }, { -"name" : "rider", -"type" : "string" - }, { -"name" : "driver", -"type" : "string" - }, { -"name" : "begin_lat", -"type" : "double" - }, { -"name" : "begin_lon", -"type" : "double" - }, { -
[GitHub] [hudi] codope merged pull request #9237: [HUDI-6569] Fix write failure for Avro Enum type
codope merged PR #9237: URL: https://github.com/apache/hudi/pull/9237
[GitHub] [hudi] hudi-bot commented on pull request #9350: [HUDI-2141] Support flink read metrics
hudi-bot commented on PR #9350: URL: https://github.com/apache/hudi/pull/9350#issuecomment-1663214407 ## CI report: * f36281ccc97ad7a566fd73ddc40543e573ce68b0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] xuzifu666 closed pull request #9349: [MINOR] JSR dependency not used in spark3.3 version
xuzifu666 closed pull request #9349: [MINOR] JSR dependency not used in spark3.3 version URL: https://github.com/apache/hudi/pull/9349
[GitHub] [hudi] hudi-bot commented on pull request #9349: [MINOR] JSR dependency not used in spark3.3 version
hudi-bot commented on PR #9349: URL: https://github.com/apache/hudi/pull/9349#issuecomment-1663209721 ## CI report: * 7c3142bdb0e1b1c677e61495e42c81e44916e1a0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19021)
[GitHub] [hudi] hudi-bot commented on pull request #9336: [HUDI-6629] - Changes for s3/gcs IncrSource job to taken into sourceLimit during ingestion
hudi-bot commented on PR #9336: URL: https://github.com/apache/hudi/pull/9336#issuecomment-1663209678 ## CI report: * 77d7b455ee5cd668a005f6f7e6f04135608f2b7a UNKNOWN * 1af3c1cd31e9ec695e98e8c2f58cb6ed03ce6dc4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19009)
[GitHub] [hudi] stream2000 commented on pull request #9350: Support flink read metrics
stream2000 commented on PR #9350: URL: https://github.com/apache/hudi/pull/9350#issuecomment-1663209301 @danny0405 Could you help review this PR?
[GitHub] [hudi] stream2000 opened a new pull request, #9350: Support flink read metrics
stream2000 opened a new pull request, #9350: URL: https://github.com/apache/hudi/pull/9350

### Change Logs

Subtask of HUDI-2141: support Flink read metrics. For stream write metrics and compaction metrics, see #9118.

### Impact

Adds some metrics.

### Risk level (write none, low medium or high below)

none

### Documentation Update

Will update the documentation after merge.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [ ] CI passed
[GitHub] [hudi] danny0405 commented on issue #8848: [SUPPORT] Hive Sync tool fails to sync Hoodi table written using Flink 1.16 to HMS
danny0405 commented on issue #8848: URL: https://github.com/apache/hudi/issues/8848#issuecomment-1663205818

Yeah, maybe it's my fault: we do not exclude calcite when packaging the bundle with hive-exec. Perhaps for some Hive versions since 3.x the calcite-related classes are required, while hive-exec itself does not include calcite. Do you build the package using the same version of hive-exec as your Hive server?
[GitHub] [hudi] hudi-bot commented on pull request #9349: [MINOR] JSR dependency not used in spark3.3 version
hudi-bot commented on PR #9349: URL: https://github.com/apache/hudi/pull/9349#issuecomment-1663204878 ## CI report: * 7c3142bdb0e1b1c677e61495e42c81e44916e1a0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9337: [HUDI-6628] Rely on methods in HoodieBaseFile and HoodieLogFile instead of FSUtils when possible
hudi-bot commented on PR #9337: URL: https://github.com/apache/hudi/pull/9337#issuecomment-1663204832 ## CI report: * 9cbb48c5cad3d7b467a05eee5a692900539ed863 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19010)
[GitHub] [hudi] hudi-bot commented on pull request #9237: [HUDI-6569] Fix write failure for Avro Enum type
hudi-bot commented on PR #9237: URL: https://github.com/apache/hudi/pull/9237#issuecomment-1663204597 ## CI report: * a30830bcec5f907c190d3349be68297f72a158c1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18988)
[GitHub] [hudi] xuzifu666 commented on pull request #9001: [HUDI-6402] Hudi Spark3.3 and upper version need close JavaTimeModule For JsonUtils
xuzifu666 commented on PR #9001: URL: https://github.com/apache/hudi/pull/9001#issuecomment-1663203677

> > what is your hudi version?
>
> @xuzifu666, I am building hudi from the master branch.
>
> > and from the stack, and jar submit, it maybe your user jar contains jsr depency and version is too low
>
> I don't have a user jar. Everything here is hudi codebase. I am just trying to run the integration tests from command line. The only dependency I see is on jackson 2.10
>
> `mvn clean dependency:tree -Dincludes=com.fasterxml.jackson.datatype -Pintegration-tests`
>
> This has to do something with the runtime setup. Note the package name in `NoClassDefFoundError` message. It is looking for `JavaTimeModule` in the wrong package somehow:
>
> `java.lang.NoClassDefFoundError: org/apache/hudi/com/fasterxml/jackson/datatype/jsr310/JavaTimeModule`

https://github.com/apache/hudi/pull/9349 Can this PR resolve your problem? @amrishlal
[GitHub] [hudi] danny0405 commented on pull request #9199: [HUDI-6534]Support consistent hashing row writer
danny0405 commented on PR #9199: URL: https://github.com/apache/hudi/pull/9199#issuecomment-1663203592 @leesf, is it good to land now? We still have 2 days until the 0.14.0 code freeze.
[GitHub] [hudi] SteNicholas commented on pull request #9287: [HUDI-6592] Flink insert overwrite should support dynamic partition and whole table
SteNicholas commented on PR #9287: URL: https://github.com/apache/hudi/pull/9287#issuecomment-1663192615 @danny0405, the current behavior and config are consistent with Spark insert overwrite. PTAL.
[GitHub] [hudi] xuzifu666 commented on pull request #9349: [MINOR] JSR dependency not used in spark3.3 version
xuzifu666 commented on PR #9349: URL: https://github.com/apache/hudi/pull/9349#issuecomment-1663176001 cc @xushiyan, please have a review.
[GitHub] [hudi] xuzifu666 opened a new pull request, #9349: [MINOR] JSR dependency not used in spark3.3 version
xuzifu666 opened a new pull request, #9349: URL: https://github.com/apache/hudi/pull/9349

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

JSR dependency not used in spark3.3 version

### Impact

_Describe any public API or user-facing feature change or any performance impact._

none

### Risk level (write none, low medium or high below)

_If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update

_Describe any necessary documentation update if there is any new feature, config, or user-facing change_

- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[GitHub] [hudi] danny0405 commented on pull request #7580: [HUDI-5434] Fix archival in metadata table to not rely on completed rollback or clean in data table
danny0405 commented on PR #7580: URL: https://github.com/apache/hudi/pull/7580#issuecomment-1663174983

> have you done some work on this

No.
[GitHub] [hudi] Zouxxyy commented on pull request #7580: [HUDI-5434] Fix archival in metadata table to not rely on completed rollback or clean in data table
Zouxxyy commented on PR #7580: URL: https://github.com/apache/hudi/pull/7580#issuecomment-1663173182

@danny0405

> That's true, we should optimize the archiving of cleaning and rollback.

I see you are working on an LSM-tree-based archived timeline; have you done some work on this? If not, I'd like to work on it. The current process of `getInstantsToArchive` is a bit complicated, and I will sort it out as a whole.
[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6568] Hudi Spark Integration Redesign
hudi-bot commented on PR #9276: URL: https://github.com/apache/hudi/pull/9276#issuecomment-1663165947 ## CI report: * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN * f179c083ce951ed076bc382ee252c89d8e07d49d Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19013) * 293ae466c121508e2e1d0b32c384c99ea1eea707 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19018)
[jira] [Updated] (HUDI-6627) Spark write client fails when write schema is null
[ https://issues.apache.org/jira/browse/HUDI-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-6627:
-----------------------------
    Fix Version/s: 0.14.0

> Spark write client fails when write schema is null
> --------------------------------------------------
>
>          Key: HUDI-6627
>          URL: https://issues.apache.org/jira/browse/HUDI-6627
>      Project: Apache Hudi
>   Issue Type: Bug
>     Reporter: Vinish Reddy
>     Priority: Minor
>       Labels: pull-request-available
>      Fix For: 0.14.0
>
> When the source returns an empty option in deltastreamer, the writer schema is null. This causes an NPE in the table schema validation in the Spark write client, producing the exception below. We should skip this validation when the writer schema is null.
> {code:java}
> org.apache.hudi.exception.HoodieInsertException: Failed insert schema compability check.
> 	at org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:851)
> 	at org.apache.hudi.client.SparkRDDWriteClient.insert(SparkRDDWriteClient.java:185)
> 	at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:690)
> 	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:396)
> 	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.ingestOnce(HoodieDeltaStreamer.java:876)
> 	at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
> 	at com.onehouse.hudi.OnehouseDeltaStreamer$MultiTableSyncService.lambda$null$1(OnehouseDeltaStreamer.java:319)
> 	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:750)
> Caused by: org.apache.hudi.exception.HoodieException: Failed to read schema/check compatibility for base path s3a://onehouse-customer-bucket-2451e78f/data-lake/chandra_data_lake_default/xml_flatten_struct_test
> 	at org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:830)
> 	at org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:849)
> 	... 10 more
> Caused by: java.lang.NullPointerException
> 	at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:1158)
> 	at org.apache.avro.Schema$Parser.parse(Schema.java:1418)
> 	at org.apache.hudi.avro.HoodieAvroUtils.createHoodieWriteSchema(HoodieAvroUtils.java:302)
> 	at org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:826)
> 	... 11 more
> {code}

-- This message was sent by Atlassian Jira (v8.20.10#820010)
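The fix the ticket proposes (skip validation when the writer schema is null) can be sketched abstractly as follows. This is a hypothetical Python stand-in for the guard, not the actual Java change in `HoodieTable`:

```python
# Hypothetical sketch of the proposed guard: when the writer schema is absent
# (e.g. the source returned an empty batch), skip the compatibility check
# entirely instead of handing a null schema to the parser (the NPE site).
def validate_insert_schema(writer_schema, check_compatibility):
    if writer_schema is None:
        return True  # nothing was written, so there is nothing to validate
    return check_compatibility(writer_schema)

def failing_check(schema):
    # Simulates the parser blowing up on a null schema.
    raise ValueError("parser reached with null schema")

# With the guard, a null schema never reaches the parser:
print(validate_insert_schema(None, failing_check))              # True
# A real schema still goes through the normal compatibility check:
print(validate_insert_schema('{"type": "record"}', lambda s: True))  # True
```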
[jira] [Closed] (HUDI-6627) Spark write client fails when write schema is null
[ https://issues.apache.org/jira/browse/HUDI-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6627. Resolution: Fixed Fixed via master branch: 95d0fb5d3276936a3638baed31edc4d9fe0d1f34 > Spark write client fails when write schema is null > -- > > Key: HUDI-6627 > URL: https://issues.apache.org/jira/browse/HUDI-6627 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinish Reddy >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.0 > > > When source returns an empty option in deltastreamer, the writer schema is > null. This causes an NPE with the table schema validation in spark write > client causing the below exception. We should skip this validation when > writer schema is null. > {code:java} > org.apache.hudi.exception.HoodieInsertException: Failed insert schema > compability check. > at > org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:851) > at > org.apache.hudi.client.SparkRDDWriteClient.insert(SparkRDDWriteClient.java:185) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:690) > at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:396) > at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.ingestOnce(HoodieDeltaStreamer.java:876) > at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) > at > com.onehouse.hudi.OnehouseDeltaStreamer$MultiTableSyncService.lambda$null$1(OnehouseDeltaStreamer.java:319) > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > Caused by: org.apache.hudi.exception.HoodieException: Failed to read > schema/check compatibility for base path > 
s3a://onehouse-customer-bucket-2451e78f/data-lake/chandra_data_lake_default/xml_flatten_struct_test > at > org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:830) > at > org.apache.hudi.table.HoodieTable.validateInsertSchema(HoodieTable.java:849) > ... 10 more > Caused by: java.lang.NullPointerException > at > com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:1158) > at org.apache.avro.Schema$Parser.parse(Schema.java:1418) > at > org.apache.hudi.avro.HoodieAvroUtils.createHoodieWriteSchema(HoodieAvroUtils.java:302) > at > org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:826) > ... 11 more > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated: [HUDI-6627] Fix NPE when spark client writer schema is null (#9335)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 95d0fb5d327 [HUDI-6627] Fix NPE when spark client writer schema is null (#9335) 95d0fb5d327 is described below commit 95d0fb5d3276936a3638baed31edc4d9fe0d1f34 Author: Vinish Reddy AuthorDate: Thu Aug 3 06:39:13 2023 +0530 [HUDI-6627] Fix NPE when spark client writer schema is null (#9335) --- .../java/org/apache/hudi/table/HoodieTable.java| 5 +- .../hudi/testutils/HoodieClientTestBase.java | 6 +- .../apache/hudi/functional/TestWriteClient.java| 87 ++ 3 files changed, 96 insertions(+), 2 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java index 71295098f03..12584be55a4 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java @@ -62,6 +62,7 @@ import org.apache.hudi.common.table.view.TableFileSystemView.SliceView; import org.apache.hudi.common.util.ClusteringUtils; import org.apache.hudi.common.util.Functions; import org.apache.hudi.common.util.Option; +import org.apache.hudi.common.util.StringUtils; import org.apache.hudi.common.util.ValidationUtils; import org.apache.hudi.common.util.collection.Pair; import org.apache.hudi.config.HoodieWriteConfig; @@ -825,7 +826,9 @@ public abstract class HoodieTable implements Serializable { boolean shouldValidate = config.shouldValidateAvroSchema(); boolean allowProjection = config.shouldAllowAutoEvolutionColumnDrop(); if ((!shouldValidate && allowProjection) -|| getActiveTimeline().getCommitsTimeline().filterCompletedInstants().empty()) { +|| getActiveTimeline().getCommitsTimeline().filterCompletedInstants().empty() +|| 
StringUtils.isNullOrEmpty(config.getSchema()) +) { // Check not required return; } diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestBase.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestBase.java index 454236b4278..569e8d36d89 100644 --- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestBase.java +++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestBase.java @@ -158,7 +158,7 @@ public class HoodieClientTestBase extends HoodieClientTestHarness { */ public HoodieWriteConfig.Builder getConfigBuilder(String schemaStr, IndexType indexType, HoodieFailedWritesCleaningPolicy cleaningPolicy) { -return HoodieWriteConfig.newBuilder().withPath(basePath).withSchema(schemaStr) +HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder().withPath(basePath) .withParallelism(2, 2).withBulkInsertParallelism(2).withFinalizeWriteParallelism(2).withDeleteParallelism(2) .withTimelineLayoutVersion(TimelineLayoutVersion.CURR_VERSION) .withWriteStatusClass(MetadataMergeWriteStatus.class) @@ -172,6 +172,10 @@ public class HoodieClientTestBase extends HoodieClientTestHarness { .withEnableBackupForRemoteFileSystemView(false) // Fail test if problem connecting to timeline-server .withRemoteServerPort(timelineServicePort) .withStorageType(FileSystemViewStorageType.EMBEDDED_KV_STORE).build()); +if (StringUtils.nonEmpty(schemaStr)) { + builder.withSchema(schemaStr); +} +return builder; } public HoodieSparkTable getHoodieTable(HoodieTableMetaClient metaClient, HoodieWriteConfig config) { diff --git a/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestWriteClient.java b/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestWriteClient.java new file mode 100644 index 000..7acf6b2b6b0 --- /dev/null +++ 
b/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestWriteClient.java @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * W
[GitHub] [hudi] danny0405 merged pull request #9335: [HUDI-6627] Fix NPE when spark client writer schema is null
danny0405 merged PR #9335: URL: https://github.com/apache/hudi/pull/9335 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable
danny0405 commented on code in PR #9327: URL: https://github.com/apache/hudi/pull/9327#discussion_r1282541082 ## hudi-common/src/test/java/org/apache/hudi/common/model/TestHoodieRecordDelegate.java: ## @@ -70,4 +78,24 @@ public void testKryoSerializeDeserialize() { assertEquals(new HoodieRecordLocation("001", "file01"), hoodieRecordDelegate.getCurrentLocation().get()); assertEquals(new HoodieRecordLocation("001", "file-01"), hoodieRecordDelegate.getNewLocation().get()); } + + public Kryo getKryoInstance() { +final Kryo kryo = new Kryo(); +// This instance of Kryo should not require prior registration of classes +kryo.setRegistrationRequired(false); +kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new StdInstantiatorStrategy())); +// Handle cases where we may have an odd classloader setup like with libjars +// for hadoop +kryo.setClassLoader(Thread.currentThread().getContextClassLoader()); + +// Register Hudi's classes +new HoodieCommonKryoRegistrar().registerClasses(kryo); + +// Register serializers +kryo.register(Utf8.class, new SerializationUtils.AvroUtf8Serializer()); +kryo.register(GenericData.Fixed.class, new GenericAvroSerializer<>()); Review Comment: Do we need a registration for avro classes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.
danny0405 commented on code in PR #9324: URL: https://github.com/apache/hudi/pull/9324#discussion_r1282536263 ## pom.xml: ## @@ -98,8 +98,6 @@ ${fasterxml.spark3.version} ${fasterxml.spark3.version} ${fasterxml.spark3.version} - - Review Comment: @amrishlal You are right, if Hudi bundle jar shade the class anyway, we should always include the jar in the bundle, or any reference to the JSR class could encounter class not found exception. Another choice is we do not shade the JSR clazz, do we need a shade here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-6615) Fix append mode and BulkInsertWriterHelper in flink
[ https://issues.apache.org/jira/browse/HUDI-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6615. Resolution: Fixed Fixed via master branch: 9f2087a89443e93079d061fd81bf2f768f9c6953 > Fix append mode and BulkInsertWriterHelper in flink > > > Key: HUDI-6615 > URL: https://issues.apache.org/jira/browse/HUDI-6615 > Project: Apache Hudi > Issue Type: Bug >Reporter: zouxxyy >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6615) Fix append mode and BulkInsertWriterHelper in flink
[ https://issues.apache.org/jira/browse/HUDI-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6615: - Fix Version/s: 0.14.0 > Fix append mode and BulkInsertWriterHelper in flink > > > Key: HUDI-6615 > URL: https://issues.apache.org/jira/browse/HUDI-6615 > Project: Apache Hudi > Issue Type: Bug >Reporter: zouxxyy >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated: [HUDI-6615] Fix the condition of isInputSorted in BulkInsertWriterHelper (#9314)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 9f2087a8944 [HUDI-6615] Fix the condition of isInputSorted in BulkInsertWriterHelper (#9314) 9f2087a8944 is described below commit 9f2087a89443e93079d061fd81bf2f768f9c6953 Author: Zouxxyy AuthorDate: Thu Aug 3 08:50:31 2023 +0800 [HUDI-6615] Fix the condition of isInputSorted in BulkInsertWriterHelper (#9314) --- .../apache/hudi/configuration/OptionsResolver.java | 8 .../hudi/sink/bulk/BulkInsertWriterHelper.java | 3 ++- .../java/org/apache/hudi/sink/utils/Pipelines.java | 11 ++- .../apache/hudi/streamer/HoodieFlinkStreamer.java | 2 +- .../org/apache/hudi/table/HoodieTableSink.java | 5 ++--- .../apache/hudi/sink/ITTestDataStreamWrite.java| 2 +- .../hudi/sink/bucket/ITTestBucketStreamWrite.java | 23 +- .../bucket/ITTestConsistentBucketStreamWrite.java | 5 ++--- 8 files changed, 19 insertions(+), 40 deletions(-) diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java index 8f4b013de04..944e795dc2f 100644 --- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java +++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java @@ -76,6 +76,14 @@ public class OptionsResolver { return operationType == WriteOperationType.INSERT; } + /** + * Returns whether the table operation is 'bulk_insert'. + */ + public static boolean isBulkInsertOperation(Configuration conf) { +WriteOperationType operationType = WriteOperationType.fromValue(conf.getString(FlinkOptions.OPERATION)); +return operationType == WriteOperationType.BULK_INSERT; + } + /** * Returns whether it is a MERGE_ON_READ table. 
*/ diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/BulkInsertWriterHelper.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/BulkInsertWriterHelper.java index 56f668e32f0..3c0d4fb7662 100644 --- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/BulkInsertWriterHelper.java +++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bulk/BulkInsertWriterHelper.java @@ -22,6 +22,7 @@ import org.apache.hudi.client.WriteStatus; import org.apache.hudi.common.model.HoodieRecord; import org.apache.hudi.config.HoodieWriteConfig; import org.apache.hudi.configuration.FlinkOptions; +import org.apache.hudi.configuration.OptionsResolver; import org.apache.hudi.exception.HoodieException; import org.apache.hudi.io.storage.row.HoodieRowDataCreateHandle; import org.apache.hudi.table.HoodieTable; @@ -84,7 +85,7 @@ public class BulkInsertWriterHelper { this.taskEpochId = taskEpochId; this.rowType = preserveHoodieMetadata ? rowType : addMetadataFields(rowType, writeConfig.allowOperationMetadataField()); // patch up with metadata fields this.preserveHoodieMetadata = preserveHoodieMetadata; -this.isInputSorted = conf.getBoolean(FlinkOptions.WRITE_BULK_INSERT_SORT_INPUT); +this.isInputSorted = OptionsResolver.isBulkInsertOperation(conf) && conf.getBoolean(FlinkOptions.WRITE_BULK_INSERT_SORT_INPUT); this.fileIdPrefix = UUID.randomUUID().toString(); this.keyGen = preserveHoodieMetadata ? 
null : RowDataKeyGen.instance(conf, rowType); } diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java index 5d945d07aa1..fe51fe435e1 100644 --- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java +++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java @@ -202,19 +202,12 @@ public class Pipelines { * @param conf The configuration * @param rowTypeThe input row type * @param dataStream The input data stream - * @param boundedWhether the input stream is bounded * @return the appending data stream sink */ public static DataStream append( Configuration conf, RowType rowType, - DataStream dataStream, - boolean bounded) { -if (!bounded) { - // In principle, the config should be immutable, but the boundedness - // is only visible when creating the sink pipeline. - conf.setBoolean(FlinkOptions.WRITE_BULK_INSERT_SORT_INPUT, false); -} + DataStream dataStream) { WriteOperatorFactory operatorFactory = AppendWriteOperator.getFactory(conf, rowType); return dataStream @@ -469,7 +462,7 @@ public class Pipelines { }
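The core condition of this commit can be sketched independently of Flink. A hedged Java sketch — the config keys mirror `FlinkOptions.OPERATION` and `FlinkOptions.WRITE_BULK_INSERT_SORT_INPUT` from the diff, but the `Map`-based config is an illustrative stand-in for Flink's `Configuration`:

```java
import java.util.Map;

// Minimal sketch of the HUDI-6615 fix: treat the input as pre-sorted only
// when the operation really is 'bulk_insert' AND the sort-input flag is on,
// so append (plain INSERT) pipelines never claim a sorted input.
public class SortInputResolver {

  // Mirrors OptionsResolver.isBulkInsertOperation from the diff above.
  static boolean isBulkInsertOperation(Map<String, String> conf) {
    return "bulk_insert".equalsIgnoreCase(conf.getOrDefault("write.operation", "upsert"));
  }

  // Mirrors the patched BulkInsertWriterHelper condition.
  static boolean isInputSorted(Map<String, String> conf) {
    boolean sortFlag = Boolean.parseBoolean(
        conf.getOrDefault("write.bulk_insert.sort_input", "true"));
    return isBulkInsertOperation(conf) && sortFlag;
  }

  public static void main(String[] args) {
    // An append pipeline is not sorted even though the sort flag defaults on.
    System.out.println(isInputSorted(Map.of("write.operation", "insert")));
    // A bulk_insert pipeline with the default flag is treated as sorted.
    System.out.println(isInputSorted(Map.of("write.operation", "bulk_insert")));
  }
}
```

This removes the need for the old `bounded` parameter in `Pipelines.append`, which mutated the configuration just to disable sorting for unbounded streams.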
[GitHub] [hudi] danny0405 merged pull request #9314: [HUDI-6615] Fix the condition of isInputSorted in BulkInsertWriterHelper
danny0405 merged PR #9314: URL: https://github.com/apache/hudi/pull/9314
[GitHub] [hudi] xuzifu666 commented on pull request #9001: [HUDI-6402] Hudi Spark3.3 and upper version need close JavaTimeModule For JsonUtils
xuzifu666 commented on PR #9001: URL: https://github.com/apache/hudi/pull/9001#issuecomment-1663139322 > > need close JavaTimeModule For JsonUtils > > @xuzifu666 can you help me understand what the PR title means? When using a Spark version at or below 3.2, a class-not-found error is reported for the JSR class; at the time this was left as a TODO fix, so the check was moved to the Spark adapter so that Spark can run correctly. @xushiyan
[GitHub] [hudi] hudi-bot commented on pull request #9332: [HUDI-6625] Lazy create metadata and viewManager in HoodieTable
hudi-bot commented on PR #9332: URL: https://github.com/apache/hudi/pull/9332#issuecomment-1663137677 ## CI report: * daa28a4bd88b29bf80b19210dfb4a54667e07cae UNKNOWN * 67e656c397338a14432a09013f007b0840c89db9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19007) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6568] Hudi Spark Integration Redesign
hudi-bot commented on PR #9276: URL: https://github.com/apache/hudi/pull/9276#issuecomment-1663137528 ## CI report: * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN * 87e8f76e3d97d5b3b2fc10fe7704395575cc1b79 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19005) * f179c083ce951ed076bc382ee252c89d8e07d49d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19013) * 293ae466c121508e2e1d0b32c384c99ea1eea707 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #7580: [HUDI-5434] Fix archival in metadata table to not rely on completed rollback or clean in data table
danny0405 commented on PR #7580: URL: https://github.com/apache/hudi/pull/7580#issuecomment-1663133848 > then these rollback instants will stay in the active timeline forever. That's true, we should optimize the archiving of cleaning and rollback.
[GitHub] [hudi] danny0405 commented on pull request #9001: [HUDI-6402] Hudi Spark3.3 and upper version need close JavaTimeModule For JsonUtils
danny0405 commented on PR #9001: URL: https://github.com/apache/hudi/pull/9001#issuecomment-1663133061 > n the wrong package somehow My guess is that some Spark versions ship the dependency and the Hudi pom shades the class, while other Spark versions do not have this dependency at all.
[jira] [Updated] (HUDI-6632) Revert FileSystemBackedTableMetadata#getAllPartitionPaths improvements due to HUDI-6476
[ https://issues.apache.org/jira/browse/HUDI-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6632: - Labels: pull-request-available (was: ) > Revert FileSystemBackedTableMetadata#getAllPartitionPaths improvements due to > HUDI-6476 > --- > > Key: HUDI-6632 > URL: https://issues.apache.org/jira/browse/HUDI-6632 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9343: [HUDI-6632] Revert "[HUDI-6476] Improve the performance of getAllPartitionPaths (#9121)"
hudi-bot commented on PR #9343: URL: https://github.com/apache/hudi/pull/9343#issuecomment-1663132727 ## CI report: * 9d8464b88ac7656685cdd06f74efb6600b7d2250 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19006) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9314: [HUDI-6615] Fix the condition of isInputSorted in BulkInsertWriterHelper
hudi-bot commented on PR #9314: URL: https://github.com/apache/hudi/pull/9314#issuecomment-1663132594 ## CI report: * 416c1dfc455a53bfe1d5367b7ab6d02aabd3a6dd UNKNOWN * cec56b320c0b83d49edbc453a9e50934c661d87d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19008) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-6466) Spark's capcity of insert overwrite partitioned table with dynamic partition lost
[ https://issues.apache.org/jira/browse/HUDI-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen closed HUDI-6466. Resolution: Fixed Fixed via master branch: d67455a4a713e295bba1d0a5d338fcfbe5af217e > Spark's capcity of insert overwrite partitioned table with dynamic partition > lost > - > > Key: HUDI-6466 > URL: https://issues.apache.org/jira/browse/HUDI-6466 > Project: Apache Hudi > Issue Type: Bug >Reporter: yonghua jian >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > Mentioned as [#7365 > (comment)|https://github.com/apache/hudi/pull/7365#issuecomment-1338371540] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6466) Spark's capcity of insert overwrite partitioned table with dynamic partition lost
[ https://issues.apache.org/jira/browse/HUDI-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-6466: - Fix Version/s: 0.14.0 > Spark's capcity of insert overwrite partitioned table with dynamic partition > lost > - > > Key: HUDI-6466 > URL: https://issues.apache.org/jira/browse/HUDI-6466 > Project: Apache Hudi > Issue Type: Bug >Reporter: yonghua jian >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > Mentioned as [#7365 > (comment)|https://github.com/apache/hudi/pull/7365#issuecomment-1338371540] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated (8da99f8a5c9 -> d67455a4a71)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 8da99f8a5c9 [HUDI-6540] Support failed writes clean policy for Flink (#9211) add d67455a4a71 [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition (#9113) No new revisions were added by this update. Summary of changes: .../scala/org/apache/hudi/DataSourceOptions.scala | 9 + .../spark/sql/hudi/ProvidesHoodieConfig.scala | 84 ++-- .../command/InsertIntoHoodieTableCommand.scala | 16 +- .../apache/spark/sql/hudi/TestInsertTable.scala| 228 + 4 files changed, 225 insertions(+), 112 deletions(-)
[GitHub] [hudi] danny0405 merged pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition
danny0405 merged PR #9113: URL: https://github.com/apache/hudi/pull/9113
[GitHub] [hudi] danny0405 commented on pull request #9113: [HUDI-6466] Fix spark insert overwrite partitioned table with dynamic partition
danny0405 commented on PR #9113: URL: https://github.com/apache/hudi/pull/9113#issuecomment-1663129390 The failed test should be a flaky one: https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=18976&view=logs&j=4c665d41-fe93-5d6b-3716-d7e63fa41849&t=f7ca1aa0-5550-5ab6-0ee3-d8a5a59e7ac4
[GitHub] [hudi] amrishlal commented on issue #9282: [ISSUE] Hudi 0.13.0. Spark 3.3.2 Deltastreamed table read failure
amrishlal commented on issue #9282: URL: https://github.com/apache/hudi/issues/9282#issuecomment-1663107557 @rmnlchh @ad1happy2go I am looking at the following part of the stack trace: ``` Cause: java.lang.IllegalArgumentException: For input string: "null" at scala.collection.immutable.StringLike.parseBoolean(StringLike.scala:330) at scala.collection.immutable.StringLike.toBoolean(StringLike.scala:289) at scala.collection.immutable.StringLike.toBoolean$(StringLike.scala:289) at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:33) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.<init>(ParquetSchemaConverter.scala:70) at org.apache.spark.sql.execution.datasources.parquet.HoodieParquetFileFormatHelper$.buildImplicitSchemaChangeInfo(HoodieParquetFileFormatHelper.scala:30) ``` The stack trace indicates that there was a problem while trying to convert a string value into a boolean (see the code line at [spark v3.3.2 ParquetSchemaConverter.scala:70](https://github.com/apache/spark/blob/5103e00c4ce5fcc4264ca9c4df12295d42557af6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L70) which I have pasted below): `conf.get(SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.key).toBoolean)` This line indicates that you need to set `spark.sql.legacy.parquet.nanosAsLong` to either 'true' or 'false' to avoid this exception (see the definition of [LEGACY_PARQUET_NANOS_AS_LONG](https://github.com/apache/spark/blob/5103e00c4ce5fcc4264ca9c4df12295d42557af6/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3462C87-L3462C87) here). Please let me know if this doesn't fix the issue.
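The failure mode in that stack trace is easy to reproduce in isolation. A hedged sketch — `strictToBoolean` is an illustrative stand-in for Scala's `StringLike.toBoolean`, not Spark's API; the conf lookup yielding the literal string "null" when the key is unset is what the trace suggests:

```java
// Sketch of the reported failure: an unset conf value surfaces as the
// literal string "null", and a strict boolean parser that accepts only
// "true"/"false" (like Scala's StringLike.toBoolean) rejects it with
// IllegalArgumentException: For input string: "null".
public class NanosAsLongRepro {

  static boolean strictToBoolean(String s) {
    if ("true".equalsIgnoreCase(s)) return true;
    if ("false".equalsIgnoreCase(s)) return false;
    throw new IllegalArgumentException("For input string: \"" + s + "\"");
  }

  public static void main(String[] args) {
    // Workaround from the comment above: set the conf explicitly before
    // reading, e.g. spark.conf.set("spark.sql.legacy.parquet.nanosAsLong", "false"),
    // so the lookup returns a parseable value.
    System.out.println(strictToBoolean("false"));

    try {
      strictToBoolean("null"); // what happens when the conf is left unset
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```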
[GitHub] [hudi] hudi-bot commented on pull request #9347: Upgrade aws java sdk to v2
hudi-bot commented on PR #9347: URL: https://github.com/apache/hudi/pull/9347#issuecomment-1663103718 ## CI report: * 4e17424eda9aa3bd50841ebc0f8846305b27f6d2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19015) * d2360a5a7de655991202680013d20268ce325666 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19016) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] ys8 opened a new issue, #9348: [SUPPORT] hide soft-deleted rows
ys8 opened a new issue, #9348: URL: https://github.com/apache/hudi/issues/9348 If my reading is correct, `SELECT COUNT(*) FROM hudi_table` still includes soft-deleted rows. If that's true, is there a way to completely hide soft-deleted rows from SELECT queries? [https://github.com/apache/hudi/blob/8da99f8a5c9ce3abd5a5a14baf3a8db81c3d39f0/hudi-[…]/hudi-examples-spark/src/test/python/HoodiePySparkQuickstart.py](https://github.com/apache/hudi/blob/8da99f8a5c9ce3abd5a5a14baf3a8db81c3d39f0/hudi-examples/hudi-examples-spark/src/test/python/HoodiePySparkQuickstart.py#L144-L185)
[GitHub] [hudi] hudi-bot commented on pull request #9347: Upgrade aws java sdk to v2
hudi-bot commented on PR #9347: URL: https://github.com/apache/hudi/pull/9347#issuecomment-1663099148 ## CI report: * 4e17424eda9aa3bd50841ebc0f8846305b27f6d2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19015) * d2360a5a7de655991202680013d20268ce325666 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9320: [MINOR] Infer prepped boolean correctly and disable prepped write for MergeInto
hudi-bot commented on PR #9320: URL: https://github.com/apache/hudi/pull/9320#issuecomment-1663094309 ## CI report: * 70e8bc9077123ca463bcc5912eb080ef37c36d3f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19004)
[GitHub] [hudi] amrishlal commented on a diff in pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.
amrishlal commented on code in PR #9324: URL: https://github.com/apache/hudi/pull/9324#discussion_r1282491522 ## packaging/hudi-integ-test-bundle/pom.xml: ## @@ -319,12 +319,19 @@ com.fasterxml.jackson.module jackson-module-scala_${scala.binary.version} + ${fasterxml.jackson.module.scala.version} com.fasterxml.jackson.dataformat jackson-dataformat-yaml - 2.7.4 + ${fasterxml.spark3.version} + + + + com.fasterxml.jackson.datatype + jackson-datatype-jsr310 + ${fasterxml.spark3.version} Review Comment: Also using `${fasterxml.jackson.module.scala.version}` and `${fasterxml.jackson.dataformat.yaml.version}` to pull in the appropriate version for `jackson-module-scala_${scala.binary.version}` and `jackson-dataformat-yaml`.
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6242:
---
Description:

This EPIC tracks changes to the Hudi storage format. A format change is anything that changes any bits related to:
- *Timeline*: active or archived timeline contents, file names.
- *Base Files*: file format versions, any changes to any data types, file footers, file names.
- *Log Files*: block structure, content, names.
- *Metadata Table*: (should we call this the index table instead?) partition names, number of file groups, key/value schema, and metadata-to-MDT row mappings.
- *Table properties*: what's written to hoodie.properties.
- *Marker files*: how would we treat these?

The following functionality should be supportable by the new format tech specs (at a minimum):

Flexibility:
- Ability to mix different types of base files within a single table or even a single file group (e.g. images, json, vectors ...)
- Easy integration of metadata for JVM and non-JVM clients

Metafields:
- Should _recordkey get special uuid handling?

Additional Info:
- Support encoding of watermarks/event-time fields as first-class citizens, for handling late-arriving data.
- Position-based skipping within the base file.
- Additional metadata to avoid more RPCs to scan base file/log blocks.
- ML/column-family use case?
- Support having a changeset of columns in each write, other headers.

Log:
- Support writing updates as deletes and inserts, instead of logging them as updates to the base file.
- CDC format is GA.

Table organization:
- Support different logical partitions on the same data.
- Storage of a table spread across buckets/root folders.
- Decouple table location from timeline and metadata; they can all be in different places.

Concurrency/Timeline:
- Ability to support general-purpose multi-table transactions, esp. between data and metadata tables.
- Support lockless/non-blocking transactions, where writers don't block each other even in the face of conflicts.
- Support for long-lived instants in the timeline; break down the distinction between active/archived.
- Support checking of uniqueness constraints, even in the face of two concurrent insert transactions.
- Support precise time-travel queries.
- Support time-travel writes.
- Support schema history tracking and aid in the schema evolution implementation.
- TrueTime store/support for instant times.

Metadata table:
- Encode file group ID and commit time along with file metadata.

Table Properties:
- Partitioning information/indexing info.

> Format changes for Hudi
[GitHub] [hudi] hudi-bot commented on pull request #9347: Upgrade aws java sdk to v2
hudi-bot commented on PR #9347: URL: https://github.com/apache/hudi/pull/9347#issuecomment-1663065177 ## CI report: * 4e17424eda9aa3bd50841ebc0f8846305b27f6d2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19015)
[GitHub] [hudi] hudi-bot commented on pull request #9347: Upgrade aws java sdk to v2
hudi-bot commented on PR #9347: URL: https://github.com/apache/hudi/pull/9347#issuecomment-1663059384 ## CI report: * 4e17424eda9aa3bd50841ebc0f8846305b27f6d2 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.
hudi-bot commented on PR #9324: URL: https://github.com/apache/hudi/pull/9324#issuecomment-1663053455 ## CI report: * 98e49fad21b4c7b1151e96c7a72b18caf5014a7f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18933) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18949) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18965) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=18983) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19014)
[GitHub] [hudi] mansipp opened a new pull request, #9347: Upgrade aws java sdk to v2
mansipp opened a new pull request, #9347: URL: https://github.com/apache/hudi/pull/9347 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ ### Risk level (write none, low medium or high below) _If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] amrishlal commented on pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.
amrishlal commented on PR #9324: URL: https://github.com/apache/hudi/pull/9324#issuecomment-1663036555 @hudi-bot run azure
[GitHub] [hudi] amrishlal commented on a diff in pull request #9324: [HUDI-6619] [WIP] Fix hudi-integ-test-bundle dependency on jackson jsk310 package.
amrishlal commented on code in PR #9324: URL: https://github.com/apache/hudi/pull/9324#discussion_r1282461432 ## packaging/hudi-integ-test-bundle/pom.xml: ## @@ -319,12 +319,19 @@ com.fasterxml.jackson.module jackson-module-scala_${scala.binary.version} + ${fasterxml.jackson.module.scala.version} com.fasterxml.jackson.dataformat jackson-dataformat-yaml - 2.7.4 + ${fasterxml.spark3.version} + + + + com.fasterxml.jackson.datatype + jackson-datatype-jsr310 + ${fasterxml.spark3.version} Review Comment: Based on offline discussion, I modified this to `${fasterxml.version}` to pick up the right jackson package version for a given Spark version.
[GitHub] [hudi] hudi-bot commented on pull request #9327: [HUDI-6617] make HoodieRecordDelegate implement KryoSerializable
hudi-bot commented on PR #9327: URL: https://github.com/apache/hudi/pull/9327#issuecomment-1662993536 ## CI report: * d875b12ed9e6742f2ad1a2dcd8405d7ab74295a2 UNKNOWN * 06b31f2908be2285ad9e270195684f488cfff2bc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19003)
[GitHub] [hudi] hudi-bot commented on pull request #9335: [HUDI-6627] Fix NPE when spark client writer schema is null
hudi-bot commented on PR #9335: URL: https://github.com/apache/hudi/pull/9335#issuecomment-1662984338 ## CI report: * b1091bdeaf25dcd95f567a8e50c2c6d4dc80fb79 UNKNOWN * 6386813364fd15848d9e63f4b77ce31c63e8a815 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19001)
[GitHub] [hudi] hudi-bot commented on pull request #9287: [HUDI-6592] Flink insert overwrite should support dynamic partition and whole table
hudi-bot commented on PR #9287: URL: https://github.com/apache/hudi/pull/9287#issuecomment-1662984101 ## CI report: * 6d171098737180ae6c8dcdf8cfb717e03359b300 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19002)
[GitHub] [hudi] dat-vikash commented on issue #8892: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.
dat-vikash commented on issue #8892: URL: https://github.com/apache/hudi/issues/8892#issuecomment-1662943359 Seeing this in flink 1.16.1 and hudi 0.13.1 with MoR tables and single writer (flink)
[jira] [Updated] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krishen Bhan updated HUDI-6596:
---
Description:

h1. Issue

The existing rollback API in 0.14 [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877] executes a rollback plan, either taking in an existing rollback plan provided by the caller for a previous rollback attempt, or scheduling a new rollback instant if none is provided. Currently it is not safe for two concurrent jobs to call this API (when skipLocking=false and the callers aren't already holding a lock), as this can lead to multiple requested rollback plans being created, or to two jobs executing the same rollback instant at the same time.

h1. Proposed change

One way to resolve this issue is to refactor the rollback function so that, when skipLocking=false, the following steps are followed:
# Acquire the table lock.
# Reload the active timeline.
# Check the active timeline for an inflight rollback instant from a previous rollback attempt; if one exists, assign it as the rollback plan to execute. Also check whether a pending rollback plan was passed in by the caller, then proceed depending on whether one was:
## [a] If a pending inflight rollback plan was passed in by the caller, verify that there is a previous attempted rollback instant on the timeline (and that the instant times match) and continue to use this rollback plan. If that isn't the case, raise a rollback exception, since this means another job has already executed this plan concurrently. Note that in a valid Hudi dataset there can be at most one rollback instant for a corresponding commit instant, which is why, if we no longer see a pending rollback in the timeline at this point, we can safely assume it has already been executed to completion.
## [b] If no pending inflight rollback plan was passed in by the caller and no pending rollback instant was found in the timeline earlier, schedule a new rollback plan.
# Now that a rollback plan and a requested rollback instant time have been assigned, check for an active heartbeat for the rollback instant time. If there is one, abort the rollback, since a concurrent job is executing that rollback. If not, start a heartbeat for that rollback instant time.
# Release the table lock.
# Execute the rollback plan and complete the rollback instant. Regardless of whether this succeeds or fails with an exception, close the heartbeat. This increases the chance that the next job to call this rollback API will follow through with the rollback rather than abort due to an active previous heartbeat.

* These steps are only enforced for skipLocking=false, since skipLocking=true means the caller may already be explicitly holding a table lock; in that case, acquiring the lock again in step (1) would fail.
* Acquiring a lock and reloading the timeline in steps (1-3) guards against race conditions where another job calls this rollback API at the same time and schedules its own rollback plan and instant. If no rollback has been attempted before for this instant, then before step (1) there is a window in which another concurrent rollback job could have scheduled a rollback plan, failed execution, and cleaned up its heartbeat, all while the current job is running. As a result, even if the current job was passed an empty pending rollback plan, it still needs to check the active timeline to ensure that no new pending rollback instant has been created.
* The heartbeat signals to callers in other jobs that another job is already executing this rollback. Checking for an expired heartbeat and (re)starting the heartbeat must be done under a lock, so that multiple jobs don't each start it at the same time and assume they are the only one heartbeating.
* The table lock is no longer needed after step (5), since it can then be safely assumed that no other job (calling this rollback API) will execute this rollback instant.

One example implementation to achieve this:

{code:java}
@Deprecated
public boolean rollback(final String commitInstantTime, Option pendingRollbackInfo, boolean skipLocking,
                        Option rollbackInstantTimeOpt) throws HoodieRollbackException {
  final Timer.Context timerContext = this.metrics.getRollbackCtx();
  final Option commitInstantOpt;
  final HoodieTable table;
  try {
    table = createTable(config, hadoopConf);
  } catch (Exception e) {
    throw new HoodieRollbackException("Failed to initalize table for rollback " + config.getBasePath()
        + " commits " + commitInstantTime, e);
  }
  final String rollbackInstantTime;
  final boolean deleteInstants
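The lock-and-heartbeat guard in steps 1-6 can be sketched as a minimal, self-contained Java model. This is a hypothetical simplification, not Hudi's implementation: the ReentrantLock, heartbeat map, and pending-plan map below stand in for the table lock, heartbeat client, and active timeline, and the names GuardedRollback, scheduleNewPlan, and executePlan are invented for this example.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the proposed guarded rollback flow (skipLocking=false path).
class GuardedRollback {
  private final ReentrantLock tableLock = new ReentrantLock();
  // instant time -> marker that some job is currently executing that rollback (stand-in for heartbeats)
  private final Map<String, Boolean> activeHeartbeats = new ConcurrentHashMap<>();
  // pending rollback plans "on the timeline", keyed by rollback instant time
  private final Map<String, String> pendingRollbacks = new ConcurrentHashMap<>();

  /** Returns true if this job executed the rollback, false if it aborted on a live heartbeat. */
  public boolean rollback(String rollbackInstant, Optional<String> callerPlan) {
    String plan;
    tableLock.lock();                                   // step 1: acquire table lock
    try {
      // steps 2-3: "reload" the timeline and reconcile it with the caller's plan
      String onTimeline = pendingRollbacks.get(rollbackInstant);
      if (callerPlan.isPresent()) {
        if (onTimeline == null) {
          // step 3a: the plan is gone from the timeline, so another job already executed it
          throw new IllegalStateException("rollback already executed: " + rollbackInstant);
        }
        plan = onTimeline;
      } else {
        // step 3b: reuse the pending plan if one exists, otherwise schedule a new one
        plan = (onTimeline != null) ? onTimeline : scheduleNewPlan(rollbackInstant);
      }
      // step 4: abort if another job holds a heartbeat; otherwise start ours atomically
      if (activeHeartbeats.putIfAbsent(rollbackInstant, Boolean.TRUE) != null) {
        return false;
      }
    } finally {
      tableLock.unlock();                               // step 5: release table lock
    }
    try {
      executePlan(plan);                                // step 6: execute outside the lock
      pendingRollbacks.remove(rollbackInstant);         // mark the rollback instant complete
      return true;
    } finally {
      activeHeartbeats.remove(rollbackInstant);         // always close the heartbeat
    }
  }

  private String scheduleNewPlan(String instant) {
    String plan = "plan-for-" + instant;
    pendingRollbacks.put(instant, plan);
    return plan;
  }

  private void executePlan(String plan) { /* rollback work elided in this sketch */ }
}
```

In this model, a second job targeting the same instant either reuses the pending plan, aborts on a live heartbeat, or fails fast if the plan was already executed, mirroring cases [a] and [b] above.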
[jira] [Updated] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krishen Bhan updated HUDI-6596: --- Description: h1. Issue The existing rollback API in 0.14 [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877] executes a rollback plan, either taking in an existing rollback plan provided by the caller for a previous rollback attempt, or scheduling a new rollback instant if none is provided. Currently it is not safe for two concurrent jobs to call this API (when skipLocking=false and the callers aren't already holding a lock), as this can lead to a situation where multiple requested rollback plans are created or two jobs execute the same rollback instant at the same time. h1. Proposed change One way to resolve this issue is to refactor this rollback function so that, if skipLocking=false, the following steps are followed: # Acquire the table lock # Reload the active timeline # Look at the active timeline to see if there is an inflight rollback instant from a previous rollback attempt; if it exists, assign it as the rollback plan to execute. Also check whether a pending rollback plan was passed in by the caller. Then execute the following steps, depending on whether the caller passed a pending rollback instant plan: ## [a] If a pending inflight rollback plan was passed in by the caller, check that there is a previously attempted rollback instant on the timeline (and that the instant times match) and continue to use this rollback plan. If that isn't the case, raise a rollback exception, since this means another job has already executed this plan concurrently. 
Note that in a valid HUDI dataset there can be at most one rollback instant for a corresponding commit instant, which is why, if we no longer see a pending rollback in the timeline in this phase, we can safely assume it has already been executed to completion. ## [b] If no pending inflight rollback plan was passed in by the caller, schedule a new rollback plan if no pending rollback instant was found in the timeline earlier. # Now that a rollback plan and a requested rollback instant time have been assigned, check for an active heartbeat for the rollback instant time. If there is one, abort the rollback, as that means a concurrent job is executing that rollback. If not, start a heartbeat for that rollback instant time. # Release the table lock # Execute the rollback plan and complete the rollback instant. Regardless of whether this succeeds or fails with an exception, close the heartbeat. This increases the chance that the next job calling this rollback API will follow through with the rollback rather than abort due to an active previous heartbeat. * These steps are only enforced for skipLocking=false, since skipLocking=true means the caller may already be explicitly holding a table lock; in that case, acquiring the lock again in step (1) would fail. * Acquiring a lock and reloading the timeline for (1-3) guards against race conditions where another job calls this rollback API at the same time and schedules its own rollback plan and instant. This is because, if no rollback has been attempted before for this instant, then before step (1) there is a window of time in which another concurrent rollback job could have scheduled a rollback plan, failed execution, and cleaned up its heartbeat, all while the current rollback job is running. As a result, even if the current job was passed an empty pending rollback plan, it still needs to check the active timeline to ensure that no new pending rollback instant has been created. 
* Using a heartbeat will signal to callers in other jobs that another job is already executing this rollback. Checking for an expired heartbeat and (re)starting the heartbeat has to be done under a lock, so that multiple jobs don't each start it at the same time and each assume it is the only one heartbeating. * The table lock is no longer needed after (5), since it can now be safely assumed that no other job (calling this rollback API) will execute this rollback instant. One example implementation to achieve this: {code:java} @Deprecated public boolean rollback(final String commitInstantTime, Option pendingRollbackInfo, boolean skipLocking, Option rollbackInstantTimeOpt) throws HoodieRollbackException { final Timer.Context timerContext = this.metrics.getRollbackCtx(); final Option commitInstantOpt; final HoodieTable table; try { table = createTable(config, hadoopConf); } catch (Exception e) { throw new HoodieRollbackException("Failed to initialize table for rollback " + config.getBasePath() + " commits " + commitInstantTime, e); } final String rollbackInstantTime; final boolean deleteInstantsDu
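The lock-plus-heartbeat handoff proposed above can be modeled in isolation. The sketch below is a minimal, self-contained Java simulation, not Hudi's actual lock provider or HoodieHeartbeatClient API: `RollbackGuard`, `HEARTBEAT_EXPIRY_MS`, and the method names are hypothetical stand-ins chosen for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the proposed guard: a table lock around plan resolution,
// plus a heartbeat registry keyed by rollback instant time.
class RollbackGuard {
    private final ReentrantLock tableLock = new ReentrantLock();              // stands in for the table lock
    private final Map<String, Long> heartbeats = new ConcurrentHashMap<>();   // instant -> heartbeat start millis
    private static final long HEARTBEAT_EXPIRY_MS = 60_000L;                  // illustrative expiry window

    /** Returns true if this caller wins the right to execute the rollback. */
    boolean tryAcquireRollback(String rollbackInstant, long nowMs) {
        tableLock.lock();                                   // step 1: acquire table lock
        try {
            // steps 2-3 (reload timeline, resolve the rollback plan) are elided in this sketch
            Long started = heartbeats.get(rollbackInstant);
            if (started != null && nowMs - started < HEARTBEAT_EXPIRY_MS) {
                return false;                               // step 4: active heartbeat -> concurrent job, abort
            }
            heartbeats.put(rollbackInstant, nowMs);         // step 4: start (or retake an expired) heartbeat
            return true;
        } finally {
            tableLock.unlock();                             // step 5: lock not needed once the heartbeat is ours
        }
    }

    /** Step 6: close the heartbeat whether execution succeeded or failed. */
    void closeHeartbeat(String rollbackInstant) {
        heartbeats.remove(rollbackInstant);
    }
}
```

In the real client the heartbeat would presumably be backed by heartbeat files on storage and the lock by the configured lock provider; the sketch only captures the ordering the proposal depends on: heartbeat check and start under the lock, execution outside it, and heartbeat close on both success and failure.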
[GitHub] [hudi] bhasudha commented on a diff in pull request #9338: [DOCS] Update bootstrap page
bhasudha commented on code in PR #9338: URL: https://github.com/apache/hudi/pull/9338#discussion_r1282377412 ## website/docs/migration_guide.md: ## @@ -69,12 +79,28 @@ for partition in [list of partitions in source table] { } ``` -**Option 3** +**Option 3 using Spark SQL CALL Procedure** + +Refer to [Bootstrap procedure](https://hudi.apache.org/docs/next/procedures#bootstrap) for more details. + +**Option 4 using Hudi CLI** + Write your own custom logic of how to load an existing table into a Hudi managed one. Please read about the RDD API [here](/docs/quick-start-guide). Using the bootstrap run CLI. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be fired by via `cd hudi-cli && ./hudi-cli.sh`. ```java hudi->bootstrap run --srcPath /tmp/source_table --targetPath /tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField ${PARTITION_FIELD} --sparkMaster local --hoodieConfigs hoodie.datasource.write.hive_style_partitioning=true --selectorClass org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector ``` -Unlike deltaStream, FULL_RECORD or METADATA_ONLY is set with --selectorClass, see detalis with help "bootstrap run". +Unlike Hudi Streamer, FULL_RECORD or METADATA_ONLY is set with --selectorClass, see details with help "bootstrap run". + + +## Configs + +Here are the basic configs that control bootstrapping. + +| Config Name | Default| Description | +| | -- | --- | +| hoodie.bootstrap.base.path | N/A **(Required)** | Base path of the dataset that needs to be bootstrapped as a Hudi table`Config Param: BASE_PATH``Since Version: 0.6.0` | + +By default, with only `hoodie.bootstrap.base.path` being provided METADATA_ONLY mode is selected. For other options, please refer [bootstrap configs](https://hudi.apache.org/docs/next/configurations#Bootstrap-Configs) for more details. Review Comment: will do -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on pull request #9001: [HUDI-6402] Hudi Spark3.3 and upper version need close JavaTimeModule For JsonUtils
xushiyan commented on PR #9001: URL: https://github.com/apache/hudi/pull/9001#issuecomment-1662902015 > need close JavaTimeModule For JsonUtils @xuzifu666 can you help me understand what the PR title means?
[GitHub] [hudi] hudi-bot commented on pull request #9276: [HUDI-6568] Hudi Spark Integration Redesign
hudi-bot commented on PR #9276: URL: https://github.com/apache/hudi/pull/9276#issuecomment-1662877299 ## CI report: * 662f3b320ab6ea06462bad9a4448add1ec2f380a UNKNOWN * 87e8f76e3d97d5b3b2fc10fe7704395575cc1b79 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19005) * f179c083ce951ed076bc382ee252c89d8e07d49d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19013) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] jonvex commented on a diff in pull request #9338: [DOCS] Update bootstrap page
jonvex commented on code in PR #9338: URL: https://github.com/apache/hudi/pull/9338#discussion_r1282347510 ## website/docs/migration_guide.md: ## @@ -56,11 +64,13 @@ spark-submit --master local \ --hoodie-conf hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \ --hoodie-conf hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider \ Review Comment: I don't think we need `hoodie-conf hoodie.bootstrap.full.input.provider` in the example ## website/docs/migration_guide.md: ## @@ -69,12 +79,28 @@ for partition in [list of partitions in source table] { } ``` -**Option 3** +**Option 3 using Spark SQL CALL Procedure** + +Refer to [Bootstrap procedure](https://hudi.apache.org/docs/next/procedures#bootstrap) for more details. + +**Option 4 using Hudi CLI** + Write your own custom logic of how to load an existing table into a Hudi managed one. Please read about the RDD API [here](/docs/quick-start-guide). Using the bootstrap run CLI. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be fired by via `cd hudi-cli && ./hudi-cli.sh`. ```java hudi->bootstrap run --srcPath /tmp/source_table --targetPath /tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField ${PARTITION_FIELD} --sparkMaster local --hoodieConfigs hoodie.datasource.write.hive_style_partitioning=true --selectorClass org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector ``` -Unlike deltaStream, FULL_RECORD or METADATA_ONLY is set with --selectorClass, see detalis with help "bootstrap run". +Unlike Hudi Streamer, FULL_RECORD or METADATA_ONLY is set with --selectorClass, see details with help "bootstrap run". + + +## Configs + +Here are the basic configs that control bootstrapping. 
+ +| Config Name | Default| Description | +| | -- | --- | +| hoodie.bootstrap.base.path | N/A **(Required)** | Base path of the dataset that needs to be bootstrapped as a Hudi table`Config Param: BASE_PATH``Since Version: 0.6.0` | + +By default, with only `hoodie.bootstrap.base.path` being provided METADATA_ONLY mode is selected. For other options, please refer [bootstrap configs](https://hudi.apache.org/docs/next/configurations#Bootstrap-Configs) for more details. Review Comment: I think adding `hoodie.bootstrap.mode.selector.regex.mode`, `hoodie.bootstrap.mode.selector`, `hoodie.bootstrap.mode.selector.regex` to the simple configs would be helpful. At a minimum at least `hoodie.bootstrap.mode.selector` should be added
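To make the selector configs discussed in this review concrete, here is a hedged sketch of what a partition-regex bootstrap configuration might look like, written in the same `key=value` form the migration guide passes via `--hoodie-conf`. The regex value `2021/.*` and the `BootstrapRegexModeSelector` class name are illustrative assumptions, not values taken from this thread; verify both against the bootstrap configs page before use.

```properties
# Required: path of the existing (non-Hudi) dataset to bootstrap
hoodie.bootstrap.base.path=/tmp/source_table
# Selector deciding FULL_RECORD vs METADATA_ONLY per partition
# (BootstrapRegexModeSelector is assumed here for illustration)
hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector
# Partitions matching the regex get the mode below
hoodie.bootstrap.mode.selector.regex=2021/.*
hoodie.bootstrap.mode.selector.regex.mode=METADATA_ONLY
```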