[jira] [Updated] (HUDI-1623) Support start_commit_time & end_commit_times for serializable incremental pull
[ https://issues.apache.org/jira/browse/HUDI-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-1623: - Reviewers: Vinoth Chandar > Support start_commit_time & end_commit_times for serializable incremental pull > -- > > Key: HUDI-1623 > URL: https://issues.apache.org/jira/browse/HUDI-1623 > Project: Apache Hudi > Issue Type: Improvement > Components: Common Core >Reporter: Nishith Agarwal >Assignee: Danny Chen >Priority: Critical > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] codope commented on a diff in pull request #8233: [HUDI-5956] Simple repair spark sql dag ui display problem
codope commented on code in PR #8233: URL: https://github.com/apache/hudi/pull/8233#discussion_r1298071684 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -123,6 +126,24 @@ object HoodieSparkSqlWriter { streamingWritesParamsOpt: Option[StreamingWriteParams] = Option.empty, hoodieWriteClient: Option[SparkRDDWriteClient[_]] = Option.empty): (Boolean, HOption[String], HOption[String], HOption[String], SparkRDDWriteClient[_], HoodieTableConfig) = { +//TODO reuse DataWritingCommand sparkPlan, reduce the number of sql list in SPARK UI SQL tag, rendering raw DAG Review Comment: Will it incur some overhead if we don't reuse? Why not complete the TODO in this PR itself? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9473: [HUDI-6724] - Defaulting previous Instant time to init time to enable full read of initial commit
hudi-bot commented on PR #9473: URL: https://github.com/apache/hudi/pull/9473#issuecomment-1683412312 ## CI report: * ccdd0648b943bb2f5c3325c69887f4d9d4d7a117 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9473: [HUDI-6724] - Defaulting previous Instant time to init time to enable full read of initial commit
hudi-bot commented on PR #9473: URL: https://github.com/apache/hudi/pull/9473#issuecomment-1683419250 ## CI report: * ccdd0648b943bb2f5c3325c69887f4d9d4d7a117 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19347)
[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
hudi-bot commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1683419217 ## CI report: * dba536eaf3fbc3cade137d7c9d24c705e8263ad9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19346) * 13a06dff7a03d861232980b79baf924e31d55ff7 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
hudi-bot commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1683379797 ## CI report: * dba536eaf3fbc3cade137d7c9d24c705e8263ad9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19346)
[GitHub] [hudi] lokesh-lingarajan-0310 opened a new pull request, #9473: [HUDI-6724] - Defaulting previous Instant time to init time to enable full read of initial commit
lokesh-lingarajan-0310 opened a new pull request, #9473: URL: https://github.com/apache/hudi/pull/9473 This will happen during new onboarding: the old code initializes prev = start = first-commit time, and the incremental read that follows always fetches entries > prev, in which case we skip part of the first commit during processing. ### Change Logs Initialize prevInstance of the commit to the default 000 to avoid skipping parts of the first commit. ### Impact Medium ### Risk level (write none, low medium or high below) Medium ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Change Logs and Impact were stated clearly - [x] Adequate tests were added if applicable - [x] CI passed
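The first-commit skip described above can be illustrated with a small sketch. This is plain Python, not Hudi code; the function name and the "000" sentinel are stand-ins for the behavior the PR describes:

```python
# Sketch of the bug described above (illustrative only, not Hudi code).
# An incremental read fetches entries strictly newer than prevInstance.
def incremental_read(commit_times, prev_instant):
    return [t for t in commit_times if t > prev_instant]

INIT_INSTANT_TS = "000"  # sentinel that sorts before any real commit time
commits = ["20230818010101", "20230818020202"]

# Old default: prevInstance = startInstance = first commit time,
# so batches belonging to the first commit are skipped.
old = incremental_read(commits, commits[0])

# Fixed default: prevInstance = INIT_INSTANT_TS, so the first
# commit is read in full.
fixed = incremental_read(commits, INIT_INSTANT_TS)

assert old == ["20230818020202"]   # first commit missing
assert fixed == commits            # full read of the initial commit
```

Because Hudi instant times are sortable timestamp strings, any sentinel that compares lower than every real commit time makes the exclusive `> prev` filter include the first commit.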
[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
hudi-bot commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1683373888 ## CI report: * dba536eaf3fbc3cade137d7c9d24c705e8263ad9 UNKNOWN
[GitHub] [hudi] jiangzzwy opened a new issue, #9474: ClassNotFoundException: MergeOnReadInputSplit
jiangzzwy opened a new issue, #9474: URL: https://github.com/apache/hudi/issues/9474

### Environment
- Flink: 1.17.1
- Hudi: 0.14.0-rc1
- Hadoop: 3.2.2

### init.sql script

```sql
SET 'state.checkpoints.dir' = 'hdfs:///hudi/checkpoints/';
SET 'execution.checkpointing.interval' = '20s';
SET 'execution.checkpointing.min-pause' = '5s';
SET 'execution.checkpointing.max-concurrent-checkpoints' = '1';
add jar '/export/server/flink-1.17.1/hudi-flink1.17-bundle-0.14.0-rc1.jar';
create table t_hudi_user(
  id BIGINT,
  name STRING,
  age INT,
  sex BOOLEAN,
  city STRING,
  birth timestamp(3)
) PARTITIONED BY (birth) WITH (
  'connector' = 'hudi',
  'hoodie.datasource.write.recordkey.field' = 'id',
  'path' = 'hdfs://CentOS:9000/hudi/t_hudi_user',
  'table.type' = 'MERGE_ON_READ',
  'compaction.trigger.strategy' = 'num_or_time',
  'compaction.delta_commits' = '3',
  'compaction.delta_seconds' = '300',
  'hoodie.datasource.write.hive_style_partitioning' = 'true',
  'write.datetime.partitioning' = 'true',
  'write.partition.format' = 'yyyy-MM-dd',
  'hive_sync.assume_date_partitioning' = 'true',
  'hive_sync.mode' = 'hms',
  'write.precombine.field' = 'birth',
  'changelog.enabled' = 'true',
  'read.streaming.enabled' = 'true',
  'read.streaming.check-interval' = '3',
  'compaction.tasks' = '2',
  'hive_sync.enable' = 'true',
  'hive_sync.table' = 't_hudi_user',
  'hive_sync.db' = 'default',
  'hive_sync.metastore.uris' = 'thrift://192.168.42.129:9083',
  'hoodie.datasource.hive_sync.support_timestamp' = 'true'
);
```

When I execute the query command shown in the screenshots below, the console terminal raises the error `java.lang.ClassNotFoundException: org.apache.hudi.table.format.mor.MergeOnReadInputSplit`. I'm sure the `MergeOnReadInputSplit` class is already compiled into the `hudi-flink1.17-bundle-0.14.0-rc1.jar` jar file.

![image](https://github.com/apache/hudi/assets/23492991/33062891-da14-40cd-b591-2c37575a129f)
![image](https://github.com/apache/hudi/assets/23492991/d00114af-aaae-4212-a7b1-a96d64a940f6)

But inserting is okay, querying is not, which makes me feel very strange!!
![image](https://github.com/apache/hudi/assets/23492991/f5d506ae-47d9-4cef-b5dc-a4d55c68c0d5)

I tried the lower Flink version 1.14.x, which does not have this problem.
[jira] [Assigned] (HUDI-3625) [RFC-60] Optimized storage layout for cloud object stores
[ https://issues.apache.org/jira/browse/HUDI-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shawn Chang reassigned HUDI-3625: - Assignee: Shawn Chang (was: Udit Mehrotra) > [RFC-60] Optimized storage layout for cloud object stores > - > > Key: HUDI-3625 > URL: https://issues.apache.org/jira/browse/HUDI-3625 > Project: Apache Hudi > Issue Type: Epic > Components: core >Reporter: Udit Mehrotra >Assignee: Shawn Chang >Priority: Major > Labels: hudi-umbrellas, pull-request-available > Fix For: 1.0.0 > > > Amazon S3, among other cloud object stores, throttles requests based on object > prefix => > [https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/]. > Hudi follows the traditional Hive storage layout, with files being stored > under separate partition paths under a common table path/prefix. This > introduces the potential for throttling because of request limits being > reached for the common table path/prefix, when writing a significant number of > files concurrently. > We propose implementing an alternate storage layout that would be more > suitable for cloud object stores like S3 to avoid running into throttling > issues as the data scales. At a high level, we need to be able to distribute > data files evenly across randomly generated prefixes, so that request limits > get distributed across those prefixes, instead of a single table prefix. -- This message was sent by Atlassian Jira (v8.20.10#820010)
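The prefix-spreading idea can be sketched as follows. This is illustrative Python, not the RFC-60 implementation; the path scheme, hash choice, and bucket count are invented for the example:

```python
# Sketch of the randomized-prefix idea (not the actual RFC-60 layout):
# derive a stable pseudo-random bucket from the file name so writes are
# spread across many object-store prefixes instead of one table prefix.
import hashlib

def prefixed_path(table_root, partition, file_name, num_prefixes=16):
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % num_prefixes
    # A Hive-style layout would be f"{table_root}/{partition}/{file_name}";
    # here the bucket prefix comes first, so per-prefix request limits are
    # spread across up to num_prefixes distinct prefixes.
    return f"{table_root}/{bucket:02x}/{partition}/{file_name}"

path = prefixed_path("s3://bucket/table", "dt=2023-08-18", "file-a.parquet")
# The mapping is deterministic, so readers can recompute a file's prefix
# without a lookup table.
assert path == prefixed_path("s3://bucket/table", "dt=2023-08-18", "file-a.parquet")
```

The key property is that the bucket is a deterministic function of the file name, so the layout stays reconstructible while write traffic is spread across prefixes.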
[jira] [Updated] (HUDI-6724) Initializing prevInstance to HoodieTimeline.INIT_INSTANT_TS to avoid partial reading of first commit
[ https://issues.apache.org/jira/browse/HUDI-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6724: - Labels: pull-request-available (was: ) > Initializing prevInstance to HoodieTimeline.INIT_INSTANT_TS to avoid partial > reading of first commit > > > Key: HUDI-6724 > URL: https://issues.apache.org/jira/browse/HUDI-6724 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Lingarajan >Priority: Major > Labels: pull-request-available > > Since object-based incr jobs now have batching within the commit, we can > end up in a situation for the first commit where prevInstance is the same as > startInstance according to the existing code for batches within the first commit. > In this scenario, when we incrementally query rows > prevInstance, we will skip > the first commit as startInstance is also pointing to the same commit. > This is due to defaulting prevInstance to startInstance in the > generateQueryInfo API. > The fix is to have this default to HoodieTimeline.INIT_INSTANT_TS so batching can > continue on the first commit. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6724) Initializing prevInstance to HoodieTimeline.INIT_INSTANT_TS to avoid partial reading of first commit
Lokesh Lingarajan created HUDI-6724: --- Summary: Initializing prevInstance to HoodieTimeline.INIT_INSTANT_TS to avoid partial reading of first commit Key: HUDI-6724 URL: https://issues.apache.org/jira/browse/HUDI-6724 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Lingarajan Since object-based incr jobs now have batching within the commit, we can end up in a situation for the first commit where prevInstance is the same as startInstance according to the existing code for batches within the first commit. In this scenario, when we incrementally query rows > prevInstance, we will skip the first commit as startInstance is also pointing to the same commit. This is due to defaulting prevInstance to startInstance in the generateQueryInfo API. The fix is to have this default to HoodieTimeline.INIT_INSTANT_TS so batching can continue on the first commit. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6723) Prototype and benchmark event-time based in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6723: Fix Version/s: 1.0.0 > Prototype and benchmark event-time based in MOR log merging > --- > > Key: HUDI-6723 > URL: https://issues.apache.org/jira/browse/HUDI-6723 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6723) Prototype and benchmark event-time based in MOR log merging
Ethan Guo created HUDI-6723: --- Summary: Prototype and benchmark event-time based in MOR log merging Key: HUDI-6723 URL: https://issues.apache.org/jira/browse/HUDI-6723 Project: Apache Hudi Issue Type: New Feature Reporter: Ethan Guo -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6723) Prototype and benchmark event-time based in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6723: --- Assignee: Ethan Guo > Prototype and benchmark event-time based in MOR log merging > --- > > Key: HUDI-6723 > URL: https://issues.apache.org/jira/browse/HUDI-6723 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6719) Fix data inconsistency issues caused by concurrent clustering and delete partition.
[ https://issues.apache.org/jira/browse/HUDI-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6719: - Labels: pull-request-available (was: ) > Fix data inconsistency issues caused by concurrent clustering and delete > partition. > --- > > Key: HUDI-6719 > URL: https://issues.apache.org/jira/browse/HUDI-6719 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ma Jian >Priority: Major > Labels: pull-request-available > > Related issue: https://issues.apache.org/jira/browse/HUDI-5553 > The specific problem is that when concurrent replace commit operations are > executed, two replace commits may point to the same file ID, resulting in a > duplicate key error. The existing issue solved the problem of scheduling > delete partition while there are pending clustering or compaction operations, > which will be prevented in this case. However, this solution is not perfect > and may still cause data inconsistency if a clustering plan is scheduled > before the delete partition is committed, because the validation is one-way. In > this case, both replace commits will still contain duplicate keys, and the > table will become inconsistent when both plans are committed. This is > critical, and there are other similar scenarios that may bypass the validation > of the existing issue. Moreover, the existing issue is at the partition level > and is not precise enough. > Here is my solution: > !https://intranetproxy.alipay.com/skylark/lark/0/2023/png/62256341/1692328998008-f9dc6530-e44e-43e7-9b75-d760b55b3dfa.png|width=335,id=WXCCX! > As shown in the figure, both drop partition and clustering will go through a > period of time during which they are not registered to the timeline, which is the scenario > that the previous issue did not solve.
Here, I register the replace file IDs > involved in each replace commit to the active timeline (the replace commit > timeline that has been submitted has saved partitionToReplaceFileIds, and > only pending requests need to be processed). Since in the case of Spark SQL, > delete partition creates a requested commit in advance during write, which is > inconvenient to handle, I save the pending replace commit's > partitionToReplaceFileIds information to the inflight commit's extra > metadata. Therefore, each time drop partition or clustering is executed, it > only needs to read the partitionToReplaceFileIds information in the timeline > after ensuring that the inflight commit information has been saved to the > timeline to ensure that there are no duplicate file IDs and prevent this kind > of error from occurring. > In simple terms, each replace commit will register the replace file ID > information to the timeline whether it is submitted or not, at the same time, > each submission will check this information to ensure that it will not be > repeated, so that any replace commit containing this file ID will be > prevented, ensuring that there are no duplicate keys. > When this idea is also implemented on the compaction commit, the modification > involved in the related issue can be removed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
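The duplicate-file-ID guard described above can be sketched as follows. This is illustrative Python, not the actual Hudi timeline API; the class and method names are invented for the example:

```python
# Sketch of the conflict check described above (illustrative, not Hudi
# code): every replace commit, pending or completed, registers the file
# IDs it replaces, and a new replace plan is rejected if it targets a
# file ID already claimed by another replace commit.
class ReplaceCommitRegistry:
    def __init__(self):
        self._claimed = {}  # instant -> set of replaced file IDs

    def register(self, instant, file_ids):
        file_ids = set(file_ids)
        for other, ids in self._claimed.items():
            overlap = ids & file_ids
            if overlap:
                raise ValueError(
                    f"{instant} targets file IDs {sorted(overlap)} "
                    f"already claimed by {other}")
        self._claimed[instant] = file_ids

registry = ReplaceCommitRegistry()
registry.register("t1.clustering", ["fg-1", "fg-2"])  # pending clustering plan
try:
    # concurrent delete-partition plan touching an overlapping file group
    registry.register("t2.deletepartition", ["fg-2", "fg-3"])
    rejected = False
except ValueError:
    rejected = True
assert rejected  # the overlapping replace plan is rejected
```

Because the check runs symmetrically at registration time for every replace commit, it closes the one-way-validation gap the description calls out: whichever of the two conflicting plans registers second is the one rejected.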
[jira] [Updated] (HUDI-6723) Prototype and benchmark event-time based in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6723: Status: In Progress (was: Open) > Prototype and benchmark event-time based in MOR log merging > --- > > Key: HUDI-6723 > URL: https://issues.apache.org/jira/browse/HUDI-6723 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] majian1998 opened a new pull request, #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
majian1998 opened a new pull request, #9472: URL: https://github.com/apache/hudi/pull/9472 ### Change Logs Implemented a solution to prevent duplicate key errors in concurrent replace commit operations. Registered the replace file ID information to the timeline for each replace commit, whether it is submitted or not. Saved the pending replace commit's partitionToReplaceFileIds information to the inflight commit's extra metadata. Updated drop partition and clustering operations to read the partitionToReplaceFileIds information in the timeline to ensure no duplicate file IDs. Removed the modification involved in the related issue for compaction commit. ### Impact No public API or user-facing feature changes. ### Risk level (write none, low medium or high below) low ### Documentation Update Related issue: https://issues.apache.org/jira/browse/HUDI-5553 The specific problem is that when concurrent replace commit operations are executed, two replace commits may point to the same file ID, resulting in a duplicate key error. The existing issue solved the problem of scheduling delete partition while there are pending clustering or compaction operations, which will be prevented in this case. However, this solution is not perfect and may still cause data inconsistency if a clustering plan is scheduled before the delete partition is committed, because the validation is one-way. In this case, both replace commits will still contain duplicate keys, and the table will become inconsistent when both plans are committed. This is critical, and there are other similar scenarios that may bypass the validation of the existing issue. Moreover, the existing issue is at the partition level and is not precise enough.
Here is my solution: ![image](https://github.com/apache/hudi/assets/47964462/6d8a3134-96a5-45ec-8ed0-ed2776b7ed24) As shown in the figure, both drop partition and clustering will go through a period of time during which they are not registered to the timeline, which is the scenario that the previous issue did not solve. Here, I register the replace file IDs involved in each replace commit to the active timeline (the replace commit timeline that has been submitted has saved partitionToReplaceFileIds, and only pending requests need to be processed). Since in the case of Spark SQL, delete partition creates a requested commit in advance during write, which is inconvenient to handle, I save the pending replace commit's partitionToReplaceFileIds information to the inflight commit's extra metadata. Therefore, each time drop partition or clustering is executed, it only needs to read the partitionToReplaceFileIds information in the timeline after ensuring that the inflight commit information has been saved to the timeline, to ensure that there are no duplicate file IDs and prevent this kind of error from occurring. In simple terms, each replace commit will register the replace file ID information to the timeline whether it is submitted or not; at the same time, each submission will check this information to ensure that it will not be repeated, so that any replace commit containing this file ID will be prevented, ensuring that there are no duplicate keys. When this idea is also implemented on the compaction commit, the modification involved in the related issue can be removed. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Updated] (HUDI-6539) New LSM tree style archived timeline
[ https://issues.apache.org/jira/browse/HUDI-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6539: - Status: Patch Available (was: In Progress) > New LSM tree style archived timeline > > > Key: HUDI-6539 > URL: https://issues.apache.org/jira/browse/HUDI-6539 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6539) New LSM tree style archived timeline
[ https://issues.apache.org/jira/browse/HUDI-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6539: - Reviewers: Vinoth Chandar > New LSM tree style archived timeline > > > Key: HUDI-6539 > URL: https://issues.apache.org/jira/browse/HUDI-6539 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6539) New LSM tree style archived timeline
[ https://issues.apache.org/jira/browse/HUDI-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6539: - Status: In Progress (was: Open) > New LSM tree style archived timeline > > > Key: HUDI-6539 > URL: https://issues.apache.org/jira/browse/HUDI-6539 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6719) Fix data inconsistency issues caused by concurrent clustering and delete partition.
[ https://issues.apache.org/jira/browse/HUDI-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ma Jian updated HUDI-6719: -- Description: Related issue: https://issues.apache.org/jira/browse/HUDI-5553 The specific problem is that when concurrent replace commit operations are executed, two replace commits may point to the same file ID, resulting in a duplicate key error. The existing issue solved the problem of scheduling delete partition while there are pending clustering or compaction operations, which will be prevented in this case. However, this solution is not perfect and may still cause data inconsistency if a clustering plan is scheduled before the delete partition is committed, because the validation is one-way. In this case, both replace commits will still contain duplicate keys, and the table will become inconsistent when both plans are committed. This is critical, and there are other similar scenarios that may bypass the validation of the existing issue. Moreover, the existing issue is at the partition level and is not precise enough. Here is my solution: !https://intranetproxy.alipay.com/skylark/lark/0/2023/png/62256341/1692328998008-f9dc6530-e44e-43e7-9b75-d760b55b3dfa.png|width=335,id=WXCCX! As shown in the figure, both drop partition and clustering will go through a period of time during which they are not registered to the timeline, which is the scenario that the previous issue did not solve. Here, I register the replace file IDs involved in each replace commit to the active timeline (the replace commit timeline that has been submitted has saved partitionToReplaceFileIds, and only pending requests need to be processed). Since in the case of Spark SQL, delete partition creates a requested commit in advance during write, which is inconvenient to handle, I save the pending replace commit's partitionToReplaceFileIds information to the inflight commit's extra metadata.
Therefore, each time drop partition or clustering is executed, it only needs to read the partitionToReplaceFileIds information in the timeline after ensuring that the inflight commit information has been saved to the timeline, to ensure that there are no duplicate file IDs and prevent this kind of error from occurring. In simple terms, each replace commit will register the replace file ID information to the timeline whether it is submitted or not; at the same time, each submission will check this information to ensure that it will not be repeated, so that any replace commit containing this file ID will be prevented, ensuring that there are no duplicate keys. When this idea is also implemented on the compaction commit, the modification involved in the related issue can be removed. was: Related issue: https://issues.apache.org/jira/browse/HUDI-5553 The specific problem is that when concurrent replace commit operations are executed, two replace commits may point to the same file ID, resulting in a duplicate key error. The existing issue solved the problem of scheduling delete partition while there are pending clustering or compaction operations, which will be prevented in this case. However, this solution is not perfect and may still cause data inconsistency if a clustering plan is scheduled before the delete partition is committed, because the validation is one-way. In this case, both replace commits will still contain duplicate keys, and the table will become inconsistent when both plans are committed. This is critical, and there are other similar scenarios that may bypass the validation of the existing issue. Moreover, the existing issue is at the partition level and is not precise enough. Here is my solution: !https://intranetproxy.alipay.com/skylark/lark/0/2023/png/62256341/1692328998008-f9dc6530-e44e-43e7-9b75-d760b55b3dfa.png|width=335,id=WXCCX!
As shown in the figure, both drop partition and clustering will go through a period of time during which they are not registered to the timeline, which is the scenario that the previous issue did not solve. Here, I register the replace file IDs involved in each replace commit to the active timeline (the replace commit timeline that has been submitted has saved partitionToReplaceFileIds, and only pending requests need to be processed). Since in the case of Spark SQL, delete partition creates a requested commit in advance during write, which is inconvenient to handle, I save the pending replace commit's partitionToReplaceFileIds information to the inflight commit's extra metadata. Therefore, each time drop partition or clustering is executed, it only needs to read the partitionToReplaceFileIds information in the timeline after ensuring that the inflight commit information has been saved to the timeline, to ensure that there are no duplicate file IDs and prevent this kind of error from occurring. In simple terms, each replace commit will register the replace file ID information to the timeline.
[GitHub] [hudi] hudi-bot commented on pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
hudi-bot commented on PR #9459: URL: https://github.com/apache/hudi/pull/9459#issuecomment-1683342280 ## CI report: * 170678f0e7c429406a4565d85e77367908c1fb4b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19340) * 13cd8f29dd7aceccb83a9a44aa464d70d55bb57c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19345)
[jira] [Updated] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes in MOR
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6720: Status: In Progress (was: Open) > Prototype and benchmark position- and key-based updates and deletes in MOR > -- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6721) Prototype and benchmark partial updates in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6721: Fix Version/s: 1.0.0 > Prototype and benchmark partial updates in MOR log merging > -- > > Key: HUDI-6721 > URL: https://issues.apache.org/jira/browse/HUDI-6721 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6721) Prototype and benchmark partial updates in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6721: Epic Link: HUDI-6722 > Prototype and benchmark partial updates in MOR log merging > -- > > Key: HUDI-6721 > URL: https://issues.apache.org/jira/browse/HUDI-6721 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6721) Prototype and benchmark partial updates in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6721: Status: In Progress (was: Open) > Prototype and benchmark partial updates in MOR log merging > -- > > Key: HUDI-6721 > URL: https://issues.apache.org/jira/browse/HUDI-6721 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes in MOR
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6720: Fix Version/s: 1.0.0 > Prototype and benchmark position- and key-based updates and deletes in MOR > -- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes in MOR
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6720: Epic Link: HUDI-6722 > Prototype and benchmark position- and key-based updates and deletes in MOR > -- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5386) Cleaning conflicts in occ mode
[ https://issues.apache.org/jira/browse/HUDI-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-5386: -- Epic Link: HUDI-1456 Fix Version/s: 0.14.0 0.14.1 > Cleaning conflicts in occ mode > -- > > Key: HUDI-5386 > URL: https://issues.apache.org/jira/browse/HUDI-5386 > Project: Apache Hudi > Issue Type: Bug >Reporter: HunterXHunter >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0, 0.14.1 > > Attachments: image-2022-12-14-11-26-21-995.png, > image-2022-12-14-11-26-37-252.png > > > {code:java} > configuration parameter: > 'hoodie.cleaner.policy.failed.writes' = 'LAZY' > 'hoodie.write.concurrency.mode' = 'optimistic_concurrency_control' {code} > Because `getInstantsToRollback` is not locked, multiple writers get the same > `instantsToRollback`; the same `instant` will be deleted multiple times and > the same `rollback.inflight` will be created multiple times. > !image-2022-12-14-11-26-37-252.png! > !image-2022-12-14-11-26-21-995.png!
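The race described in HUDI-5386 (an unguarded `getInstantsToRollback` letting concurrent lazy-cleaning writers pick up the same instant) can be sketched as a toy guard. This is illustrative only, not Hudi's actual transaction-manager or clean-planner code; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

public class RollbackGuard {
  private final ReentrantLock tableLock = new ReentrantLock();
  // Instants currently claimed for rollback by some writer.
  private final Set<String> inflightRollbacks = new HashSet<>();

  /**
   * Returns the instants this writer may roll back, claiming them under the
   * table lock so two concurrent writers never pick the same instant.
   */
  public List<String> claimInstantsToRollback(List<String> failedInstants) {
    tableLock.lock();
    try {
      List<String> claimed = new ArrayList<>();
      for (String instant : failedInstants) {
        // add() returns false if another writer already claimed this instant.
        if (inflightRollbacks.add(instant)) {
          claimed.add(instant);
        }
      }
      return claimed;
    } finally {
      tableLock.unlock();
    }
  }
}
```

With this pattern, a second writer that observes the same failed instants claims only the ones not already taken, so no `rollback.inflight` is created twice.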
[jira] [Created] (HUDI-6722) Performance and API improvement on record merging
Ethan Guo created HUDI-6722: --- Summary: Performance and API improvement on record merging Key: HUDI-6722 URL: https://issues.apache.org/jira/browse/HUDI-6722 Project: Apache Hudi Issue Type: New Feature Reporter: Ethan Guo
[jira] [Updated] (HUDI-6722) Performance and API improvement on record merging
[ https://issues.apache.org/jira/browse/HUDI-6722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6722: Fix Version/s: 1.0.0 > Performance and API improvement on record merging > - > > Key: HUDI-6722 > URL: https://issues.apache.org/jira/browse/HUDI-6722 > Project: Apache Hudi > Issue Type: Epic >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6722) Performance and API improvement on record merging
[ https://issues.apache.org/jira/browse/HUDI-6722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6722: Issue Type: Epic (was: New Feature) > Performance and API improvement on record merging > - > > Key: HUDI-6722 > URL: https://issues.apache.org/jira/browse/HUDI-6722 > Project: Apache Hudi > Issue Type: Epic >Reporter: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6722) Performance and API improvement on record merging
[ https://issues.apache.org/jira/browse/HUDI-6722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6722: --- Assignee: Ethan Guo > Performance and API improvement on record merging > - > > Key: HUDI-6722 > URL: https://issues.apache.org/jira/browse/HUDI-6722 > Project: Apache Hudi > Issue Type: Epic >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
hudi-bot commented on PR #9459: URL: https://github.com/apache/hudi/pull/9459#issuecomment-1683337770 ## CI report: * 170678f0e7c429406a4565d85e77367908c1fb4b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19340) * 13cd8f29dd7aceccb83a9a44aa464d70d55bb57c UNKNOWN
[jira] [Created] (HUDI-6721) Prototype and benchmark partial updates in MOR log merging
Ethan Guo created HUDI-6721: --- Summary: Prototype and benchmark partial updates in MOR log merging Key: HUDI-6721 URL: https://issues.apache.org/jira/browse/HUDI-6721 Project: Apache Hudi Issue Type: New Feature Reporter: Ethan Guo
[jira] [Assigned] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6720: --- Assignee: Ethan Guo > Prototype and benchmark position- and key-based updates and deletes > --- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes in MOR
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6720: Summary: Prototype and benchmark position- and key-based updates and deletes in MOR (was: Prototype and benchmark position- and key-based updates and deletes) > Prototype and benchmark position- and key-based updates and deletes in MOR > -- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6721) Prototype and benchmark partial updates in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6721: --- Assignee: Ethan Guo > Prototype and benchmark partial updates in MOR log merging > -- > > Key: HUDI-6721 > URL: https://issues.apache.org/jira/browse/HUDI-6721 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6720: Issue Type: New Feature (was: Task) > Prototype and benchmark position- and key-based updates and deletes > --- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6720: Summary: Prototype and benchmark position- and key-based updates and deletes (was: Benchmark position- and key-based updates and deletes) > Prototype and benchmark position- and key-based updates and deletes > --- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: Task >Reporter: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6720) Benchmark position- and key-based updates and deletes
Ethan Guo created HUDI-6720: --- Summary: Benchmark position- and key-based updates and deletes Key: HUDI-6720 URL: https://issues.apache.org/jira/browse/HUDI-6720 Project: Apache Hudi Issue Type: Task Reporter: Ethan Guo
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1683332246 ## CI report: * a0db166250fe0220494b18b0c0d343d1a3adae7b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19342)
[jira] [Created] (HUDI-6719) Fix data inconsistency issues caused by concurrent clustering and delete partition.
Ma Jian created HUDI-6719: - Summary: Fix data inconsistency issues caused by concurrent clustering and delete partition. Key: HUDI-6719 URL: https://issues.apache.org/jira/browse/HUDI-6719 Project: Apache Hudi Issue Type: Bug Reporter: Ma Jian Related issue: https://issues.apache.org/jira/browse/HUDI-5553 The specific problem is that when concurrent replace commit operations are executed, two replace commits may point to the same file ID, resulting in a duplicate key error. The existing issue solved the problem of scheduling delete partition while there are pending clustering or compaction operations, which will be prevented in this case. However, this solution is not perfect and may still cause data inconsistency if a clustering plan is scheduled before the delete partition is committed, because the validation is one-way. In this case, both replace commits will still contain duplicate keys, and the table will become inconsistent when both plans are committed. This is a critical problem, and there are other similar scenarios that may bypass the validation of the existing issue. Moreover, the existing issue operates at the partition level and is not precise enough. Here is my solution: !https://intranetproxy.alipay.com/skylark/lark/0/2023/png/62256341/1692328998008-f9dc6530-e44e-43e7-9b75-d760b55b3dfa.png|width=335,id=WXCCX! As shown in the figure, both drop partition and clustering go through a period of time during which they are not registered to the timeline, which is the scenario that the previous issue did not solve. Here, I register the replace file IDs involved in each replace commit to the active timeline (replace commits that have already been committed have saved partitionToReplaceFileIds, so only pending requests need to be processed). 
Since in the case of Spark SQL, delete partition creates a requested commit in advance during the write, which is inconvenient to handle, I save the pending replace commit's partitionToReplaceFileIds information to the inflight commit's extra metadata. Therefore, each time a drop partition or clustering is executed, it only needs to read the partitionToReplaceFileIds information in the timeline (after ensuring that the inflight commit information has been saved to the timeline) to verify that there are no duplicate file IDs and prevent this kind of error. In simple terms, each replace commit registers its replaced file ID information to the timeline whether or not it has been committed; at the same time, each submission checks this information to ensure it is not repeated, so any replace commit containing an already-claimed file ID is blocked, guaranteeing that there are no duplicate keys.
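The register-then-check idea described in this issue can be sketched as a toy registry (hypothetical names, not Hudi's actual timeline API): every replace commit, pending or completed, publishes its replaced file IDs, and a new replace plan is rejected when any of its file IDs is already claimed:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ReplaceCommitRegistry {
  // Instant time -> file IDs that replace commit targets (pending and completed alike).
  private final Map<String, Set<String>> replacedFileIds = new HashMap<>();

  /** Registers a replace commit, rejecting it if any file ID is already targeted. */
  public synchronized void register(String instantTime, Set<String> fileIds) {
    Set<String> claimed = new HashSet<>();
    replacedFileIds.values().forEach(claimed::addAll);
    for (String id : fileIds) {
      if (claimed.contains(id)) {
        throw new IllegalStateException(
            "File ID " + id + " is already targeted by another replace commit");
      }
    }
    replacedFileIds.put(instantTime, new HashSet<>(fileIds));
  }
}
```

Because both drop-partition and clustering plans must pass through `register` before executing, a second replace commit targeting the same file ID fails fast instead of producing duplicate keys later.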
[GitHub] [hudi] prathit06 commented on a diff in pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
prathit06 commented on code in PR #9459: URL: https://github.com/apache/hudi/pull/9459#discussion_r1297953678 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java: ## @@ -479,4 +479,19 @@ private Map getGroupOffsets(KafkaConsumer consumer, Set
[GitHub] [hudi] yyh2954360585 closed issue #9471: [SUPPORT] When using DeltaStreamer JdbcSource to extract data, there are issues with data loss and slow queries of source-side data
yyh2954360585 closed issue #9471: [SUPPORT] When using DeltaStreamer JdbcSource to extract data, there are issues with data loss and slow queries of source-side data URL: https://github.com/apache/hudi/issues/9471
[GitHub] [hudi] yyh2954360585 opened a new issue, #9471: [SUPPORT] When using DeltaStreamer JdbcSource to extract data, there are issues with data loss and slow queries of source-side data
yyh2954360585 opened a new issue, #9471: URL: https://github.com/apache/hudi/issues/9471 **Describe the problem you faced** Q1: Assume the source table `order` has a total data volume of 5 million rows and is synchronized using the DeltaStreamer JdbcSource with this Hudi conf: ` --hoodie-conf hoodie.deltastreamer.jdbc.incr.pull=true` `--hoodie-conf hoodie.deltastreamer.jdbc.table.incr.column.name=update_date` `--source-limit 10` `--continuous` When DeltaStreamer has synchronized 400,000 (40w) rows, the current lastCheckpoint=2023-08-17 14:55 0:00:00 so the SQL built by the incrementalFetch method to query the source data is: `select (select * from order where update_date>"2023-08-17 14:55 0:00:00" order by update_date limit 10) rdbms_table` Assuming 20 rows in the order table have an update_date equal to "2023-08-17 14:55 1:00:000", only 10 rows of data will be obtained because sourceLimit=10, and the other 10 rows will be lost. Q2: Why are these two parameters set? **Environment Description** * Hudi version :0.13.1 * Spark version :3.2.1 * Hive version :3.1.3 * Hadoop version :3.3.3 * Storage (HDFS/S3/GCS..) :HDFS * Running on Docker? (yes/no) :no
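The data-loss scenario in Q1 can be reproduced in miniature. The sketch below is illustrative only (it is not DeltaStreamer code): with a strict `>` checkpoint filter plus a row limit, the rows sharing the boundary timestamp that fall past the limit are never fetched again, because the next checkpoint advances to that tied value:

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class IncrPullDemo {
  /** Simulates: select * from t where update_date > checkpoint order by update_date limit n */
  static List<Long> pull(List<Long> rows, long checkpoint, int limit) {
    return rows.stream()
        .filter(ts -> ts > checkpoint)
        .sorted()
        .limit(limit)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // 20 rows, all with the same update_date = 100; previous checkpoint = 99.
    List<Long> rows = Collections.nCopies(20, 100L);
    List<Long> batch = pull(rows, 99L, 10);            // only 10 of the 20 tied rows fetched
    long nextCheckpoint = batch.get(batch.size() - 1); // checkpoint advances to 100
    List<Long> nextBatch = pull(rows, nextCheckpoint, 10);
    System.out.println(batch.size() + " then " + nextBatch.size()); // prints: 10 then 0
  }
}
```

The remaining 10 tied rows are skipped forever, which matches the loss described above when the number of rows sharing one checkpoint value exceeds the source limit.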
[GitHub] [hudi] zyclove opened a new issue, #9470: [SUPPORT] spark-sql hudi 0.12.3 Caused by: org.apache.avro.AvroTypeException: Found long, expecting union
zyclove opened a new issue, #9470: URL: https://github.com/apache/hudi/issues/9470 A Spark SQL query on a Hudi table fails with the errors below when running: `select count(1) from hudi_table;` **To Reproduce** Steps to reproduce the behavior: 1. create a Hudi MOR table 2. write data 3. query counts 4. error as follows ``` 1. Caused by: org.apache.avro.AvroTypeException: Found long, expecting union 2. Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong ``` **Environment Description** * Hudi version :0.12.3 * Spark version :3.2.1 * Hive version :3.1.2 * Hadoop version :3.2.2 * Storage (HDFS/S3/GCS..) :s3 * Running on Docker? (yes/no) :no **Stacktrace** ``` 23/08/18 10:51:07 INFO TaskSetManager: Starting task 8.0 in stage 1.0 (TID 43) (172.30.15.96, executor 4, partition 8, PROCESS_LOCAL, 5126 bytes) taskResourceAssignments Map() 23/08/18 10:51:07 WARN TaskSetManager: Lost task 5.0 in stage 1.0 (TID 40) (172.30.15.96 executor 4): org.apache.hudi.exception.HoodieException: Exception when reading log file at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:377) at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:220) at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:209) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:113) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:106) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:343) at 
org.apache.hudi.LogFileIterator$.scanLog(LogFileIterator.scala:305) at org.apache.hudi.LogFileIterator.<init>(LogFileIterator.scala:89) at org.apache.hudi.RecordMergingFileIterator.<init>(LogFileIterator.scala:180) at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:104) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.avro.AvroTypeException: Found long, expecting union at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308) at org.apache.avro.io.parsing.Parser.advance(Parser.java:86) at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:275) at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:187) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160) at 
org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259) at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247) at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.j
[GitHub] [hudi] danny0405 commented on a diff in pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive
danny0405 commented on code in PR #9416: URL: https://github.com/apache/hudi/pull/9416#discussion_r1297928994 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java: ## @@ -452,107 +431,137 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() throws IOException { ? CompactionUtils.getOldestInstantToRetainForCompaction( table.getActiveTimeline(), config.getInlineCompactDeltaCommitMax()) : Option.empty(); + oldestInstantToRetainCandidates.add(oldestInstantToRetainForCompaction); - // The clustering commit instant can not be archived unless we ensure that the replaced files have been cleaned, + // 3. The clustering commit instant can not be archived unless we ensure that the replaced files have been cleaned, // without the replaced files metadata on the timeline, the fs view would expose duplicates for readers. // Meanwhile, when inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline // to check whether the file slice generated in pending clustering after archive isn't committed. Option<HoodieInstant> oldestInstantToRetainForClustering = ClusteringUtils.getOldestInstantToRetainForClustering(table.getActiveTimeline(), table.getMetaClient()); + oldestInstantToRetainCandidates.add(oldestInstantToRetainForClustering); + + // 4. If metadata table is enabled, do not archive instants which are more recent than the last compaction on the + // metadata table. 
+ if (table.getMetaClient().getTableConfig().isMetadataTableAvailable()) { +try (HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(table.getContext(), config.getMetadataConfig(), config.getBasePath())) { + Option<String> latestCompactionTime = tableMetadata.getLatestCompactionTime(); + if (!latestCompactionTime.isPresent()) { +LOG.info("Not archiving as there is no compaction yet on the metadata table"); +return Collections.emptyList(); + } else { +LOG.info("Limiting archiving of instants to latest compaction on metadata table at " + latestCompactionTime.get()); +oldestInstantToRetainCandidates.add(Option.of(new HoodieInstant( +HoodieInstant.State.COMPLETED, COMPACTION_ACTION, latestCompactionTime.get()))); + } +} catch (Exception e) { + throw new HoodieException("Error limiting instant archival based on metadata table", e); +} + } + + // 5. If this is a metadata table, do not archive the commits that live in data set + // active timeline. This is required by metadata table, + // see HoodieTableMetadataUtil#processRollbackMetadata for details. + if (table.isMetadataTable()) { +HoodieTableMetaClient dataMetaClient = HoodieTableMetaClient.builder() + .setBasePath(HoodieTableMetadata.getDatasetBasePath(config.getBasePath())) +.setConf(metaClient.getHadoopConf()) +.build(); +Option<HoodieInstant> qualifiedEarliestInstant = +TimelineUtils.getEarliestInstantForMetadataArchival( +dataMetaClient.getActiveTimeline(), config.shouldArchiveBeyondSavepoint()); + +// Do not archive the instants after the earliest commit (COMMIT, DELTA_COMMIT, and +// REPLACE_COMMIT only, considering non-savepoint commit only if enabling archive +// beyond savepoint) and the earliest inflight instant (all actions). +// This is required by metadata table, see HoodieTableMetadataUtil#processRollbackMetadata +// for details. +// Todo: Remove #7580 Review Comment: We should keep at least the clean commit on the active timeline, right? What about the rollback?
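The numbered retention rules in the diff above all feed one pattern: each rule contributes an optional "oldest instant to retain," and archival is bounded by the earliest candidate present. A minimal, simplified sketch of that reduction (a hypothetical helper, not the actual `HoodieTimelineArchiver` code):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class ArchiveBound {
  /**
   * Earliest instant time among the candidates that are present, if any.
   * Hudi instant times are timestamp strings, so lexicographic order
   * matches chronological order.
   */
  static Optional<String> oldestToRetain(List<Optional<String>> candidates) {
    return candidates.stream()
        .filter(Optional::isPresent)
        .map(Optional::get)
        .min(Comparator.naturalOrder());
  }
}
```

Each rule (compaction retention, clustering retention, metadata-table compaction, etc.) simply appends its candidate; an absent candidate (`Optional.empty()`) imposes no bound, and the archiver then archives only instants strictly before the minimum.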
[jira] [Updated] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6596: -- Epic Link: HUDI-1456 Reviewers: Sagar Sumit > Propose rollback implementation changes to guard against concurrent jobs > - > > Key: HUDI-6596 > URL: https://issues.apache.org/jira/browse/HUDI-6596 > Project: Apache Hudi > Issue Type: Wish >Reporter: Krishen Bhan >Priority: Trivial > Fix For: 1.0.0 > > > h1. Issue > The existing rollback API in 0.14 > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877] > executes a rollback plan, either taking in an existing rollback plan > provided by the caller for a previous rollback or attempt, or scheduling a > new rollback instant if none is provided. Currently it is not safe for two > concurrent jobs to call this API (when skipLocking=False and the callers > aren't already holding a lock), as this can lead to an issue where multiple > rollback requested plans are created or two jobs are executing the same > rollback instant at the same time. > h1. Proposed change > One way to resolve this issue is to refactor this rollback function such that > if skipLocking=false, the following steps are followed > # Acquire the table lock > # Reload the active timeline > # Look at the active timeline to see if there is a inflight rollback instant > from a previous rollback attempt, if it exists then assign this is as the > rollback plan to execute. Also, check if a pending rollback plan was passed > in by caller. Then it executes the following steps depending on whether the > caller passed a pending rollback instant plan. > ## [a] If a pending inflight rollback plan was passed in by caller, then > check that there is a previous attempted rollback instant on timeline (and > that the instant times match) and continue to use this rollback plan. 
If that > isn't the case, then raise a rollback exception since this means another job > has concurrently already executed this plan. Note that in a valid HUDI > dataset there can be at most one rollback instant for a corresponding commit > instant, which is why if we no longer see a pending rollback in timeline in > this phase we can safely assume that it had already been executed to > completion. > ## [b] If no pending inflight rollback plan was passed in by caller and no > pending rollback instant was found in timeline earlier, then schedule a new > rollback plan > # Now that a rollback plan and requested rollback instant time has been > assigned, check for an active heartbeat for the rollback instant time. If > there is one, then abort the rollback as that means there is a concurrent job > executing that rollback. If not, then start a heartbeat for that rollback > instant time. > # Release the table lock > # Execute the rollback plan and complete the rollback instant. Regardless of > whether this succeeds or fails with an exception, close the heartbeat. This > increases the chance that the next job that tries to call this rollback API > will follow through with the rollback and not abort due to an active previous > heartbeat > > * These steps will only be enforced for skipLocking=false, since if > skipLocking=true then that means the caller may already be explicitly holding > a table lock. In this case, acquiring the lock again in step (1) will fail. > * Acquiring a lock and reloading timeline for (1-3) will guard against data > race conditions where another job calls this rollback API at same time and > schedules its own rollback plan and instant. This is since if no rollback has > been attempted before for this instant, then before step (1), there is a > window of time where another concurrent rollback job could have scheduled a > rollback plan, failed execution, and cleaned up heartbeat, all while the > current rollback job is running. 
As a result, even if the current job was > passed in an empty pending rollback plan, it still needs to check the active > timeline to ensure that no new rollback pending instant has been created. > * Using a heartbeat will signal to other callers in other jobs that there is > another job already executing this rollback. Checking for expired heartbeat > and (re)-starting the heartbeat has to be done under a lock, so that multiple > jobs don't each start it at the same time and assume that they are the only > ones that are heartbeating. > * The table lock is no longer needed after (5), since it can now be safely > assumed that no other job (calling this rollback API) will execute this > rollback instant. > One example implementation to achieve this: > > {code:java} > @Deprecated > public
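The JIRA's own example implementation is truncated above; the proposed steps can be condensed into the following sketch (illustrative names only, not the actual `BaseHoodieTableServiceClient` API): claim the rollback instant under the table lock via a heartbeat, release the lock, execute, then close the heartbeat whether execution succeeds or fails:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

public class HeartbeatRollback {
  private final ReentrantLock tableLock = new ReentrantLock();
  // Rollback instants with an active heartbeat (stand-in for heartbeat files).
  private final Set<String> activeHeartbeats = new HashSet<>();

  /** Returns false if another job is already executing this rollback. */
  public boolean rollback(String rollbackInstant, Runnable executePlan) {
    tableLock.lock();
    try {
      // Steps 1-4: under the lock, abort if another job holds the heartbeat,
      // otherwise start our own heartbeat for this rollback instant.
      if (!activeHeartbeats.add(rollbackInstant)) {
        return false;
      }
    } finally {
      tableLock.unlock(); // step 5: release before the (possibly long) execution
    }
    try {
      executePlan.run();  // step 6: execute the rollback plan
      return true;
    } finally {
      tableLock.lock();
      try {
        activeHeartbeats.remove(rollbackInstant); // close heartbeat regardless of outcome
      } finally {
        tableLock.unlock();
      }
    }
  }
}
```

Closing the heartbeat in a `finally` block mirrors the proposal's point that a failed execution must not leave a live heartbeat behind, so the next job can retry instead of aborting.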
[jira] [Updated] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6596: -- Fix Version/s: 1.0.0 > Propose rollback implementation changes to guard against concurrent jobs > - > > Key: HUDI-6596 > URL: https://issues.apache.org/jira/browse/HUDI-6596 > Project: Apache Hudi > Issue Type: Wish >Reporter: Krishen Bhan >Priority: Trivial > Fix For: 1.0.0 > > > h1. Issue > The existing rollback API in 0.14 > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877] > executes a rollback plan, either taking in an existing rollback plan > provided by the caller for a previous rollback or attempt, or scheduling a > new rollback instant if none is provided. Currently it is not safe for two > concurrent jobs to call this API (when skipLocking=False and the callers > aren't already holding a lock), as this can lead to an issue where multiple > rollback requested plans are created or two jobs are executing the same > rollback instant at the same time. > h1. Proposed change > One way to resolve this issue is to refactor this rollback function such that > if skipLocking=false, the following steps are followed > # Acquire the table lock > # Reload the active timeline > # Look at the active timeline to see if there is a inflight rollback instant > from a previous rollback attempt, if it exists then assign this is as the > rollback plan to execute. Also, check if a pending rollback plan was passed > in by caller. Then it executes the following steps depending on whether the > caller passed a pending rollback instant plan. > ## [a] If a pending inflight rollback plan was passed in by caller, then > check that there is a previous attempted rollback instant on timeline (and > that the instant times match) and continue to use this rollback plan. 
If that > isn't the case, then raise a rollback exception, since this means another job > has concurrently already executed this plan. Note that in a valid HUDI > dataset there can be at most one rollback instant for a corresponding commit > instant, which is why, if we no longer see a pending rollback in the timeline at > this phase, we can safely assume that it has already been executed to > completion. > ## [b] If no pending inflight rollback plan was passed in by the caller and no > pending rollback instant was found in the timeline earlier, then schedule a new > rollback plan > # Now that a rollback plan and requested rollback instant time have been > assigned, check for an active heartbeat for the rollback instant time. If > there is one, then abort the rollback, as that means there is a concurrent job > executing that rollback. If not, then start a heartbeat for that rollback > instant time. > # Release the table lock > # Execute the rollback plan and complete the rollback instant. Regardless of > whether this succeeds or fails with an exception, close the heartbeat. This > increases the chance that the next job that tries to call this rollback API > will follow through with the rollback and not abort due to an active previous > heartbeat > > * These steps will only be enforced for skipLocking=false, since if > skipLocking=true then that means the caller may already be explicitly holding > a table lock. In this case, acquiring the lock again in step (1) will fail. > * Acquiring a lock and reloading the timeline for (1-3) will guard against data > race conditions where another job calls this rollback API at the same time and > schedules its own rollback plan and instant. This is because, if no rollback has > been attempted before for this instant, then before step (1) there is a > window of time where another concurrent rollback job could have scheduled a > rollback plan, failed execution, and cleaned up its heartbeat, all while the > current rollback job is running. 
As a result, even if the current job was > passed in an empty pending rollback plan, it still needs to check the active > timeline to ensure that no new rollback pending instant has been created. > * Using a heartbeat will signal to other callers in other jobs that there is > another job already executing this rollback. Checking for expired heartbeat > and (re)-starting the heartbeat has to be done under a lock, so that multiple > jobs don't each start it at the same time and assume that they are the only > ones that are heartbeating. > * The table lock is no longer needed after (5), since it can now be safely > assumed that no other job (calling this rollback API) will execute this > rollback instant. > One example implementation to achieve this: > > {code:java} > @Deprecated > public boolean rollback(final Str
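The locking-plus-heartbeat flow described in steps 1-6 above can be sketched as a dependency-free simulation. All names below are hypothetical stand-ins, not Hudi APIs, and the in-memory sets stand in for the active timeline and the heartbeat files on storage:

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the proposed skipLocking=false flow.
class RollbackCoordinator {
    private final ReentrantLock tableLock = new ReentrantLock();
    private final Set<String> pendingRollbacks = new HashSet<>(); // active-timeline stand-in
    private final Set<String> activeHeartbeats = new HashSet<>(); // heartbeat-file stand-in

    /** Returns true if this job executed the rollback, false if it aborted. */
    public boolean rollback(String instantToRollback, Optional<String> callerPlan) {
        String plan;
        tableLock.lock();                                          // step 1: acquire table lock
        try {
            // step 2 would reload the active timeline here
            boolean onTimeline = pendingRollbacks.contains(instantToRollback); // step 3
            if (callerPlan.isPresent()) {
                if (!onTimeline) {                                 // step 3a: another job already executed it
                    throw new IllegalStateException("rollback already executed by another job");
                }
                plan = callerPlan.get();
            } else if (onTimeline) {
                plan = instantToRollback;                          // reuse the pending plan
            } else {
                plan = instantToRollback;                          // step 3b: schedule a new plan
                pendingRollbacks.add(plan);
            }
            if (activeHeartbeats.contains(plan)) {                 // step 4: concurrent executor detected
                return false;                                      // abort
            }
            activeHeartbeats.add(plan);                            // start the heartbeat
        } finally {
            tableLock.unlock();                                    // step 5: release table lock
        }
        try {
            pendingRollbacks.remove(plan);                         // step 6: execute + complete the instant
            return true;
        } finally {
            activeHeartbeats.remove(plan);                         // always close the heartbeat
        }
    }
}
```

A real implementation would also need heartbeat expiry (a crashed executor must not block rollbacks forever), which this sketch omits.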
[GitHub] [hudi] danny0405 merged pull request #9464: [MINOR] StreamerUtil#getTableConfig should check whether hoodie.properties exists
danny0405 merged PR #9464: URL: https://github.com/apache/hudi/pull/9464 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [MINOR] StreamerUtil#getTableConfig should check whether hoodie.properties exists (#9464)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new ba5ab8ca468 [MINOR] StreamerUtil#getTableConfig should check whether hoodie.properties exists (#9464) ba5ab8ca468 is described below commit ba5ab8ca46863a67023e7172fb16a9a36d3b5acb Author: Nicholas Jiang AuthorDate: Fri Aug 18 10:03:12 2023 +0800 [MINOR] StreamerUtil#getTableConfig should check whether hoodie.properties exists (#9464) --- .../hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java index 4912c0abf03..842e732abd4 100644 --- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java +++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java @@ -312,7 +312,7 @@ public class StreamerUtil { FileSystem fs = FSUtils.getFs(basePath, hadoopConf); Path metaPath = new Path(basePath, HoodieTableMetaClient.METAFOLDER_NAME); try { - if (fs.exists(metaPath)) { + if (fs.exists(new Path(metaPath, HoodieTableConfig.HOODIE_PROPERTIES_FILE))) { return Option.of(new HoodieTableConfig(fs, metaPath.toString(), null, null)); } } catch (IOException e) {
[hudi] branch master updated: [HUDI-6476][FOLLOW-UP] Path filter by FileStatus to avoid additional fs request (#9366)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 7fbf7a36690 [HUDI-6476][FOLLOW-UP] Path filter by FileStatus to avoid additional fs request (#9366) 7fbf7a36690 is described below commit 7fbf7a366900536053c4333dc7d6f4d0ad9b06b4 Author: Wechar Yu AuthorDate: Fri Aug 18 09:43:48 2023 +0800 [HUDI-6476][FOLLOW-UP] Path filter by FileStatus to avoid additional fs request (#9366) --- .../metadata/FileSystemBackedTableMetadata.java| 95 ++ 1 file changed, 41 insertions(+), 54 deletions(-) diff --git a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java index b4a4da01977..8ea9861734a 100644 --- a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java +++ b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java @@ -54,6 +54,7 @@ import java.util.List; import java.util.Map; import java.util.concurrent.CopyOnWriteArrayList; import java.util.stream.Collectors; +import java.util.stream.Stream; /** * Implementation of {@link HoodieTableMetadata} based file-system-backed table metadata. @@ -167,66 +168,52 @@ public class FileSystemBackedTableMetadata extends AbstractHoodieTableMetadata { // TODO: Get the parallelism from HoodieWriteConfig int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size()); - // List all directories in parallel + // List all directories in parallel: + // if current directory contains PartitionMetadata, add it to result + // if current directory does not contain PartitionMetadata, add its subdirectory to queue to be processed. 
engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all partitions with prefix " + relativePathPrefix); - List<FileStatus> dirToFileListing = engineContext.flatMap(pathsToList, path -> { + // result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. + // and second entry holds optionally a directory path to be processed further. + List<Pair<Option<String>, Option<Path>>> result = engineContext.flatMap(pathsToList, path -> { FileSystem fileSystem = path.getFileSystem(hadoopConf.get()); -return Arrays.stream(fileSystem.listStatus(path)); +if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) { + return Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), path)), Option.empty())); +} +return Arrays.stream(fileSystem.listStatus(path)) +.filter(status -> status.isDirectory() && !status.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) +.map(status -> Pair.of(Option.empty(), Option.of(status.getPath()))); }, listingParallelism); pathsToList.clear(); - // if current directory contains PartitionMetadata, add it to result - // if current directory does not contain PartitionMetadata, add it to queue to be processed. - int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size()); - if (!dirToFileListing.isEmpty()) { -// result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. -// and second entry holds optionally a directory path to be processed further. 
-engineContext.setJobStatus(this.getClass().getSimpleName(), "Processing listed partitions"); -List<Pair<Option<String>, Option<Path>>> result = engineContext.map(dirToFileListing, fileStatus -> { - FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get()); - if (fileStatus.isDirectory()) { -if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) { - return Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), fileStatus.getPath())), Option.empty()); -} else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) { - return Pair.of(Option.empty(), Option.of(fileStatus.getPath())); -} - } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) { -String partitionName = FSUtils.getRelativePartitionPath(dataBasePath.get(), fileStatus.getPath().getParent()); -return Pair.of(Option.of(partitionName), Option.empty()); - } - return Pair.of(Option.empty(), Option.empty()); -}, fileListingParallelism); -partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()) -.map(entry -> entry.getKey().get())
[GitHub] [hudi] danny0405 commented on a diff in pull request #9455: [WIP] Connection release fixes for RLI metadata
danny0405 commented on code in PR #9455: URL: https://github.com/apache/hudi/pull/9455#discussion_r1297898775 ## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java: ## @@ -204,7 +212,9 @@ protected ClosableIterator getIndexedRecordIterator(Schema reader } catch (IOException e) { throw new HoodieIOException("Instantiation HfileScanner failed for " + reader.getHFileInfo().toString()); } -return new RecordIterator(scanner, getSchema(), readerSchema); +RecordIterator iterator = new RecordIterator(scanner, getSchema(), readerSchema); +recordIterators.add(iterator); Review Comment: Shouldn't these iterators be closed by the caller? We should check every invocation of the method and make sure the iterator gets closed. Keeping all the references of the iterators is not an elegant way; maybe here we want to support multiple iterators on one reader and to reuse the reader each time. Instead, we should instantiate a new reader for each iterator. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9455: [WIP] Connection release fixes for RLI metadata
danny0405 commented on code in PR #9455: URL: https://github.com/apache/hudi/pull/9455#discussion_r1297898113 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java: ## @@ -193,15 +212,33 @@ protected ClosableIterator> lookupRecords(List sorte blockContentLoc.getContentPositionInLogFile(), blockContentLoc.getBlockSize()); -final HoodieAvroHFileReader reader = +HoodieAvroHFileReader reader = new HoodieAvroHFileReader(inlineConf, inlinePath, new CacheConfig(inlineConf), inlinePath.getFileSystem(inlineConf), Option.of(getSchemaFromHeader())); // Get writer's schema from the header final ClosableIterator> recordIterator = fullKey ? reader.getRecordsByKeysIterator(sortedKeys, readerSchema) : reader.getRecordsByKeyPrefixIterator(sortedKeys, readerSchema); -return new CloseableMappingIterator<>(recordIterator, data -> (HoodieRecord) data); +ClosableIterator> iterator = new ClosableIterator>() { + @Override + public void close() { +recordIterator.close(); +reader.close(); Review Comment: ditto -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9455: [WIP] Connection release fixes for RLI metadata
danny0405 commented on code in PR #9455: URL: https://github.com/apache/hudi/pull/9455#discussion_r1297897760 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java: ## @@ -175,7 +175,26 @@ protected ClosableIterator> deserializeRecords(byte[] conten FileSystem fs = FSUtils.getFs(pathForReader.toString(), FSUtils.buildInlineConf(getBlockContentLocation().get().getHadoopConf())); // Read the content HoodieAvroHFileReader reader = new HoodieAvroHFileReader(fs, pathForReader, content, Option.of(getSchemaFromHeader())); -return unsafeCast(reader.getRecordIterator(readerSchema)); + +ClosableIterator> recordIterator = reader.getRecordIterator(readerSchema); +ClosableIterator> iterator = new ClosableIterator>() { + @Override + public void close() { +recordIterator.close(); +reader.close(); Review Comment: Isn't the `recordIterator.close()` just closing the reader? Why nest it in another iterator? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
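The pattern under review, an iterator that owns its reader so that a single close() releases both resources, can be sketched independently of Hudi. All names below are illustrative stand-ins, not the real HoodieAvroHFileReader classes:

```java
import java.util.Arrays;
import java.util.Iterator;

// Illustrative stand-in for a reader that holds an open file handle.
class StubReader implements AutoCloseable {
    boolean closed = false;
    @Override
    public void close() { closed = true; }
}

// An iterator that owns its reader: the caller's single close() releases
// both the iteration state and the underlying reader.
class ReaderOwningIterator<T> implements Iterator<T>, AutoCloseable {
    private final Iterator<T> delegate;
    private final AutoCloseable reader;

    ReaderOwningIterator(Iterator<T> delegate, AutoCloseable reader) {
        this.delegate = delegate;
        this.reader = reader;
    }

    @Override public boolean hasNext() { return delegate.hasNext(); }
    @Override public T next() { return delegate.next(); }

    @Override
    public void close() {
        try {
            reader.close(); // closing the iterator releases the reader too
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

With this shape the reader tracks nothing itself; each call that hands out an iterator would instantiate a fresh reader and transfer ownership to the iterator, which is the alternative the reviewer suggests over keeping a list of open iterators inside the reader.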
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1683204951 ## CI report: * 751b8aca531eb397d30fd95637bcf7a1e97a6c08 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19324) * a0db166250fe0220494b18b0c0d343d1a3adae7b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19342) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
danny0405 commented on code in PR #9459: URL: https://github.com/apache/hudi/pull/9459#discussion_r1297895255 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java: ## @@ -479,4 +479,19 @@ private Map getGroupOffsets(KafkaConsumer consumer, Set
[GitHub] [hudi] danny0405 commented on a diff in pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
danny0405 commented on code in PR #9459: URL: https://github.com/apache/hudi/pull/9459#discussion_r1297894861 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java: ## @@ -479,4 +479,19 @@ private Map getGroupOffsets(KafkaConsumer consumer, Set
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1683197113 ## CI report: * 751b8aca531eb397d30fd95637bcf7a1e97a6c08 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19324) * a0db166250fe0220494b18b0c0d343d1a3adae7b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] BBency commented on issue #9094: Async Clustering failing with errors for MOR table
BBency commented on issue #9094: URL: https://github.com/apache/hudi/issues/9094#issuecomment-1683116028 I was able to make the clustering work on a test job, but it is failing when I apply the same clustering configs on the production table. It fails with the error: py4j.protocol.Py4JJavaError: An error occurred while calling o97.sql. : org.apache.hudi.exception.HoodieClusteringException: **Clustering failed to write to files:** 3b43f625-3095-4834-ab45-beade1dbbfa5-0 at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:381) at org.apache.hudi.client.SparkRDDWriteClient.completeTableService(SparkRDDWriteClient.java:468) What parameters should I consider while specifying the values for hoodie.clustering.plan.strategy.max.num.groups, hoodie.clustering.plan.strategy.small.file.limit, hoodie.clustering.plan.strategy.target.file.max.bytes and hoodie.clustering.plan.strategy.max.bytes.per.group? Can you provide some guidance on this, please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
hudi-bot commented on PR #9468: URL: https://github.com/apache/hudi/pull/9468#issuecomment-1683059173 ## CI report: * ac44e8c1ee6266c53a613ec96dbd89a7223da4c7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19341) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-4756) Clean up usages of "assume.date.partition" config within hudi
[ https://issues.apache.org/jira/browse/HUDI-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755708#comment-17755708 ] Lin Liu commented on HUDI-4756: --- Talked with [~shivnarayan] offline, who confirmed that this task has been there for a while, and it is a bit tricky to figure out if this configuration has been used in prod. Therefore, as a next step, we will check the impact of this configuration in a few cases: # Set it to false for a non-date-partitioned table, and then set it to true. # Set it to true for a non-date-partitioned table, and then set it to false. # Set it to false for a date-partitioned table, and then set it to true. # Set it to true for a date-partitioned table, and then set it to false. > Clean up usages of "assume.date.partition" config within hudi > - > > Key: HUDI-4756 > URL: https://issues.apache.org/jira/browse/HUDI-4756 > Project: Apache Hudi > Issue Type: Improvement > Components: configs >Reporter: sivabalan narayanan >Assignee: Lin Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Looks like "assume.date.partition" is not used anywhere within hudi. Let's > clean up the usages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-6712: -- Status: In Progress (was: Open) > Implement optimized keyed lookup on parquet files > - > > Key: HUDI-6712 > URL: https://issues.apache.org/jira/browse/HUDI-6712 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Vinoth Chandar >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > Parquet performs poorly when looking up specific records based > on a single key lookup column. > e.g: select * from parquet where key in ("a", "b", "c") (SQL) > e.g: List lookup(parquetFile, Set keys) (code) > Let's implement a reader that is optimized for this pattern, by scanning the > least amount of data. > Requirements: > 1. Need to support multiple values for the same key. > 2. Can assume the file is sorted by the key/lookup field. > 3. Should handle non-existence of keys. > 4. Should leverage parquet metadata (bloom filters, column index, ... ) to > minimize data read. > 5. Must do the minimum amount of RPC calls to cloud storage. -- This message was sent by Atlassian Jira (v8.20.10#820010)
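The shape of the lookup the requirements describe, sorted data plus per-row-group min/max statistics used to prune reads, can be sketched in dependency-free Java. RowGroup and SortedFileLookup below are hypothetical stand-ins for parquet row-group metadata, and the sketch assumes all duplicates of a key land in a single row group:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for one parquet row group: its min/max key statistics
// plus its key-sorted rows. Duplicates of a key are allowed (req 1) but are
// assumed not to straddle a row-group boundary in this sketch.
class RowGroup {
    final String minKey, maxKey;
    final List<String[]> rows; // each row is {key, value}, sorted by key

    RowGroup(String minKey, String maxKey, List<String[]> rows) {
        this.minKey = minKey;
        this.maxKey = maxKey;
        this.rows = rows;
    }
}

class SortedFileLookup {
    private final List<RowGroup> groups; // ordered, non-overlapping key ranges

    SortedFileLookup(List<RowGroup> groups) { this.groups = groups; }

    /** All values for the given keys; absent keys simply produce no entries (req 3). */
    public List<String> lookup(Set<String> keys) {
        List<String> out = new ArrayList<>();
        for (String key : new TreeSet<>(keys)) {      // visit keys in sorted order
            int lo = 0, hi = groups.size() - 1;
            while (lo <= hi) {                        // binary search over min/max ranges (req 4)
                int mid = (lo + hi) / 2;
                RowGroup g = groups.get(mid);
                if (key.compareTo(g.minKey) < 0) {
                    hi = mid - 1;
                } else if (key.compareTo(g.maxKey) > 0) {
                    lo = mid + 1;
                } else {
                    // only this row group is "read"; collect every match (req 1)
                    for (String[] row : g.rows) {
                        if (row[0].equals(key)) {
                            out.add(row[1]);
                        }
                    }
                    break;
                }
            }
        }
        return out;
    }
}
```

Against real parquet the min/max would come from row-group column statistics or the column index, with bloom filters and page-level stats shrinking the read further; batching the sorted keys per row group keeps the storage RPC count at roughly one read per touched row group (req 5).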
[GitHub] [hudi] kazdy closed pull request #7547: [DOCS] add DROP TABLE, TRUNCATE TABLE docs to spark quick start guide, minor syntax fixes to ALTER TABLE docs
kazdy closed pull request #7547: [DOCS] add DROP TABLE, TRUNCATE TABLE docs to spark quick start guide, minor syntax fixes to ALTER TABLE docs URL: https://github.com/apache/hudi/pull/7547 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] kazdy closed pull request #7548: [DOCS] fix when I click on Update or MergeInto link in spark quickstart it d…
kazdy closed pull request #7548: [DOCS] fix when I click on Update or MergeInto link in spark quickstart it d… URL: https://github.com/apache/hudi/pull/7548 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751866#comment-17751866 ] Krishen Bhan edited comment on HUDI-6596 at 8/17/23 7:47 PM: - I was going to create my PR [https://github.com/kbuci/hudi/pull/2] for this change on the hudi repo, but realized there was an issue since the assumptions made in the rollback implementation (both the existing one and my proposed change) where {{org.apache.hudi.client.BaseHoodieTableServiceClient#rollback(org.apache.hudi.table.HoodieTable, java.lang.String, org.apache.hudi.common.util.Option, java.lang.String)}} is inconsistent with the changes here in [https://github.com/apache/hudi/pull/8849] Specifically, {{org.apache.hudi.client.BaseHoodieTableServiceClient#rollback(org.apache.hudi.table.HoodieTable, java.lang.String, org.apache.hudi.common.util.Option, java.lang.String)}} seems to have been implemented (base on code and comments) under the assumption that a rollback operation will delete all instant files from {{commit instant to rollback}} before completing the rollback operation itself, which is what I had thought when I was working on my rollback fix(es). But it seems that after [https://github.com/apache/hudi/pull/8849] this is (retroactively) incorrect as now we are deleting instant files from {{commit instant to rollback}} after completing the rollback instant, leaving rollback operation as a special type of case where it is possible for rollback instant to be complete even if the actual rollback operation has not fully completed (due to failing after completing the rollback instant but before cleaning up instant files of `{{{}commit instant to rollback{}}} ). 
Although [https://github.com/apache/hudi/pull/8849] handles this by delegating the deleting of instant files from `{{{}commit instant to rollback{}}} to some clean rollbackFailedWrites operation, I think the intention/invariants/rules of how rollback operates is a bit ambiguous to me and something that should be reconciled. To further add complexity, it seems that based on [https://github.com/kbuci/hudi/blob/35be9bbbc7ef7ae6ad0a4955da78da4c0463074f/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L630] it is also currently legal to remove a request rollback plan, in other words "rolling back" a pending rollback plan. Also after taking another look at the reason for [https://github.com/apache/hudi/pull/8849] I think the fix there can be reverted and handled alternatively, since it seems to me that fixing/finding bugs with getPendingRollbackInfo and preventing concurrent rollback scheduling/execution might prevent the underlying issue/reason for PR in the first place was (Author: JIRAUSER301521): I was going to create my PR [https://github.com/kbuci/hudi/pull/2] for this change on the hudi repo, but realized there was an issue since the assumptions made in the rollback implementation (both the existing one and my proposed change) where {{org.apache.hudi.client.BaseHoodieTableServiceClient#rollback(org.apache.hudi.table.HoodieTable, java.lang.String, org.apache.hudi.common.util.Option, java.lang.String)}} is inconsistent with the changes here in [https://github.com/apache/hudi/pull/8849] Specifically, {{org.apache.hudi.client.BaseHoodieTableServiceClient#rollback(org.apache.hudi.table.HoodieTable, java.lang.String, org.apache.hudi.common.util.Option, java.lang.String)}} seems to have been implemented (base on code and comments) under the assumption that a rollback operation will delete all instant files from {{commit instant to rollback}} before completing the rollback operation itself, which is what I had thought 
when I was working on my rollback fix(es). But it seems that after [https://github.com/apache/hudi/pull/8849] this is (retroactively) incorrect as now we are deleting instant files from {{commit instant to rollback}} after completing the rollback instant, leaving rollback operation as a special type of case where it is possible for rollback instant to be complete even if the actual rollback operation has not fully completed (due to failing after completing the rollback instant but before cleaning up instant files of `{{{}commit instant to rollback{}}} ). Although [https://github.com/apache/hudi/pull/8849] handles this by delegating the deleting of instant files from `{{{}commit instant to rollback{}}} to some clean rollbackFailedWrites operation, I think the intention/invariants/rules of how rollback operates is a bit ambiguous to me and something that should be reconciled. Also after taking another look at the reason for [https://github.com/apache/hudi/pull/8849] I think the fix there can be reverted and handled alternatively, since it seems to me that fixing/finding bugs with getPendingRollbackInfo and prevent
[GitHub] [hudi] praneethh opened a new issue, #9469: [SUPPORT] Exception when using MERGE INTO
praneethh opened a new issue, #9469: URL: https://github.com/apache/hudi/issues/9469 I'm trying to use merge into and perform partial update on the target data but getting the following error: ``` java.lang.UnsupportedOperationException: MERGE INTO TABLE is not supported temporarily. at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:718) at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:67) at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78) ``` Steps to reproduce: 1) Load the target table ``` val df = Seq(("1","neo","2023-08-04 12:00:00","2023-08-04 12:00:00","2023-08-04")).toDF("emp_id", "emp_name", "log_ts", "load_ts", "log_dt") df.select(col("emp_id").cast("int"),col("emp_name").cast("string"),col("log_ts").cast("timestamp"),col("load_ts").cast("timestamp"),col("log_dt").cast("date")) res0.write.format("hudi") .option("hoodie.payload.ordering.field", "load_ts") .option("hoodie.datasource.write.recordkey.field", "emp_id") .option("hoodie.datasource.write.partitionpath.field", "log_dt") .option("hoodie.index.type","GLOBAL_SIMPLE") .option("hoodie.table.name", "hudi_test") .option("hoodie.simple.index.update.partition.path", "false") .option("hoodie.datasource.write.precombine.field", "load_ts") .option("hoodie.datasource.write.payload.class","org.apache.hudi.common.model.PartialUpdateAvroPayload") .option("hoodie.datasource.write.reconcile.schema","true") .option("hoodie.schema.on.read.enable","true") .option("hoodie.datasource.write.hive_style_partitioning", "true") 
.option("hoodie.datasource.write.row.writer.enable","false") .option("hoodie.datasource.hive_sync.enable","true") .option("hoodie.datasource.hive_sync.database","pharpan") .option("hoodie.datasource.hive_sync.table", "hudi_test") .option("hoodie.datasource.hive_sync.partition_fields", "partitionId") .option("hoodie.datasource.hive_sync.ignore_exceptions", "true") .option("hoodie.datasource.hive_sync.mode", "hms") .option("hoodie.datasource.hive_sync.use_jdbc", "false") .option("hoodie.datasource.write.operation","upsert") .mode("append") .save("gs://sample_bucket/hudi_sample_output_data") ``` 2) Load the incremental data ``` val df2 = Seq(("1","neo","2023-08-05 14:00:00","2023-08-04 12:00:00","2023-08-05"),("2","trinity","2023-08-05 14:00:00","2023-08-05 15:00:00","2023-08-05")).toDF("emp_id", "emp_name", "log_ts","load_ts","log_dt") df2.select(col("emp_id").cast("int"),col("emp_name").cast("string"),col("log_ts").cast("timestamp"),col("load_ts").cast("timestamp"),col("log_dt").cast("date")) res2.createOrReplaceTempView("incremental_data") ``` 3) Perform merge ``` val sqlPartialUpdate = s""" | merge into pharpan.hudi_test as target | using ( | select * from incremental_data | ) source | on target.emp_id = source.emp_id | when matched then | update set target.log_ts = source.log_ts, target.log_dt = source.log_dt | when not matched then insert * """.stripMargin spark.sql(sqlPartialUpdate) ``` Hudi verison: 0.13.1 Using "org.apache.hudi.common.model.PartialUpdateAvroPayload" for partial update. Can someone please help in resolving this error? Also, please share the documentation on using MERGE INTO if I'm using it in the wrong way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
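One known cause of `MERGE INTO TABLE is not supported temporarily` is launching Spark without Hudi's SQL extensions, so the plan falls through to Spark's default strategy instead of Hudi's MERGE INTO command. A hedged suggestion, exact settings depend on the Spark/Hudi versions in the deployment (the catalog setting applies to Spark 3.2+, and the bundle coordinates below are only an example):

```shell
# Hudi's SQL extension rewrites MERGE INTO into a Hudi command; without it,
# Spark's own planner raises "MERGE INTO TABLE is not supported temporarily".
spark-shell \
  --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
```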
[GitHub] [hudi] hudi-bot commented on pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
hudi-bot commented on PR #9459: URL: https://github.com/apache/hudi/pull/9459#issuecomment-1682849353 ## CI report: * 170678f0e7c429406a4565d85e77367908c1fb4b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19340) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755672#comment-17755672 ] Krishen Bhan commented on HUDI-6596: {quote} I think we should use a different name to skipLocking, if acquiring the lock is skipped because we already acquired the lock, then we should use a different variable something like isLockAcquired or something. {quote} I think that was the convention I noticed, but sure, I can address that once I post the PR for review, thanks! {quote} Without complicating the rollback logic, let us see all the cases where we use rollback. 1. Rollback failed writes: Lock has to be acquired until scheduling the rollback plans for pending instantsToRollback and for execution it need not acquire a lock. 2. Rollback a specific instant: Only schedule step needs to be under a lock. 3. Restore operation: Entire operation needs to be under a lock. For rollbackFailedWrites method, break it down to two Stages Stage 1: Scheduling stage Step 1: Acquire lock and reload active timeline Step 2: getInstantsToRollback Step 3: removeInflightFilesAlreadyRolledBack Step 4: getPendingRollbackInfos Step 5: Use existing plan or schedule rollback Step 6: Release lock Stage 2: Execution stage Step 7: Check if heartbeat exist for pending rollback plan. If yes abort else start an heartbeat and proceed further for executing it. {quote} For now, the intention in this ticket is to focus just on (2) `Rollback a specific instant:` . Depending on how this implementation goes, I think we could follow your approach for (1) `rollbackFailedWrites` when I create a ticket to address that. Sorry, I should rename this JIRA ticket to clarify that. {quote} Rollback operation are not that common. We only do rollback if something fails. So, it is not like .clean or .commit operations. So, we should be ok in seeing some noise. 
{quote} The issue is that although the chance of an individual job transiently failing on an upsert is low, as we add more concurrent writers to our pool of upsert jobs on a dataset, the chance that at least one upsert job will fail increases. In addition, given underlying infrastructure (like Spark/YARN) service degradations, which we've seen internally in our organization, it's possible that all writers fail during an upsert/rollback in the same window of time. This means that we should try to gracefully/resiliently account for the chance that there is a concurrent rollback going on during a job's upsert operation, or even a concurrent rollback that has itself failed. Although locking the table during a rollback is out of the question, we can still go with an approach like the one I suggested in https://issues.apache.org/jira/browse/HUDI-6596?focusedCommentId=17751201&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17751201 , to greatly reduce the chance that sporadic rollbacks/failures will cause all concurrent upsert jobs to fail. > Propose rollback implementation changes to guard against concurrent jobs > - > > Key: HUDI-6596 > URL: https://issues.apache.org/jira/browse/HUDI-6596 > Project: Apache Hudi > Issue Type: Wish >Reporter: Krishen Bhan >Priority: Trivial > > h1. Issue > The existing rollback API in 0.14 > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877] > executes a rollback plan, either taking in an existing rollback plan > provided by the caller for a previous rollback attempt, or scheduling a > new rollback instant if none is provided. 
Currently it is not safe for two > concurrent jobs to call this API (when skipLocking=False and the callers > aren't already holding a lock), as this can lead to an issue where multiple > rollback requested plans are created or two jobs are executing the same > rollback instant at the same time. > h1. Proposed change > One way to resolve this issue is to refactor this rollback function such that > if skipLocking=false, the following steps are followed > # Acquire the table lock > # Reload the active timeline > # Look at the active timeline to see if there is a inflight rollback instant > from a previous rollback attempt, if it exists then assign this is as the > rollback plan to execute. Also, check if a pending rollback plan was passed > in by caller. Then it executes the following steps depending on whether the > caller passed a pending rollback instant plan. > ## [a] If a pending inflight rollback plan was passed in by caller, then > check that there is a previous attempted rollback instant on timeline (and > that the instant times match) and continue to use this rollback plan.
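The lock-scoping proposal discussed above (schedule the rollback under the table lock, execute it outside the lock guarded by a heartbeat) can be sketched as a small self-contained model. All names here (`RollbackCoordinator`, `scheduleUnderLock`, the boolean heartbeat flag) are hypothetical illustrations, not Hudi's actual timeline/transaction API; the point is only the shape of the two stages:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the proposed two-stage rollback; stands in for
// Hudi's real timeline, transaction manager, and heartbeat machinery.
class RollbackCoordinator {
    private final ReentrantLock tableLock = new ReentrantLock();
    private final Set<String> requestedPlans = new HashSet<>();
    private final Set<String> completed = new HashSet<>();

    /** Stage 1: scheduling, done entirely under the table lock. */
    String scheduleUnderLock(String instantToRollback) {
        tableLock.lock();
        try {
            // "Reload the active timeline": re-check state under the lock so
            // two writers cannot create duplicate rollback plans.
            if (completed.contains(instantToRollback)) {
                return null; // already rolled back by a concurrent job
            }
            requestedPlans.add(instantToRollback); // reuse existing plan or create one
            return instantToRollback;
        } finally {
            tableLock.unlock();
        }
    }

    /** Stage 2: execution, outside the lock, guarded by a heartbeat check. */
    boolean execute(String plan, boolean heartbeatExists) {
        if (plan == null || heartbeatExists) {
            return false; // another writer is actively executing; abort
        }
        // (a real implementation would start its own heartbeat here,
        //  then run the rollback plan)
        requestedPlans.remove(plan);
        completed.add(plan);
        return true;
    }
}
```

Scheduling is idempotent (a second caller gets the same plan back rather than a duplicate requested instant), and execution aborts when a live heartbeat shows another writer already owns the plan, which is the failure mode described in this ticket.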
[jira] [Commented] (HUDI-4756) Clean up usages of "assume.date.partition" config within hudi
[ https://issues.apache.org/jira/browse/HUDI-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755664#comment-17755664 ] Lin Liu commented on HUDI-4756: --- [~shivnarayan], would you please shed some light on the background of this task, e.g., why isn't this configuration used, and how can we check that all usages of this configuration have been removed from our products? Thanks. > Clean up usages of "assume.date.partition" config within hudi > - > > Key: HUDI-4756 > URL: https://issues.apache.org/jira/browse/HUDI-4756 > Project: Apache Hudi > Issue Type: Improvement > Components: configs >Reporter: sivabalan narayanan >Assignee: Lin Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > looks like "assume.date.partition" is not used anywhere within hudi. let's > clean up the usages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
hudi-bot commented on PR #9468: URL: https://github.com/apache/hudi/pull/9468#issuecomment-1682773010 ## CI report: * ac44e8c1ee6266c53a613ec96dbd89a7223da4c7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19341) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
hudi-bot commented on PR #9468: URL: https://github.com/apache/hudi/pull/9468#issuecomment-1682762105 ## CI report: * ac44e8c1ee6266c53a613ec96dbd89a7223da4c7 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-6718) Concurrent cleaner commit same instance conflict
[ https://issues.apache.org/jira/browse/HUDI-6718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-6718: -- Status: Patch Available (was: In Progress) > Concurrent cleaner commit same instance conflict > - > > Key: HUDI-6718 > URL: https://issues.apache.org/jira/browse/HUDI-6718 > Project: Apache Hudi > Issue Type: Bug > Components: cleaning, multi-writer, table-service >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > Timeline > > {code:java} > -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 > 20230816195843234.commit.requested > -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 > 20230816195845557.commit.requested > -rw-r--r-- 1 jon wheel 2.2K Aug 16 19:58 20230816195843234.inflight > -rw-r--r-- 1 jon wheel 813B Aug 16 19:58 20230816195845557.inflight > -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195845557.commit > -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195843234.commit > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 > 20230816195855285.clean.requested > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.inflight > -rw-r--r-- 1 jon wheel 1.8K Aug 16 19:58 > 20230816195855389.clean.requested > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean {code} > requests: > {code:java} > avrocat hudi/output/.hoodie/20230816195855285.clean.requested > {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": > "20230816195654386", "action": "commit", "state": "COMPLETED"}}, > "lastCompletedCommitTimestamp": "20230816195845557", "policy": > "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, > "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": > {"1970/01/01": [{"filePath": {"string": > "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": > {"array": []}} {code} > {code:java} > 
avrocat hudi/output/.hoodie/20230816195855389.clean.requested > {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": > "20230816195704584", "action": "commit", "state": "COMPLETED"}}, > "lastCompletedCommitTimestamp": "20230816195845557", "policy": > "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, > "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": > {"1970/01/01": [{"filePath": {"string": > "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}], "1970/01/20": [{"filePath": > {"string": > "file:/tmp/hudi/output/1970/01/20/05942caf-2d53-4345-845c-5e42abaca797-0_0-1454-2121_20230816195635690.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": > {"array": []}} > {code} > Console output: > notice transaction starts twice for the same instance > {code:java} > 424775 [pool-75-thread-1] INFO > org.apache.hudi.table.action.clean.CleanActionExecutor [] - Finishing > previously unfinished cleaner > instant=[==>20230816195855285__clean__INFLIGHT__20230816195855525] > 424775 [pool-75-thread-1] INFO > org.apache.hudi.table.action.clean.CleanActionExecutor [] - Using > cleanerParallelism: 1 > 424779 [pool-91-thread-1] INFO > org.apache.hudi.common.table.timeline.HoodieActiveTimeline [] - Loaded > instants upto : > Option{val=[==>20230816195855389__clean__REQUESTED__20230816195855634]} > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.TransactionManager [] - Transaction > starting for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest > completed transaction instant Optional.empty > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.LockManager [] - LockProvider > org.apache.hudi.client.transaction.lock.InProcessLockProvider > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > 
file:/tmp/hudi/output, Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 0, > Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRING > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > file:/tmp/hudi/output, Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 1, > Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRED > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.TransactionManager [] - Transaction > started for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest > complet
[jira] [Updated] (HUDI-6718) Concurrent cleaner commit same instance conflict
[ https://issues.apache.org/jira/browse/HUDI-6718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-6718: -- Status: In Progress (was: Open) > Concurrent cleaner commit same instance conflict > - > > Key: HUDI-6718 > URL: https://issues.apache.org/jira/browse/HUDI-6718 > Project: Apache Hudi > Issue Type: Bug > Components: cleaning, multi-writer, table-service >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > Timeline > > {code:java} > -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 > 20230816195843234.commit.requested > -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 > 20230816195845557.commit.requested > -rw-r--r-- 1 jon wheel 2.2K Aug 16 19:58 20230816195843234.inflight > -rw-r--r-- 1 jon wheel 813B Aug 16 19:58 20230816195845557.inflight > -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195845557.commit > -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195843234.commit > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 > 20230816195855285.clean.requested > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.inflight > -rw-r--r-- 1 jon wheel 1.8K Aug 16 19:58 > 20230816195855389.clean.requested > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean {code} > requests: > {code:java} > avrocat hudi/output/.hoodie/20230816195855285.clean.requested > {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": > "20230816195654386", "action": "commit", "state": "COMPLETED"}}, > "lastCompletedCommitTimestamp": "20230816195845557", "policy": > "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, > "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": > {"1970/01/01": [{"filePath": {"string": > "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": > {"array": []}} {code} > {code:java} > avrocat 
hudi/output/.hoodie/20230816195855389.clean.requested > {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": > "20230816195704584", "action": "commit", "state": "COMPLETED"}}, > "lastCompletedCommitTimestamp": "20230816195845557", "policy": > "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, > "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": > {"1970/01/01": [{"filePath": {"string": > "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}], "1970/01/20": [{"filePath": > {"string": > "file:/tmp/hudi/output/1970/01/20/05942caf-2d53-4345-845c-5e42abaca797-0_0-1454-2121_20230816195635690.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": > {"array": []}} > {code} > Console output: > notice transaction starts twice for the same instance > {code:java} > 424775 [pool-75-thread-1] INFO > org.apache.hudi.table.action.clean.CleanActionExecutor [] - Finishing > previously unfinished cleaner > instant=[==>20230816195855285__clean__INFLIGHT__20230816195855525] > 424775 [pool-75-thread-1] INFO > org.apache.hudi.table.action.clean.CleanActionExecutor [] - Using > cleanerParallelism: 1 > 424779 [pool-91-thread-1] INFO > org.apache.hudi.common.table.timeline.HoodieActiveTimeline [] - Loaded > instants upto : > Option{val=[==>20230816195855389__clean__REQUESTED__20230816195855634]} > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.TransactionManager [] - Transaction > starting for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest > completed transaction instant Optional.empty > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.LockManager [] - LockProvider > org.apache.hudi.client.transaction.lock.InProcessLockProvider > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > 
file:/tmp/hudi/output, Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 0, > Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRING > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > file:/tmp/hudi/output, Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 1, > Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRED > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.TransactionManager [] - Transaction > started for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest > completed transact
[jira] [Updated] (HUDI-6718) Concurrent cleaner commit same instance conflict
[ https://issues.apache.org/jira/browse/HUDI-6718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6718: - Labels: pull-request-available (was: ) > Concurrent cleaner commit same instance conflict > - > > Key: HUDI-6718 > URL: https://issues.apache.org/jira/browse/HUDI-6718 > Project: Apache Hudi > Issue Type: Bug > Components: cleaning, multi-writer, table-service >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > Timeline > > {code:java} > -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 > 20230816195843234.commit.requested > -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 > 20230816195845557.commit.requested > -rw-r--r-- 1 jon wheel 2.2K Aug 16 19:58 20230816195843234.inflight > -rw-r--r-- 1 jon wheel 813B Aug 16 19:58 20230816195845557.inflight > -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195845557.commit > -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195843234.commit > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 > 20230816195855285.clean.requested > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.inflight > -rw-r--r-- 1 jon wheel 1.8K Aug 16 19:58 > 20230816195855389.clean.requested > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean {code} > requests: > {code:java} > avrocat hudi/output/.hoodie/20230816195855285.clean.requested > {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": > "20230816195654386", "action": "commit", "state": "COMPLETED"}}, > "lastCompletedCommitTimestamp": "20230816195845557", "policy": > "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, > "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": > {"1970/01/01": [{"filePath": {"string": > "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": > {"array": []}} {code} > {code:java} > avrocat 
hudi/output/.hoodie/20230816195855389.clean.requested > {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": > "20230816195704584", "action": "commit", "state": "COMPLETED"}}, > "lastCompletedCommitTimestamp": "20230816195845557", "policy": > "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, > "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": > {"1970/01/01": [{"filePath": {"string": > "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}], "1970/01/20": [{"filePath": > {"string": > "file:/tmp/hudi/output/1970/01/20/05942caf-2d53-4345-845c-5e42abaca797-0_0-1454-2121_20230816195635690.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": > {"array": []}} > {code} > Console output: > notice transaction starts twice for the same instance > {code:java} > 424775 [pool-75-thread-1] INFO > org.apache.hudi.table.action.clean.CleanActionExecutor [] - Finishing > previously unfinished cleaner > instant=[==>20230816195855285__clean__INFLIGHT__20230816195855525] > 424775 [pool-75-thread-1] INFO > org.apache.hudi.table.action.clean.CleanActionExecutor [] - Using > cleanerParallelism: 1 > 424779 [pool-91-thread-1] INFO > org.apache.hudi.common.table.timeline.HoodieActiveTimeline [] - Loaded > instants upto : > Option{val=[==>20230816195855389__clean__REQUESTED__20230816195855634]} > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.TransactionManager [] - Transaction > starting for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest > completed transaction instant Optional.empty > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.LockManager [] - LockProvider > org.apache.hudi.client.transaction.lock.InProcessLockProvider > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > 
file:/tmp/hudi/output, Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 0, > Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRING > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > file:/tmp/hudi/output, Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 1, > Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRED > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.TransactionManager [] - Transaction > started for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest > completed tra
[GitHub] [hudi] jonvex opened a new pull request, #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
jonvex opened a new pull request, #9468: URL: https://github.com/apache/hudi/pull/9468 ### Change Logs If two cleans start at nearly the same time, they will both attempt to execute the same clean instants. This does not cause any data corruption, but it will cause a writer to fail when it attempts to create the commit in the timeline, because the commit will have already been written by the first writer. Now, we check the timeline before transitioning state. ### Impact No writers will fail in this scenario now. ### Risk level (write none, low medium or high below) low ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
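The fix described in this PR ("check the timeline before transitioning state") can be sketched as a minimal model. `CleanStateTransitioner` and its method are hypothetical names, not actual Hudi classes; the essential behavior is that the second concurrent cleaner detects the already-completed instant and skips instead of failing:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of an idempotent inflight -> complete transition for a
// clean instant; stands in for re-reading the Hudi timeline under a lock.
class CleanStateTransitioner {
    private final Set<String> completedCleans = ConcurrentHashMap.newKeySet();

    /** Returns true if this writer performed the transition, false if a
     *  concurrent cleaner already completed the instant (skip, don't fail). */
    boolean transitionInflightToComplete(String cleanInstant) {
        // add() is atomic: exactly one writer "creates the completed file",
        // every other writer observes it already exists and backs off
        return completedCleans.add(cleanInstant);
    }
}
```

In the failure mode from HUDI-6718, both cleaners reached the file-creation step and the loser threw; with the check, the loser simply observes the completed instant on the reloaded timeline and returns without error.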
[jira] [Updated] (HUDI-6718) Concurrent cleaner commit same instance conflict
[ https://issues.apache.org/jira/browse/HUDI-6718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-6718: -- Description: Timeline {code:java} -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 20230816195843234.commit.requested -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 20230816195845557.commit.requested -rw-r--r-- 1 jon wheel 2.2K Aug 16 19:58 20230816195843234.inflight -rw-r--r-- 1 jon wheel 813B Aug 16 19:58 20230816195845557.inflight -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195845557.commit -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195843234.commit -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.requested -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.inflight -rw-r--r-- 1 jon wheel 1.8K Aug 16 19:58 20230816195855389.clean.requested -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean {code} requests: {code:java} avrocat hudi/output/.hoodie/20230816195855285.clean.requested {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": "20230816195654386", "action": "commit", "state": "COMPLETED"}}, "lastCompletedCommitTimestamp": "20230816195845557", "policy": "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": {"1970/01/01": [{"filePath": {"string": "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": {"array": []}} {code} {code:java} avrocat hudi/output/.hoodie/20230816195855389.clean.requested {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": "20230816195704584", "action": "commit", "state": "COMPLETED"}}, "lastCompletedCommitTimestamp": "20230816195845557", "policy": "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": {"1970/01/01": [{"filePath": 
{"string": "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, "isBootstrapBaseFile": {"boolean": false}}], "1970/01/20": [{"filePath": {"string": "file:/tmp/hudi/output/1970/01/20/05942caf-2d53-4345-845c-5e42abaca797-0_0-1454-2121_20230816195635690.parquet"}, "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": {"array": []}} {code} Console output: notice transaction starts twice for the same instance {code:java} 424775 [pool-75-thread-1] INFO org.apache.hudi.table.action.clean.CleanActionExecutor [] - Finishing previously unfinished cleaner instant=[==>20230816195855285__clean__INFLIGHT__20230816195855525] 424775 [pool-75-thread-1] INFO org.apache.hudi.table.action.clean.CleanActionExecutor [] - Using cleanerParallelism: 1 424779 [pool-91-thread-1] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline [] - Loaded instants upto : Option{val=[==>20230816195855389__clean__REQUESTED__20230816195855634]} 424779 [pool-91-thread-1] INFO org.apache.hudi.client.transaction.TransactionManager [] - Transaction starting for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest completed transaction instant Optional.empty 424779 [pool-91-thread-1] INFO org.apache.hudi.client.transaction.lock.LockManager [] - LockProvider org.apache.hudi.client.transaction.lock.InProcessLockProvider 424779 [pool-91-thread-1] INFO org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path file:/tmp/hudi/output, Lock Instance java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 0, Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRING 424779 [pool-91-thread-1] INFO org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path file:/tmp/hudi/output, Lock Instance java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 1, Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRED 424779 
[pool-91-thread-1] INFO org.apache.hudi.client.transaction.TransactionManager [] - Transaction started for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest completed transaction instant Optional.empty {code} The following pr exposed the issue [https://github.com/apache/hudi/pull/8602] This does not cause data corruption. Writer needs to be restarted was: Timeline {code:java} -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 20230816195843234.commit.requested -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 20230816195845557.commit.requested -rw-r--r-- 1 jon wheel 2.2K Aug 16 19:58 20230816195843234.inflight -rw-r--r-- 1 jon wheel 813B Aug 16 19:58 20230816195845557.inflight -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195845557.commit -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195843234.commit -rw-r-
[jira] [Created] (HUDI-6718) Concurrent cleaner commit same instance conflict
Jonathan Vexler created HUDI-6718: - Summary: Concurrent cleaner commit same instance conflict Key: HUDI-6718 URL: https://issues.apache.org/jira/browse/HUDI-6718 Project: Apache Hudi Issue Type: Bug Components: cleaning, multi-writer, table-service Reporter: Jonathan Vexler Assignee: Jonathan Vexler Timeline {code:java} -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 20230816195843234.commit.requested -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 20230816195845557.commit.requested -rw-r--r-- 1 jon wheel 2.2K Aug 16 19:58 20230816195843234.inflight -rw-r--r-- 1 jon wheel 813B Aug 16 19:58 20230816195845557.inflight -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195845557.commit -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195843234.commit -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.requested -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.inflight -rw-r--r-- 1 jon wheel 1.8K Aug 16 19:58 20230816195855389.clean.requested -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean {code} requests: {code:java} avrocat hudi/output/.hoodie/20230816195855285.clean.requested {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": "20230816195654386", "action": "commit", "state": "COMPLETED"}}, "lastCompletedCommitTimestamp": "20230816195845557", "policy": "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": {"1970/01/01": [{"filePath": {"string": "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": {"array": []}} {code} {code:java} avrocat hudi/output/.hoodie/20230816195855389.clean.requested {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": "20230816195704584", "action": "commit", "state": "COMPLETED"}}, "lastCompletedCommitTimestamp": "20230816195845557", "policy": "KEEP_LATEST_COMMITS", 
"filesToBeDeletedPerPartition": {"map": {}}, "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": {"1970/01/01": [{"filePath": {"string": "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, "isBootstrapBaseFile": {"boolean": false}}], "1970/01/20": [{"filePath": {"string": "file:/tmp/hudi/output/1970/01/20/05942caf-2d53-4345-845c-5e42abaca797-0_0-1454-2121_20230816195635690.parquet"}, "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": {"array": []}} {code} The following pr exposed the issue [https://github.com/apache/hudi/pull/8602] This does not cause data corruption. Writer needs to be restarted
[GitHub] [hudi] hudi-bot commented on pull request #9467: [HUDI-6717] Fix downgrade handler for 0.14.0
hudi-bot commented on PR #9467: URL: https://github.com/apache/hudi/pull/9467#issuecomment-1682688492 ## CI report: * 2ade66c64355778bea62ef8ef81c80b929f50b3f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19339) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Comment Edited] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755645#comment-17755645 ] Surya Prasanna Yalla edited comment on HUDI-6596 at 8/17/23 5:20 PM: - I think we should use a different name than skipLocking; if acquiring the lock is skipped because we already acquired it, then we should use a different variable, something like isLockAcquired. Without complicating the rollback logic, let us look at all the cases where we use rollback. 1. Rollback failed writes: The lock has to be held while scheduling the rollback plans for pending instantsToRollback; execution need not acquire a lock. 2. Rollback a specific instant: Only the schedule step needs to be under a lock. 3. Restore operation: The entire operation needs to be under a lock. For the rollbackFailedWrites method, break it down into two stages. *Stage 1: Scheduling stage* Step 1: Acquire lock and reload active timeline Step 2: getInstantsToRollback Step 3: removeInflightFilesAlreadyRolledBack Step 4: getPendingRollbackInfos Step 5: Use existing plan or schedule rollback Step 6: Release lock *Stage 2: Execution stage* Step 7: Check if a heartbeat exists for the pending rollback plan. If yes, abort; else start a heartbeat and proceed to execute it. Rollback operations are not that common; we only roll back if something fails, so it is not like .clean or .commit operations, and we should be OK seeing some noise. > Propose rollback implementation changes to guard against concurrent jobs > - > > Key: HUDI-6596 > URL: https://issues.apache.org/jira/browse/HUDI-6596 > Project: Apache Hudi > Issue Type: Wish >Reporter: Krishen Bhan >Priority: Trivial > > h1. Issue > The existing rollback API in 0.14 > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877] > executes a rollback plan, either taking in an existing rollback plan > provided by the caller for a previous rollback or attempt, or scheduling a > new rollback instant if none is provided. Currently it is not safe for two > concurrent jobs to call this API (when skipLocking=False and the callers > aren't already holding a lock), as this can lead to an issue where multiple > rollback requested plans are created or two jobs are executing the same > rollback instant at the same time. > h1.
Proposed change > One way to resolve this issue is to refactor this rollback function such that > if skipLocking=false, the following steps are followed > # Acquire the table lock > # Reload the active timeline > # Look at the active timeline to see if there is a inflight rollback instant > from a previous rollback attempt, if it exists then assign this is as the > rollback plan to execute. Also, check if a pending rollback plan was passed > in by caller. Then it executes the following steps depending on whether the > caller passed a pending rollback instant plan. > ## [a] If a pending inflight rollback plan was passed in by caller, then > check that there is a previous attempted rollback instant on timeline (and > that the instant times match) and continue to use this rollback plan. If that > isn't the case, then raise a rollback exception since this means another job > has concurrently already executed this plan. Note that in a valid HUDI > dataset there can be at most one
[jira] [Commented] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755645#comment-17755645 ] Surya Prasanna Yalla commented on HUDI-6596: I think we should use a different name than skipLocking; if acquiring the lock is skipped because we already acquired it, then we should use a different variable, something like isLockAcquired. Without complicating the rollback logic, let us look at all the cases where we use rollback. 1. Rollback failed writes: The lock has to be held while scheduling the rollback plans for pending instantsToRollback; execution need not acquire a lock. 2. Rollback a specific instant: Only the schedule step needs to be under a lock. 3. Restore operation: The entire operation needs to be under a lock. For the rollbackFailedWrites method, break it down into two stages. *Stage 1: Scheduling stage* Step 1: Acquire lock and reload active timeline Step 2: getInstantsToRollback Step 3: removeInflightFilesAlreadyRolledBack Step 4: getPendingRollbackInfos Step 5: Use existing plan or schedule rollback Step 6: Release lock *Stage 2: Execution stage* Step 7: Check if a heartbeat exists for the pending rollback plan. If yes, abort; else start a heartbeat and proceed to execute it. Rollback operations are not that common; we only roll back if something fails, so it is not like .clean or .commit operations, and we should be OK seeing some noise. > Propose rollback implementation changes to guard against concurrent jobs > - > > Key: HUDI-6596 > URL: https://issues.apache.org/jira/browse/HUDI-6596 > Project: Apache Hudi > Issue Type: Wish >Reporter: Krishen Bhan >Priority: Trivial > > h1.
Issue > The existing rollback API in 0.14 > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877] > executes a rollback plan, either taking in an existing rollback plan > provided by the caller for a previous rollback or attempt, or scheduling a > new rollback instant if none is provided. Currently it is not safe for two > concurrent jobs to call this API (when skipLocking=False and the callers > aren't already holding a lock), as this can lead to an issue where multiple > rollback requested plans are created or two jobs are executing the same > rollback instant at the same time. > h1. Proposed change > One way to resolve this issue is to refactor this rollback function such that > if skipLocking=false, the following steps are followed > # Acquire the table lock > # Reload the active timeline > # Look at the active timeline to see if there is a inflight rollback instant > from a previous rollback attempt, if it exists then assign this is as the > rollback plan to execute. Also, check if a pending rollback plan was passed > in by caller. Then it executes the following steps depending on whether the > caller passed a pending rollback instant plan. > ## [a] If a pending inflight rollback plan was passed in by caller, then > check that there is a previous attempted rollback instant on timeline (and > that the instant times match) and continue to use this rollback plan. If that > isn't the case, then raise a rollback exception since this means another job > has concurrently already executed this plan. Note that in a valid HUDI > dataset there can be at most one rollback instant for a corresponding commit > instant, which is why if we no longer see a pending rollback in timeline in > this phase we can safely assume that it had already been executed to > completion. 
> ## [b] If no pending inflight rollback plan was passed in by caller and no > pending rollback instant was found in timeline earlier, then schedule a new > rollback plan > # Now that a rollback plan and requested rollback instant time has been > assigned, check for an active heartbeat for the rollback instant time. If > there is one, then abort the rollback as that means there is a concurrent job > executing that rollback. If not, then start a heartbeat for that rollback > instant time. > # Release the table lock > # Execute the rollback plan and complete the rollback instant. Regardless of > whether this succeeds or fails with an exception, close the heartbeat. This > increases the chance that the next job that tries to call this rollback API > will follow through with the rollback and not abort due to an active previous > heartbeat > > * These steps will only be enforced for skipLocking=false, since if > skipLocking=true then that means the caller may already be explicitly holding > a table lock. In this case, acquiring the lock again in step (1) will fail. > * Acquiring a lock and reloading timelin
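The two-stage scheme discussed above (scheduling under the table lock, execution guarded by a heartbeat) can be sketched as follows. The lock, heartbeat map, and method names are illustrative stand-ins, not Hudi's actual client API:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class RollbackGuardDemo {
    private final ReentrantLock tableLock = new ReentrantLock();
    // rollback instant time -> id of the job holding an active heartbeat
    private final Map<String, String> heartbeats = new ConcurrentHashMap<>();

    // Stage 1: reuse an existing plan or schedule a new one, under the table lock.
    public String scheduleRollback(Optional<String> pendingInstant) {
        tableLock.lock();
        try {
            // A real client would reload the timeline here and look for an
            // inflight rollback instant from a previous attempt.
            return pendingInstant.orElse("rollback-" + System.nanoTime());
        } finally {
            tableLock.unlock();
        }
    }

    // Stage 2: execute only if no other job holds a heartbeat for this instant.
    public boolean executeRollback(String instant, String jobId) {
        if (heartbeats.putIfAbsent(instant, jobId) != null) {
            return false; // another job is already executing this rollback; abort
        }
        try {
            // ... execute the rollback plan and complete the instant ...
            return true;
        } finally {
            // Close the heartbeat on success or failure so the next caller
            // is not blocked by a stale heartbeat.
            heartbeats.remove(instant);
        }
    }

    public static void main(String[] args) {
        RollbackGuardDemo demo = new RollbackGuardDemo();
        String instant = demo.scheduleRollback(Optional.empty());
        System.out.println(instant + " executed: " + demo.executeRollback(instant, "job-a"));
    }
}
```

The key property is that only the short scheduling phase holds the table lock, while the long-running execution phase relies on the atomic `putIfAbsent` heartbeat claim.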
[jira] [Updated] (HUDI-6701) Explore use of UUID-6/7 as a replacement for current auto generated keys
[ https://issues.apache.org/jira/browse/HUDI-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-6701: -- Status: In Progress (was: Open) > Explore use of UUID-6/7 as a replacement for current auto generated keys > > > Key: HUDI-6701 > URL: https://issues.apache.org/jira/browse/HUDI-6701 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > Today, we auto generate string keys of the form > (HoodieRecord#generateSequenceId), which are highly compressible, especially compared > to UUIDv1, when stored as a string column inside a parquet file. > {code:java} > public static String generateSequenceId(String instantTime, int > partitionId, long recordIndex) { > return instantTime + "_" + partitionId + "_" + recordIndex; > } > {code} > As part of this task, we'd love to understand: > - Can UUIDv6 or v7 provide a similar compressed storage footprint when written > as a column in a parquet file? > - Can the current format be represented as a 160-bit number, i.e. 2 longs and 1 > int, in storage? Would that save us further in storage costs? > (An orthogonal consideration is the memory needed to hold the key string, which > can be higher than 160 bits. We can discuss this later, once we understand the > storage footprint.) > > Resources: > * https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/09/ > * https://github.com/uuid6/uuid6-ietf-draft > * https://github.com/uuid6/prototypes
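To make the ticket's second question concrete, here is a rough sketch of packing the current `instantTime_partitionId_recordIndex` key into fixed-width integers. The layout is purely hypothetical: it assumes instant times are numeric `yyyyMMddHHmmssSSS` strings (17 decimal digits, which fit in a signed 64-bit long), and it stores the three fields in a `long[]` for brevity where a real 160-bit layout would use 2 longs plus 1 int:

```java
public class KeyPacking {
    // Current format, mirroring HoodieRecord#generateSequenceId from the ticket.
    public static String generateSequenceId(String instantTime, int partitionId, long recordIndex) {
        return instantTime + "_" + partitionId + "_" + recordIndex;
    }

    // Hypothetical packing: the 17-digit instant time fits in a signed long
    // (10^17 < 2^63), plus a long record index and an int partition id.
    public static long[] pack(String instantTime, int partitionId, long recordIndex) {
        return new long[] {Long.parseLong(instantTime), recordIndex, partitionId};
    }

    // Lossless round trip back to the string form.
    public static String unpack(long[] packed) {
        return generateSequenceId(Long.toString(packed[0]), (int) packed[2], packed[1]);
    }

    public static void main(String[] args) {
        long[] packed = pack("20230816195843234", 0, 42);
        System.out.println(unpack(packed)); // 20230816195843234_0_42
    }
}
```

Whether such a fixed-width encoding actually beats dictionary/RLE-compressed strings in parquet is exactly what the ticket proposes to measure.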
[GitHub] [hudi] hudi-bot commented on pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
hudi-bot commented on PR #9459: URL: https://github.com/apache/hudi/pull/9459#issuecomment-1682500472 ## CI report: * 768e40ce1d035a021d88e5409f92bab846e4e4c0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19333) * 170678f0e7c429406a4565d85e77367908c1fb4b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19340)
[GitHub] [hudi] hudi-bot commented on pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
hudi-bot commented on PR #9459: URL: https://github.com/apache/hudi/pull/9459#issuecomment-1682486781 ## CI report: * 768e40ce1d035a021d88e5409f92bab846e4e4c0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19333) * 170678f0e7c429406a4565d85e77367908c1fb4b UNKNOWN
[GitHub] [hudi] linliu-code commented on pull request #9466: [HUDI-4756] Remove unused config "hoodie.assume.date.partitioning"
linliu-code commented on PR #9466: URL: https://github.com/apache/hudi/pull/9466#issuecomment-1682464922 > What do you mean for unused? This is from the task. I don't have enough context to confirm this. @nsivabalan Can you explain this?
[GitHub] [hudi] haitham-eltaweel commented on issue #9460: Not valid month error when pulling new data from Oracle DB using HoodieDeltaStreamer
haitham-eltaweel commented on issue #9460: URL: https://github.com/apache/hudi/issues/9460#issuecomment-1682454495 > @haitham-eltaweel What date format is `MODIFID_DT` present in oracle? what is the datatype? It is timestamp type.
[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql
hudi-bot commented on PR #9408: URL: https://github.com/apache/hudi/pull/9408#issuecomment-1682372687 ## CI report: * fadda82b0444d09d8718bc9002fbd1964e18bbf2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19332) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19338)
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6242: - Description: This EPIC tracks changes to the Hudi storage format. Format change is anything that changes any bits related to - *Timeline* : active or archived timeline contents, file names. - {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names. - {*}Log Files{*}: Block structure, content, names. - {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings. - {*}Table properties{*}: What's written to hoodie.properties. - *Marker files* : Can be left to the writer implementation. h2. Change summary: The following functionality should be supportable by the new format tech specs (at a minimum) Flexibility : - Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...) - Easy integration of metadata for JVM and non-jvm clients Metafields : - Should _recordkey be uuid special handling? Additional Info: - Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data. - Position based skipping of base file - Additional metadata to avoid more RPCs to scan base file/log blocks. - ML/Column family use-case? - Support having changeset of columns in each write, other headers Log : - Support writing updates as deletes and inserts, instead of logging as update to base file. - CDC format is GA. Table organization: - Support different logical partitions on the same data - Storage of table spread across buckets/root folders - Decouple table location from timeline, metadata. They can all be in different places Concurrency/Timeline: - Ability to support general purpose multi-table transactions, esp between data and metadata tables. 
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts. - Support for long lived instants in timeline, break down distinction between active/archived - Support checking of uniqueness constraints, even in face of two concurrent insert transactions. - Support precise time-travel queries - Support time-travel writes. - Support schema history tracking and aid in schema evol impl. - TrueTime store/support for instant times - No more separate rollback action. make it a new state. Metadata table : - Encode filegroup ID and commit time along with file metadata Table Properties: - Partitioning information/indexing info
[GitHub] [hudi] hudi-bot commented on pull request #9467: [HUDI-6717] Fix downgrade handler for 0.14.0
hudi-bot commented on PR #9467: URL: https://github.com/apache/hudi/pull/9467#issuecomment-1682297929 ## CI report: * 2ade66c64355778bea62ef8ef81c80b929f50b3f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19339)
[GitHub] [hudi] hudi-bot commented on pull request #9466: [HUDI-4756] Remove unused config "hoodie.assume.date.partitioning"
hudi-bot commented on PR #9466: URL: https://github.com/apache/hudi/pull/9466#issuecomment-1682297848 ## CI report: * d61eae7b243d92629914d2b95637922db6be3b08 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19337)
[GitHub] [hudi] ad1happy2go commented on issue #9319: [SUPPORT] how to use HiveSyncConfig instead of hive configs in DataSourceWriteOptions object
ad1happy2go commented on issue #9319: URL: https://github.com/apache/hudi/issues/9319#issuecomment-1682287275 @zlinsc Yes, we were standardising configs for the new release, and HoodieSyncConfig should be used for all meta-sync-related configuration.
[GitHub] [hudi] hudi-bot commented on pull request #9467: [HUDI-6717] Fix downgrade handler for 0.14.0
hudi-bot commented on PR #9467: URL: https://github.com/apache/hudi/pull/9467#issuecomment-1682284936 ## CI report: * 2ade66c64355778bea62ef8ef81c80b929f50b3f UNKNOWN
[GitHub] [hudi] lokeshj1703 commented on pull request #9467: [HUDI-6717] Fix downgrade handler for 0.14.0
lokeshj1703 commented on PR #9467: URL: https://github.com/apache/hudi/pull/9467#issuecomment-1682243995 @nsivabalan @codope Please review.
[jira] [Updated] (HUDI-6717) Fix downgrade handler for 0.14.0
[ https://issues.apache.org/jira/browse/HUDI-6717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6717: - Labels: pull-request-available (was: ) > Fix downgrade handler for 0.14.0 > > > Key: HUDI-6717 > URL: https://issues.apache.org/jira/browse/HUDI-6717 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > > Since the log block version (due to the delete block change) has been upgraded in > 0.14.0, delete blocks cannot be read by 0.13.0 or earlier. > Similarly, the addition of the record level index field in the metadata table leads to > a column-drop error on downgrade. This Jira aims to fix the downgrade handler to > trigger compaction and delete the metadata table if the user wishes to downgrade from > version 6 (0.14.0) to version 5 (0.13.0).
[jira] [Created] (HUDI-6717) Fix downgrade handler for 0.14.0
Lokesh Jain created HUDI-6717: - Summary: Fix downgrade handler for 0.14.0 Key: HUDI-6717 URL: https://issues.apache.org/jira/browse/HUDI-6717 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain Assignee: Lokesh Jain Since the log block version (due to the delete block change) has been upgraded in 0.14.0, delete blocks cannot be read by 0.13.0 or earlier. Similarly, the addition of the record level index field in the metadata table leads to a column-drop error on downgrade. This Jira aims to fix the downgrade handler to trigger compaction and delete the metadata table if the user wishes to downgrade from version 6 (0.14.0) to version 5 (0.13.0).
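The downgrade flow described in the ticket (compact away log blocks written in the new format, then drop the metadata table, before lowering the table version) can be sketched as below. The `Table` interface and all method names are illustrative only, not Hudi's actual downgrade handler API:

```java
import java.util.ArrayList;
import java.util.List;

public class SixToFiveDowngradeDemo {
    // Minimal stand-in for the operations a downgrade handler needs.
    interface Table {
        boolean hasPendingLogFiles();
        void runFullCompaction();   // rewrites new-format log blocks into base files
        void deleteMetadataTable(); // removes the MDT with the record-level index field
        void setTableVersion(int version);
    }

    // Order matters: compaction must happen before the version is lowered so
    // that a 0.13.0 reader never sees a delete block it cannot parse.
    public static void downgrade(Table table) {
        if (table.hasPendingLogFiles()) {
            table.runFullCompaction();
        }
        table.deleteMetadataTable();
        table.setTableVersion(5);
    }

    public static void main(String[] args) {
        List<String> ops = new ArrayList<>();
        downgrade(new Table() {
            public boolean hasPendingLogFiles() { return true; }
            public void runFullCompaction() { ops.add("compact"); }
            public void deleteMetadataTable() { ops.add("delete-mdt"); }
            public void setTableVersion(int v) { ops.add("version-" + v); }
        });
        System.out.println(ops);
    }
}
```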
[jira] [Assigned] (HUDI-4631) Enhance retries for failed writes w/ write conflicts in a multi writer scenarios
[ https://issues.apache.org/jira/browse/HUDI-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-4631: - Assignee: Sagar Sumit (was: sivabalan narayanan) > Enhance retries for failed writes w/ write conflicts in a multi writer > scenarios > > > Key: HUDI-4631 > URL: https://issues.apache.org/jira/browse/HUDI-4631 > Project: Apache Hudi > Issue Type: Improvement > Components: multi-writer >Reporter: sivabalan narayanan >Assignee: Sagar Sumit >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > > Let's say there are two writers from t0 to t5, and Hudi fails w2 and succeeds with > w1. The user restarts w2, and for the next 5 minutes there are no other > overlapping writers, so the same write from w2 will now succeed. So, whenever > there is a write conflict and the pipeline fails, all the user needs to do is > restart the pipeline or retry ingesting the same batch. > > Ask: can we add retries within Hudi during such failures? In most > cases, users just restart the pipeline anyway. >
[jira] [Updated] (HUDI-4631) Enhance retries for failed writes w/ write conflicts in a multi writer scenarios
[ https://issues.apache.org/jira/browse/HUDI-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-4631: -- Fix Version/s: 1.0.0 > Enhance retries for failed writes w/ write conflicts in a multi writer > scenarios > > > Key: HUDI-4631 > URL: https://issues.apache.org/jira/browse/HUDI-4631 > Project: Apache Hudi > Issue Type: Improvement > Components: multi-writer >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > > Let's say there are two writers from t0 to t5, and Hudi fails w2 and succeeds with > w1. The user restarts w2, and for the next 5 minutes there are no other > overlapping writers, so the same write from w2 will now succeed. So, whenever > there is a write conflict and the pipeline fails, all the user needs to do is > restart the pipeline or retry ingesting the same batch. > > Ask: can we add retries within Hudi during such failures? In most > cases, users just restart the pipeline anyway. >
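The retry the ticket asks for could look roughly like the bounded retry loop below, which automates what users do manually today (restart and re-ingest the same batch). `WriteConflictException`, `Commit`, and `commitWithRetries` are hypothetical names, not Hudi's API:

```java
public class ConflictRetryDemo {
    // Stand-in for the error a writer sees when it loses the conflict
    // resolution check; the name is illustrative.
    static class WriteConflictException extends RuntimeException {}

    @FunctionalInterface
    interface Commit {
        void run();
    }

    // Retry the same batch a bounded number of times with exponential backoff.
    public static boolean commitWithRetries(Commit commit, int maxRetries, long baseBackoffMs) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                commit.run();
                return true; // committed
            } catch (WriteConflictException e) {
                if (attempt == maxRetries) {
                    return false; // give up after the last retry
                }
                try {
                    // Backoff doubles per attempt so retries spread out past
                    // short-lived overlapping writers.
                    Thread.sleep(baseBackoffMs * (1L << attempt));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] conflicts = {2}; // fail the first two attempts, succeed on the third
        boolean ok = commitWithRetries(() -> {
            if (conflicts[0]-- > 0) {
                throw new WriteConflictException();
            }
        }, 3, 10L);
        System.out.println("committed: " + ok);
    }
}
```

Since the ticket notes the conflicting writer usually succeeds on a plain restart, even a small retry budget like this would absorb most transient conflicts.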