[jira] [Updated] (HUDI-1623) Support start_commit_time & end_commit_times for serializable incremental pull
[ https://issues.apache.org/jira/browse/HUDI-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-1623: - Reviewers: Vinoth Chandar > Support start_commit_time & end_commit_times for serializable incremental pull > -- > > Key: HUDI-1623 > URL: https://issues.apache.org/jira/browse/HUDI-1623 > Project: Apache Hudi > Issue Type: Improvement > Components: Common Core >Reporter: Nishith Agarwal >Assignee: Danny Chen >Priority: Critical > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] codope commented on a diff in pull request #8233: [HUDI-5956] Simple repair spark sql dag ui display problem
codope commented on code in PR #8233: URL: https://github.com/apache/hudi/pull/8233#discussion_r1298071684 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala: ## @@ -123,6 +126,24 @@ object HoodieSparkSqlWriter { streamingWritesParamsOpt: Option[StreamingWriteParams] = Option.empty, hoodieWriteClient: Option[SparkRDDWriteClient[_]] = Option.empty): (Boolean, HOption[String], HOption[String], HOption[String], SparkRDDWriteClient[_], HoodieTableConfig) = { +//TODO reuse DataWritingCommand sparkPlan, reduce the number of sql list in SPARK UI SQL tag, rendering raw DAG Review Comment: Will it incur some overhead if we don't reuse? Why not complete the TODO in this PR itself? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9473: [HUDI-6724] - Defaulting previous Instant time to init time to enable full read of initial commit
hudi-bot commented on PR #9473: URL: https://github.com/apache/hudi/pull/9473#issuecomment-1683412312 ## CI report: * ccdd0648b943bb2f5c3325c69887f4d9d4d7a117 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9473: [HUDI-6724] - Defaulting previous Instant time to init time to enable full read of initial commit
hudi-bot commented on PR #9473: URL: https://github.com/apache/hudi/pull/9473#issuecomment-1683419250 ## CI report: * ccdd0648b943bb2f5c3325c69887f4d9d4d7a117 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19347)
[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
hudi-bot commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1683419217 ## CI report: * dba536eaf3fbc3cade137d7c9d24c705e8263ad9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19346) * 13a06dff7a03d861232980b79baf924e31d55ff7 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
hudi-bot commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1683379797 ## CI report: * dba536eaf3fbc3cade137d7c9d24c705e8263ad9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19346)
[GitHub] [hudi] lokesh-lingarajan-0310 opened a new pull request, #9473: [HUDI-6724] - Defaulting previous Instant time to init time to enable full read of initial commit
lokesh-lingarajan-0310 opened a new pull request, #9473: URL: https://github.com/apache/hudi/pull/9473 This will happen during new onboarding: the old code initializes prev = start = first-commit time, and the incremental read that follows always fetches entries > prev, in which case we skip part of the first commit during processing. ### Change Logs Initialize prevInstance of the commit to the default 000 to avoid skipping parts of the first commit. ### Impact Medium ### Risk level (write none, low medium or high below) Medium ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Change Logs and Impact were stated clearly - [x] Adequate tests were added if applicable - [x] CI passed
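The first-commit skip described above can be illustrated with a small sketch. This is plain Python, not Hudi code; the function name and the "000" sentinel are stand-ins for the behavior the PR describes:

```python
# Sketch of the bug described above (illustrative only, not Hudi code).
# An incremental read fetches entries strictly newer than prevInstance.
def incremental_read(commit_times, prev_instant):
    return [t for t in commit_times if t > prev_instant]

INIT_INSTANT_TS = "000"  # sentinel that sorts before any real commit time
commits = ["20230818010101", "20230818020202"]

# Old default: prevInstance = startInstance = first commit time,
# so batches belonging to the first commit are skipped.
old = incremental_read(commits, commits[0])

# Fixed default: prevInstance = INIT_INSTANT_TS, so the first
# commit is read in full.
fixed = incremental_read(commits, INIT_INSTANT_TS)

assert old == ["20230818020202"]   # first commit missing
assert fixed == commits            # full read of the initial commit
```

Because Hudi instant times are sortable timestamp strings, any sentinel that compares lower than every real commit time makes the exclusive `> prev` filter include the first commit.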
[GitHub] [hudi] hudi-bot commented on pull request #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
hudi-bot commented on PR #9472: URL: https://github.com/apache/hudi/pull/9472#issuecomment-1683373888 ## CI report: * dba536eaf3fbc3cade137d7c9d24c705e8263ad9 UNKNOWN
[GitHub] [hudi] jiangzzwy opened a new issue, #9474: ClassNotFoundException: MergeOnReadInputSplit
jiangzzwy opened a new issue, #9474: URL: https://github.com/apache/hudi/issues/9474

### Environment
- Flink: 1.17.1
- Hudi: 0.14.0-rc1
- Hadoop: 3.2.2

### init.sql script

```sql
SET 'state.checkpoints.dir' = 'hdfs:///hudi/checkpoints/';
SET 'execution.checkpointing.interval' = '20s';
SET 'execution.checkpointing.min-pause' = '5s';
SET 'execution.checkpointing.max-concurrent-checkpoints' = '1';
add jar '/export/server/flink-1.17.1/hudi-flink1.17-bundle-0.14.0-rc1.jar';
create table t_hudi_user(
  id BIGINT,
  name STRING,
  age INT,
  sex BOOLEAN,
  city STRING,
  birth timestamp(3)
) PARTITIONED BY (birth) WITH (
  'connector' = 'hudi',
  'hoodie.datasource.write.recordkey.field' = 'id',
  'path' = 'hdfs://CentOS:9000/hudi/t_hudi_user',
  'table.type' = 'MERGE_ON_READ',
  'compaction.trigger.strategy' = 'num_or_time',
  'compaction.delta_commits' = '3',
  'compaction.delta_seconds' = '300',
  'hoodie.datasource.write.hive_style_partitioning' = 'true',
  'write.datetime.partitioning' = 'true',
  'write.partition.format' = 'yyyy-MM-dd',
  'hive_sync.assume_date_partitioning' = 'true',
  'hive_sync.mode' = 'hms',
  'write.precombine.field' = 'birth',
  'changelog.enabled' = 'true',
  'read.streaming.enabled' = 'true',
  'read.streaming.check-interval' = '3',
  'compaction.tasks' = '2',
  'hive_sync.enable' = 'true',
  'hive_sync.table' = 't_hudi_user',
  'hive_sync.db' = 'default',
  'hive_sync.metastore.uris' = 'thrift://192.168.42.129:9083',
  'hoodie.datasource.hive_sync.support_timestamp' = 'true'
);
```

When I execute the query command shown in the screenshots below, the console terminal raises the error `java.lang.ClassNotFoundException: org.apache.hudi.table.format.mor.MergeOnReadInputSplit`. I'm sure the `MergeOnReadInputSplit` class is already compiled into the `hudi-flink1.17-bundle-0.14.0-rc1.jar` jar file.

![image](https://github.com/apache/hudi/assets/23492991/33062891-da14-40cd-b591-2c37575a129f)
![image](https://github.com/apache/hudi/assets/23492991/d00114af-aaae-4212-a7b1-a96d64a940f6)

But inserting is okay, querying is not, which makes me feel very strange!!
![image](https://github.com/apache/hudi/assets/23492991/f5d506ae-47d9-4cef-b5dc-a4d55c68c0d5)

I tried the lower Flink version 1.14.x, which does not have this problem.
[jira] [Assigned] (HUDI-3625) [RFC-60] Optimized storage layout for cloud object stores
[ https://issues.apache.org/jira/browse/HUDI-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shawn Chang reassigned HUDI-3625: - Assignee: Shawn Chang (was: Udit Mehrotra) > [RFC-60] Optimized storage layout for cloud object stores > - > > Key: HUDI-3625 > URL: https://issues.apache.org/jira/browse/HUDI-3625 > Project: Apache Hudi > Issue Type: Epic > Components: core >Reporter: Udit Mehrotra >Assignee: Shawn Chang >Priority: Major > Labels: hudi-umbrellas, pull-request-available > Fix For: 1.0.0 > > > Amazon S3, among other cloud object stores, throttles requests based on object > prefix => > [https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/]. > Hudi follows the traditional Hive storage layout, with files being stored > under separate partition paths under a common table path/prefix. This > introduces the potential for throttling because of request limits being > reached for the common table path/prefix, when writing a significant number of > files concurrently. > We propose implementing an alternate storage layout that would be more > suitable for cloud object stores like S3 to avoid running into throttling > issues as the data scales. At a high level, we need to be able to distribute > data files evenly across randomly generated prefixes, so that request limits > get distributed across those prefixes, instead of a single table prefix. -- This message was sent by Atlassian Jira (v8.20.10#820010)
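The prefix-spreading idea can be sketched as follows. This is illustrative Python, not the RFC-60 implementation; the path scheme, hash choice, and bucket count are invented for the example:

```python
# Sketch of the randomized-prefix idea (not the actual RFC-60 layout):
# derive a stable pseudo-random bucket from the file name so writes are
# spread across many object-store prefixes instead of one table prefix.
import hashlib

def prefixed_path(table_root, partition, file_name, num_prefixes=16):
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % num_prefixes
    # A Hive-style layout would be f"{table_root}/{partition}/{file_name}";
    # here the bucket prefix comes first, so per-prefix request limits are
    # spread across up to num_prefixes distinct prefixes.
    return f"{table_root}/{bucket:02x}/{partition}/{file_name}"

path = prefixed_path("s3://bucket/table", "dt=2023-08-18", "file-a.parquet")
# The mapping is deterministic, so readers can recompute a file's prefix
# without a lookup table.
assert path == prefixed_path("s3://bucket/table", "dt=2023-08-18", "file-a.parquet")
```

The key property is that the bucket is a deterministic function of the file name, so the layout stays reconstructible while write traffic is spread across prefixes.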
[jira] [Updated] (HUDI-6724) Initializing prevInstance to HoodieTimeline.INIT_INSTANT_TS to avoid partial reading of first commit
[ https://issues.apache.org/jira/browse/HUDI-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6724: - Labels: pull-request-available (was: ) > Initializing prevInstance to HoodieTimeline.INIT_INSTANT_TS to avoid partial > reading of first commit > > > Key: HUDI-6724 > URL: https://issues.apache.org/jira/browse/HUDI-6724 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Lingarajan >Priority: Major > Labels: pull-request-available > > Since object-based incr jobs now have batching within the commit, we can > end up in a situation for the first commit where prevInstance is the same as > startInstance according to the existing code for batches within the first commit. > In this scenario, when we incrementally query rows > prevInstance, we will skip > the first commit as startInstance is also pointing to the same commit. > This is due to defaulting prevInstance to startInstance in the > generateQueryInfo API. > The fix is to have this default to HoodieTimeline.INIT_INSTANT_TS so batching can > continue on the first commit. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6724) Initializing prevInstance to HoodieTimeline.INIT_INSTANT_TS to avoid partial reading of first commit
Lokesh Lingarajan created HUDI-6724: --- Summary: Initializing prevInstance to HoodieTimeline.INIT_INSTANT_TS to avoid partial reading of first commit Key: HUDI-6724 URL: https://issues.apache.org/jira/browse/HUDI-6724 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Lingarajan Since object-based incr jobs now have batching within the commit, we can end up in a situation for the first commit where prevInstance is the same as startInstance according to the existing code for batches within the first commit. In this scenario, when we incrementally query rows > prevInstance, we will skip the first commit as startInstance is also pointing to the same commit. This is due to defaulting prevInstance to startInstance in the generateQueryInfo API. The fix is to have this default to HoodieTimeline.INIT_INSTANT_TS so batching can continue on the first commit. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6723) Prototype and benchmark event-time based in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6723: Fix Version/s: 1.0.0 > Prototype and benchmark event-time based in MOR log merging > --- > > Key: HUDI-6723 > URL: https://issues.apache.org/jira/browse/HUDI-6723 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6723) Prototype and benchmark event-time based in MOR log merging
Ethan Guo created HUDI-6723: --- Summary: Prototype and benchmark event-time based in MOR log merging Key: HUDI-6723 URL: https://issues.apache.org/jira/browse/HUDI-6723 Project: Apache Hudi Issue Type: New Feature Reporter: Ethan Guo -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6723) Prototype and benchmark event-time based in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6723: --- Assignee: Ethan Guo > Prototype and benchmark event-time based in MOR log merging > --- > > Key: HUDI-6723 > URL: https://issues.apache.org/jira/browse/HUDI-6723 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6719) Fix data inconsistency issues caused by concurrent clustering and delete partition.
[ https://issues.apache.org/jira/browse/HUDI-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6719: - Labels: pull-request-available (was: ) > Fix data inconsistency issues caused by concurrent clustering and delete > partition. > --- > > Key: HUDI-6719 > URL: https://issues.apache.org/jira/browse/HUDI-6719 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ma Jian >Priority: Major > Labels: pull-request-available > > Related issue: https://issues.apache.org/jira/browse/HUDI-5553 > The specific problem is that when concurrent replace commit operations are > executed, two replace commits may point to the same file ID, resulting in a > duplicate key error. The existing issue solved the problem of scheduling > delete partition while there are pending clustering or compaction operations, > which will be prevented in this case. However, this solution is not perfect > and may still cause data inconsistency if a clustering plan is scheduled > before the delete partition is committed, because the validation is one-way. In > this case, both replace commits will still contain duplicate keys, and the > table will become inconsistent when both plans are committed. This is > critical, and there are other similar scenarios that may bypass the validation > of the existing issue. Moreover, the existing issue is at the partition level > and is not precise enough. > Here is my solution: > !https://intranetproxy.alipay.com/skylark/lark/0/2023/png/62256341/1692328998008-f9dc6530-e44e-43e7-9b75-d760b55b3dfa.png|width=335,id=WXCCX! > As shown in the figure, both drop partition and clustering will go through a > period of time during which they are not registered to the timeline, which is the scenario > that the previous issue did not solve.
Here, I register the replace file IDs > involved in each replace commit to the active timeline (the replace commit > timeline that has been submitted has saved partitionToReplaceFileIds, and > only pending requests need to be processed). Since in the case of Spark SQL, > delete partition creates a requested commit in advance during write, which is > inconvenient to handle, I save the pending replace commit's > partitionToReplaceFileIds information to the inflight commit's extra > metadata. Therefore, each time drop partition or clustering is executed, it > only needs to read the partitionToReplaceFileIds information in the timeline > after ensuring that the inflight commit information has been saved to the > timeline to ensure that there are no duplicate file IDs and prevent this kind > of error from occurring. > In simple terms, each replace commit will register the replace file ID > information to the timeline whether it is submitted or not, at the same time, > each submission will check this information to ensure that it will not be > repeated, so that any replace commit containing this file ID will be > prevented, ensuring that there are no duplicate keys. > When this idea is also implemented on the compaction commit, the modification > involved in the related issue can be removed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
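The duplicate-file-ID guard described above can be sketched as follows. This is illustrative Python, not the actual Hudi timeline API; the class and method names are invented for the example:

```python
# Sketch of the conflict check described above (illustrative, not Hudi
# code): every replace commit, pending or completed, registers the file
# IDs it replaces, and a new replace plan is rejected if it targets a
# file ID already claimed by another replace commit.
class ReplaceCommitRegistry:
    def __init__(self):
        self._claimed = {}  # instant -> set of replaced file IDs

    def register(self, instant, file_ids):
        file_ids = set(file_ids)
        for other, ids in self._claimed.items():
            overlap = ids & file_ids
            if overlap:
                raise ValueError(
                    f"{instant} targets file IDs {sorted(overlap)} "
                    f"already claimed by {other}")
        self._claimed[instant] = file_ids

registry = ReplaceCommitRegistry()
registry.register("t1.clustering", ["fg-1", "fg-2"])  # pending clustering plan
try:
    # concurrent delete-partition plan touching an overlapping file group
    registry.register("t2.deletepartition", ["fg-2", "fg-3"])
    rejected = False
except ValueError:
    rejected = True
assert rejected  # the overlapping replace plan is rejected
```

Because the check runs symmetrically at registration time for every replace commit, it closes the one-way-validation gap the description calls out: whichever of the two conflicting plans registers second is the one rejected.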
[jira] [Updated] (HUDI-6723) Prototype and benchmark event-time based in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6723: Status: In Progress (was: Open) > Prototype and benchmark event-time based in MOR log merging > --- > > Key: HUDI-6723 > URL: https://issues.apache.org/jira/browse/HUDI-6723 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] majian1998 opened a new pull request, #9472: [HUDI-6719]Fix data inconsistency issues caused by concurrent clustering and delete partition.
majian1998 opened a new pull request, #9472: URL: https://github.com/apache/hudi/pull/9472 ### Change Logs Implemented a solution to prevent duplicate key errors in concurrent replace commit operations. Registered the replace file ID information to the timeline for each replace commit, whether it is submitted or not. Saved the pending replace commit's partitionToReplaceFileIds information to the inflight commit's extra metadata. Updated drop partition and clustering operations to read the partitionToReplaceFileIds information in the timeline to ensure no duplicate file IDs. Removed the modification involved in the related issue for compaction commit. ### Impact No public API or user-facing feature changes. ### Risk level (write none, low medium or high below) low ### Documentation Update Related issue: https://issues.apache.org/jira/browse/HUDI-5553 The specific problem is that when concurrent replace commit operations are executed, two replace commits may point to the same file ID, resulting in a duplicate key error. The existing issue solved the problem of scheduling delete partition while there are pending clustering or compaction operations, which will be prevented in this case. However, this solution is not perfect and may still cause data inconsistency if a clustering plan is scheduled before the delete partition is committed, because the validation is one-way. In this case, both replace commits will still contain duplicate keys, and the table will become inconsistent when both plans are committed. This is critical, and there are other similar scenarios that may bypass the validation of the existing issue. Moreover, the existing issue is at the partition level and is not precise enough.
Here is my solution: ![image](https://github.com/apache/hudi/assets/47964462/6d8a3134-96a5-45ec-8ed0-ed2776b7ed24) As shown in the figure, both drop partition and clustering will go through a period of time during which they are not registered to the timeline, which is the scenario that the previous issue did not solve. Here, I register the replace file IDs involved in each replace commit to the active timeline (the replace commit timeline that has been submitted has saved partitionToReplaceFileIds, and only pending requests need to be processed). Since in the case of Spark SQL, delete partition creates a requested commit in advance during write, which is inconvenient to handle, I save the pending replace commit's partitionToReplaceFileIds information to the inflight commit's extra metadata. Therefore, each time drop partition or clustering is executed, it only needs to read the partitionToReplaceFileIds information in the timeline after ensuring that the inflight commit information has been saved to the timeline, to ensure that there are no duplicate file IDs and prevent this kind of error from occurring. In simple terms, each replace commit will register the replace file ID information to the timeline whether it is submitted or not; at the same time, each submission will check this information to ensure that it will not be repeated, so that any replace commit containing this file ID will be prevented, ensuring that there are no duplicate keys. When this idea is also implemented on the compaction commit, the modification involved in the related issue can be removed. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Updated] (HUDI-6539) New LSM tree style archived timeline
[ https://issues.apache.org/jira/browse/HUDI-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6539: - Status: Patch Available (was: In Progress) > New LSM tree style archived timeline > > > Key: HUDI-6539 > URL: https://issues.apache.org/jira/browse/HUDI-6539 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6539) New LSM tree style archived timeline
[ https://issues.apache.org/jira/browse/HUDI-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6539: - Reviewers: Vinoth Chandar > New LSM tree style archived timeline > > > Key: HUDI-6539 > URL: https://issues.apache.org/jira/browse/HUDI-6539 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6539) New LSM tree style archived timeline
[ https://issues.apache.org/jira/browse/HUDI-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6539: - Status: In Progress (was: Open) > New LSM tree style archived timeline > > > Key: HUDI-6539 > URL: https://issues.apache.org/jira/browse/HUDI-6539 > Project: Apache Hudi > Issue Type: Bug > Components: writer-core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6719) Fix data inconsistency issues caused by concurrent clustering and delete partition.
[ https://issues.apache.org/jira/browse/HUDI-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ma Jian updated HUDI-6719: -- Description: Related issue: https://issues.apache.org/jira/browse/HUDI-5553 The specific problem is that when concurrent replace commit operations are executed, two replace commits may point to the same file ID, resulting in a duplicate key error. The existing issue solved the problem of scheduling delete partition while there are pending clustering or compaction operations, which will be prevented in this case. However, this solution is not perfect and may still cause data inconsistency if a clustering plan is scheduled before the delete partition is committed, because the validation is one-way. In this case, both replace commits will still contain duplicate keys, and the table will become inconsistent when both plans are committed. This is critical, and there are other similar scenarios that may bypass the validation of the existing issue. Moreover, the existing issue is at the partition level and is not precise enough. Here is my solution: !https://intranetproxy.alipay.com/skylark/lark/0/2023/png/62256341/1692328998008-f9dc6530-e44e-43e7-9b75-d760b55b3dfa.png|width=335,id=WXCCX! As shown in the figure, both drop partition and clustering will go through a period of time during which they are not registered to the timeline, which is the scenario that the previous issue did not solve. Here, I register the replace file IDs involved in each replace commit to the active timeline (the replace commit timeline that has been submitted has saved partitionToReplaceFileIds, and only pending requests need to be processed). Since in the case of Spark SQL, delete partition creates a requested commit in advance during write, which is inconvenient to handle, I save the pending replace commit's partitionToReplaceFileIds information to the inflight commit's extra metadata.
Therefore, each time drop partition or clustering is executed, it only needs to read the partitionToReplaceFileIds information in the timeline after ensuring that the inflight commit information has been saved to the timeline, to ensure that there are no duplicate file IDs and prevent this kind of error from occurring. In simple terms, each replace commit will register the replace file ID information to the timeline whether it is submitted or not; at the same time, each submission will check this information to ensure that it will not be repeated, so that any replace commit containing this file ID will be prevented, ensuring that there are no duplicate keys. When this idea is also implemented on the compaction commit, the modification involved in the related issue can be removed. was: Related issue: https://issues.apache.org/jira/browse/HUDI-5553 The specific problem is that when concurrent replace commit operations are executed, two replace commits may point to the same file ID, resulting in a duplicate key error. The existing issue solved the problem of scheduling delete partition while there are pending clustering or compaction operations, which will be prevented in this case. However, this solution is not perfect and may still cause data inconsistency if a clustering plan is scheduled before the delete partition is committed, because the validation is one-way. In this case, both replace commits will still contain duplicate keys, and the table will become inconsistent when both plans are committed. This is critical, and there are other similar scenarios that may bypass the validation of the existing issue. Moreover, the existing issue is at the partition level and is not precise enough. Here is my solution: !https://intranetproxy.alipay.com/skylark/lark/0/2023/png/62256341/1692328998008-f9dc6530-e44e-43e7-9b75-d760b55b3dfa.png|width=335,id=WXCCX!
As shown in the figure, both drop partition and clustering will go through a period of time during which they are not registered to the timeline, which is the scenario that the previous issue did not solve. Here, I register the replace file IDs involved in each replace commit to the active timeline (the replace commit timeline that has been submitted has saved partitionToReplaceFileIds, and only pending requests need to be processed). Since in the case of Spark SQL, delete partition creates a requested commit in advance during write, which is inconvenient to handle, I save the pending replace commit's partitionToReplaceFileIds information to the inflight commit's extra metadata. Therefore, each time drop partition or clustering is executed, it only needs to read the partitionToReplaceFileIds information in the timeline after ensuring that the inflight commit information has been saved to the timeline, to ensure that there are no duplicate file IDs and prevent this kind of error from occurring. In simple terms, each replace commit will register the replace file ID information to the timeline.
[GitHub] [hudi] hudi-bot commented on pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
hudi-bot commented on PR #9459: URL: https://github.com/apache/hudi/pull/9459#issuecomment-1683342280 ## CI report: * 170678f0e7c429406a4565d85e77367908c1fb4b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19340) * 13cd8f29dd7aceccb83a9a44aa464d70d55bb57c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19345)
[jira] [Updated] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes in MOR
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6720: Status: In Progress (was: Open) > Prototype and benchmark position- and key-based updates and deletes in MOR > -- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6721) Prototype and benchmark partial updates in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6721: Fix Version/s: 1.0.0 > Prototype and benchmark partial updates in MOR log merging > -- > > Key: HUDI-6721 > URL: https://issues.apache.org/jira/browse/HUDI-6721 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6721) Prototype and benchmark partial updates in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6721: Epic Link: HUDI-6722 > Prototype and benchmark partial updates in MOR log merging > -- > > Key: HUDI-6721 > URL: https://issues.apache.org/jira/browse/HUDI-6721 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6721) Prototype and benchmark partial updates in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6721: Status: In Progress (was: Open) > Prototype and benchmark partial updates in MOR log merging > -- > > Key: HUDI-6721 > URL: https://issues.apache.org/jira/browse/HUDI-6721 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes in MOR
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6720: Fix Version/s: 1.0.0 > Prototype and benchmark position- and key-based updates and deletes in MOR > -- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes in MOR
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6720: Epic Link: HUDI-6722 > Prototype and benchmark position- and key-based updates and deletes in MOR > -- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5386) Cleaning conflicts in occ mode
[ https://issues.apache.org/jira/browse/HUDI-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-5386: -- Epic Link: HUDI-1456 Fix Version/s: 0.14.0 0.14.1 > Cleaning conflicts in occ mode > -- > > Key: HUDI-5386 > URL: https://issues.apache.org/jira/browse/HUDI-5386 > Project: Apache Hudi > Issue Type: Bug >Reporter: HunterXHunter >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0, 0.14.1 > > Attachments: image-2022-12-14-11-26-21-995.png, > image-2022-12-14-11-26-37-252.png > > > {code:java} > configuration parameter: > 'hoodie.cleaner.policy.failed.writes' = 'LAZY' > 'hoodie.write.concurrency.mode' = 'optimistic_concurrency_control' {code} > Because `getInstantsToRollback` is not locked, multiple writers get the same > `instantsToRollback`; the same `instant` will be deleted multiple times and > the same `rollback.inflight` will be created multiple times. > !image-2022-12-14-11-26-37-252.png! > !image-2022-12-14-11-26-21-995.png!
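The race described in HUDI-5386 (an unguarded `getInstantsToRollback` letting concurrent lazy-cleaning writers pick up the same instant) can be sketched as a toy guard. This is illustrative only, not Hudi's actual transaction-manager or clean-planner code; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

public class RollbackGuard {
  private final ReentrantLock tableLock = new ReentrantLock();
  // Instants currently claimed for rollback by some writer.
  private final Set<String> inflightRollbacks = new HashSet<>();

  /**
   * Returns the instants this writer may roll back, claiming them under the
   * table lock so two concurrent writers never pick the same instant.
   */
  public List<String> claimInstantsToRollback(List<String> failedInstants) {
    tableLock.lock();
    try {
      List<String> claimed = new ArrayList<>();
      for (String instant : failedInstants) {
        // add() returns false if another writer already claimed this instant.
        if (inflightRollbacks.add(instant)) {
          claimed.add(instant);
        }
      }
      return claimed;
    } finally {
      tableLock.unlock();
    }
  }
}
```

With this pattern, a second writer that observes the same failed instants claims only the ones not already taken, so no `rollback.inflight` is created twice.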
[jira] [Created] (HUDI-6722) Performance and API improvement on record merging
Ethan Guo created HUDI-6722: --- Summary: Performance and API improvement on record merging Key: HUDI-6722 URL: https://issues.apache.org/jira/browse/HUDI-6722 Project: Apache Hudi Issue Type: New Feature Reporter: Ethan Guo
[jira] [Updated] (HUDI-6722) Performance and API improvement on record merging
[ https://issues.apache.org/jira/browse/HUDI-6722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6722: Fix Version/s: 1.0.0 > Performance and API improvement on record merging > - > > Key: HUDI-6722 > URL: https://issues.apache.org/jira/browse/HUDI-6722 > Project: Apache Hudi > Issue Type: Epic >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6722) Performance and API improvement on record merging
[ https://issues.apache.org/jira/browse/HUDI-6722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6722: Issue Type: Epic (was: New Feature) > Performance and API improvement on record merging > - > > Key: HUDI-6722 > URL: https://issues.apache.org/jira/browse/HUDI-6722 > Project: Apache Hudi > Issue Type: Epic >Reporter: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6722) Performance and API improvement on record merging
[ https://issues.apache.org/jira/browse/HUDI-6722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6722: --- Assignee: Ethan Guo > Performance and API improvement on record merging > - > > Key: HUDI-6722 > URL: https://issues.apache.org/jira/browse/HUDI-6722 > Project: Apache Hudi > Issue Type: Epic >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
hudi-bot commented on PR #9459: URL: https://github.com/apache/hudi/pull/9459#issuecomment-1683337770 ## CI report: * 170678f0e7c429406a4565d85e77367908c1fb4b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19340) * 13cd8f29dd7aceccb83a9a44aa464d70d55bb57c UNKNOWN
[jira] [Created] (HUDI-6721) Prototype and benchmark partial updates in MOR log merging
Ethan Guo created HUDI-6721: --- Summary: Prototype and benchmark partial updates in MOR log merging Key: HUDI-6721 URL: https://issues.apache.org/jira/browse/HUDI-6721 Project: Apache Hudi Issue Type: New Feature Reporter: Ethan Guo
[jira] [Assigned] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6720: --- Assignee: Ethan Guo > Prototype and benchmark position- and key-based updates and deletes > --- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes in MOR
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6720: Summary: Prototype and benchmark position- and key-based updates and deletes in MOR (was: Prototype and benchmark position- and key-based updates and deletes) > Prototype and benchmark position- and key-based updates and deletes in MOR > -- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6721) Prototype and benchmark partial updates in MOR log merging
[ https://issues.apache.org/jira/browse/HUDI-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-6721: --- Assignee: Ethan Guo > Prototype and benchmark partial updates in MOR log merging > -- > > Key: HUDI-6721 > URL: https://issues.apache.org/jira/browse/HUDI-6721 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6720: Issue Type: New Feature (was: Task) > Prototype and benchmark position- and key-based updates and deletes > --- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6720) Prototype and benchmark position- and key-based updates and deletes
[ https://issues.apache.org/jira/browse/HUDI-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-6720: Summary: Prototype and benchmark position- and key-based updates and deletes (was: Benchmark position- and key-based updates and deletes) > Prototype and benchmark position- and key-based updates and deletes > --- > > Key: HUDI-6720 > URL: https://issues.apache.org/jira/browse/HUDI-6720 > Project: Apache Hudi > Issue Type: Task >Reporter: Ethan Guo >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6720) Benchmark position- and key-based updates and deletes
Ethan Guo created HUDI-6720: --- Summary: Benchmark position- and key-based updates and deletes Key: HUDI-6720 URL: https://issues.apache.org/jira/browse/HUDI-6720 Project: Apache Hudi Issue Type: Task Reporter: Ethan Guo
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1683332246 ## CI report: * a0db166250fe0220494b18b0c0d343d1a3adae7b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19342)
[jira] [Created] (HUDI-6719) Fix data inconsistency issues caused by concurrent clustering and delete partition.
Ma Jian created HUDI-6719: - Summary: Fix data inconsistency issues caused by concurrent clustering and delete partition. Key: HUDI-6719 URL: https://issues.apache.org/jira/browse/HUDI-6719 Project: Apache Hudi Issue Type: Bug Reporter: Ma Jian Related issue: https://issues.apache.org/jira/browse/HUDI-5553 The specific problem is that when concurrent replace commit operations are executed, two replace commits may point to the same file ID, resulting in a duplicate key error. The existing issue solved the problem of scheduling delete partition while there are pending clustering or compaction operations, which will be prevented in this case. However, this solution is not perfect and may still cause data inconsistency if a clustering plan is scheduled before the delete partition is committed, because the validation is one-way. In this case, both replace commits will still contain duplicate keys, and the table will become inconsistent when both plans are committed. This is a critical problem, and there are other similar scenarios that may bypass the validation of the existing issue. Moreover, the existing issue operates at the partition level and is not precise enough. Here is my solution: !https://intranetproxy.alipay.com/skylark/lark/0/2023/png/62256341/1692328998008-f9dc6530-e44e-43e7-9b75-d760b55b3dfa.png|width=335,id=WXCCX! As shown in the figure, both drop partition and clustering go through a period of time during which they are not registered to the timeline, which is the scenario that the previous issue did not solve. Here, I register the replace file IDs involved in each replace commit to the active timeline (replace commits that have already been committed have saved partitionToReplaceFileIds, so only pending requests need to be processed). 
Since in the case of Spark SQL, delete partition creates a requested commit in advance during the write, which is inconvenient to handle, I save the pending replace commit's partitionToReplaceFileIds information to the inflight commit's extra metadata. Therefore, each time a drop partition or clustering is executed, it only needs to read the partitionToReplaceFileIds information in the timeline (after ensuring that the inflight commit information has been saved to the timeline) to verify that there are no duplicate file IDs and prevent this kind of error. In simple terms, each replace commit registers its replaced file ID information to the timeline whether or not it has been committed; at the same time, each submission checks this information to ensure it is not repeated, so any replace commit containing an already-claimed file ID is blocked, guaranteeing that there are no duplicate keys.
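The register-then-check idea described in this issue can be sketched as a toy registry (hypothetical names, not Hudi's actual timeline API): every replace commit, pending or completed, publishes its replaced file IDs, and a new replace plan is rejected when any of its file IDs is already claimed:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ReplaceCommitRegistry {
  // Instant time -> file IDs that replace commit targets (pending and completed alike).
  private final Map<String, Set<String>> replacedFileIds = new HashMap<>();

  /** Registers a replace commit, rejecting it if any file ID is already targeted. */
  public synchronized void register(String instantTime, Set<String> fileIds) {
    Set<String> claimed = new HashSet<>();
    replacedFileIds.values().forEach(claimed::addAll);
    for (String id : fileIds) {
      if (claimed.contains(id)) {
        throw new IllegalStateException(
            "File ID " + id + " is already targeted by another replace commit");
      }
    }
    replacedFileIds.put(instantTime, new HashSet<>(fileIds));
  }
}
```

Because both drop-partition and clustering plans must pass through `register` before executing, a second replace commit targeting the same file ID fails fast instead of producing duplicate keys later.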
[GitHub] [hudi] prathit06 commented on a diff in pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
prathit06 commented on code in PR #9459: URL: https://github.com/apache/hudi/pull/9459#discussion_r1297953678 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java: ## @@ -479,4 +479,19 @@ private Map getGroupOffsets(KafkaConsumer consumer, Set
[GitHub] [hudi] yyh2954360585 closed issue #9471: [SUPPORT] When using DeltaStreamer JdbcSource to extract data, there are issues with data loss and slow queries of source-side data
yyh2954360585 closed issue #9471: [SUPPORT] When using DeltaStreamer JdbcSource to extract data, there are issues with data loss and slow queries of source-side data URL: https://github.com/apache/hudi/issues/9471
[GitHub] [hudi] yyh2954360585 opened a new issue, #9471: [SUPPORT] When using DeltaStreamer JdbcSource to extract data, there are issues with data loss and slow queries of source-side data
yyh2954360585 opened a new issue, #9471: URL: https://github.com/apache/hudi/issues/9471 **Describe the problem you faced** Q1: Assume the source table `order` has a total data volume of 5 million rows and is synchronized using the DeltaStreamer JdbcSource with this Hudi conf: ` --hoodie-conf hoodie.deltastreamer.jdbc.incr.pull=true` `--hoodie-conf hoodie.deltastreamer.jdbc.table.incr.column.name=update_date` `--source-limit 10` `--continuous` When DeltaStreamer has synchronized 400,000 (40w) rows, the current lastCheckpoint=2023-08-17 14:55 0:00:00 so the SQL built by the incrementalFetch method to query the source data is: `select (select * from order where update_date>"2023-08-17 14:55 0:00:00" order by update_date limit 10) rdbms_table` Assuming 20 rows in the order table have an update_date equal to "2023-08-17 14:55 1:00:000", only 10 rows of data will be obtained because sourceLimit=10, and the other 10 rows will be lost. Q2: Why are these two parameters set? **Environment Description** * Hudi version :0.13.1 * Spark version :3.2.1 * Hive version :3.1.3 * Hadoop version :3.3.3 * Storage (HDFS/S3/GCS..) :HDFS * Running on Docker? (yes/no) :no
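The data-loss scenario in Q1 can be reproduced in miniature. The sketch below is illustrative only (it is not DeltaStreamer code): with a strict `>` checkpoint filter plus a row limit, the rows sharing the boundary timestamp that fall past the limit are never fetched again, because the next checkpoint advances to that tied value:

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class IncrPullDemo {
  /** Simulates: select * from t where update_date > checkpoint order by update_date limit n */
  static List<Long> pull(List<Long> rows, long checkpoint, int limit) {
    return rows.stream()
        .filter(ts -> ts > checkpoint)
        .sorted()
        .limit(limit)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // 20 rows, all with the same update_date = 100; previous checkpoint = 99.
    List<Long> rows = Collections.nCopies(20, 100L);
    List<Long> batch = pull(rows, 99L, 10);            // only 10 of the 20 tied rows fetched
    long nextCheckpoint = batch.get(batch.size() - 1); // checkpoint advances to 100
    List<Long> nextBatch = pull(rows, nextCheckpoint, 10);
    System.out.println(batch.size() + " then " + nextBatch.size()); // prints: 10 then 0
  }
}
```

The remaining 10 tied rows are skipped forever, which matches the loss described above when the number of rows sharing one checkpoint value exceeds the source limit.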
[GitHub] [hudi] zyclove opened a new issue, #9470: [SUPPORT] spark-sql hudi 0.12.3 Caused by: org.apache.avro.AvroTypeException: Found long, expecting union
zyclove opened a new issue, #9470: URL: https://github.com/apache/hudi/issues/9470 A Spark SQL query on a Hudi table fails with the errors below when running: `select count(1) from hudi_table;` **To Reproduce** Steps to reproduce the behavior: 1. create a Hudi MOR table 2. write data 3. query counts 4. error as follows ``` 1. Caused by: org.apache.avro.AvroTypeException: Found long, expecting union 2. Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong ``` **Environment Description** * Hudi version :0.12.3 * Spark version :3.2.1 * Hive version :3.1.2 * Hadoop version :3.2.2 * Storage (HDFS/S3/GCS..) :s3 * Running on Docker? (yes/no) :no **Stacktrace** ``` 23/08/18 10:51:07 INFO TaskSetManager: Starting task 8.0 in stage 1.0 (TID 43) (172.30.15.96, executor 4, partition 8, PROCESS_LOCAL, 5126 bytes) taskResourceAssignments Map() 23/08/18 10:51:07 WARN TaskSetManager: Lost task 5.0 in stage 1.0 (TID 40) (172.30.15.96 executor 4): org.apache.hudi.exception.HoodieException: Exception when reading log file at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:377) at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:220) at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:209) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:113) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:106) at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:343) at 
org.apache.hudi.LogFileIterator$.scanLog(LogFileIterator.scala:305) at org.apache.hudi.LogFileIterator.<init>(LogFileIterator.scala:89) at org.apache.hudi.RecordMergingFileIterator.<init>(LogFileIterator.scala:180) at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:104) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.avro.AvroTypeException: Found long, expecting union at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:308) at org.apache.avro.io.parsing.Parser.advance(Parser.java:86) at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:275) at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:187) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160) at 
org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:259) at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247) at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.j
[GitHub] [hudi] danny0405 commented on a diff in pull request #9416: [HUDI-6678] Fix the acquisition of clean&rollback instants to archive
danny0405 commented on code in PR #9416: URL: https://github.com/apache/hudi/pull/9416#discussion_r1297928994 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java: ## @@ -452,107 +431,137 @@ private Stream<HoodieInstant> getCommitInstantsToArchive() throws IOException { ? CompactionUtils.getOldestInstantToRetainForCompaction( table.getActiveTimeline(), config.getInlineCompactDeltaCommitMax()) : Option.empty(); + oldestInstantToRetainCandidates.add(oldestInstantToRetainForCompaction); - // The clustering commit instant can not be archived unless we ensure that the replaced files have been cleaned, + // 3. The clustering commit instant can not be archived unless we ensure that the replaced files have been cleaned, // without the replaced files metadata on the timeline, the fs view would expose duplicates for readers. // Meanwhile, when inline or async clustering is enabled, we need to ensure that there is a commit in the active timeline // to check whether the file slice generated in pending clustering after archive isn't committed. Option<HoodieInstant> oldestInstantToRetainForClustering = ClusteringUtils.getOldestInstantToRetainForClustering(table.getActiveTimeline(), table.getMetaClient()); + oldestInstantToRetainCandidates.add(oldestInstantToRetainForClustering); + + // 4. If metadata table is enabled, do not archive instants which are more recent than the last compaction on the + // metadata table. 
+ if (table.getMetaClient().getTableConfig().isMetadataTableAvailable()) { +try (HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(table.getContext(), config.getMetadataConfig(), config.getBasePath())) { + Option<String> latestCompactionTime = tableMetadata.getLatestCompactionTime(); + if (!latestCompactionTime.isPresent()) { +LOG.info("Not archiving as there is no compaction yet on the metadata table"); +return Collections.emptyList(); + } else { +LOG.info("Limiting archiving of instants to latest compaction on metadata table at " + latestCompactionTime.get()); +oldestInstantToRetainCandidates.add(Option.of(new HoodieInstant( +HoodieInstant.State.COMPLETED, COMPACTION_ACTION, latestCompactionTime.get()))); + } +} catch (Exception e) { + throw new HoodieException("Error limiting instant archival based on metadata table", e); +} + } + + // 5. If this is a metadata table, do not archive the commits that live in data set + // active timeline. This is required by metadata table, + // see HoodieTableMetadataUtil#processRollbackMetadata for details. + if (table.isMetadataTable()) { +HoodieTableMetaClient dataMetaClient = HoodieTableMetaClient.builder() + .setBasePath(HoodieTableMetadata.getDatasetBasePath(config.getBasePath())) +.setConf(metaClient.getHadoopConf()) +.build(); +Option<HoodieInstant> qualifiedEarliestInstant = +TimelineUtils.getEarliestInstantForMetadataArchival( +dataMetaClient.getActiveTimeline(), config.shouldArchiveBeyondSavepoint()); + +// Do not archive the instants after the earliest commit (COMMIT, DELTA_COMMIT, and +// REPLACE_COMMIT only, considering non-savepoint commit only if enabling archive +// beyond savepoint) and the earliest inflight instant (all actions). +// This is required by metadata table, see HoodieTableMetadataUtil#processRollbackMetadata +// for details. +// Todo: Remove #7580 Review Comment: We should keep at least the clean commit on the active timeline, right? What about the rollback?
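The numbered retention rules in the diff above all feed one pattern: each rule contributes an optional "oldest instant to retain," and archival is bounded by the earliest candidate present. A minimal, simplified sketch of that reduction (a hypothetical helper, not the actual `HoodieTimelineArchiver` code):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class ArchiveBound {
  /**
   * Earliest instant time among the candidates that are present, if any.
   * Hudi instant times are timestamp strings, so lexicographic order
   * matches chronological order.
   */
  static Optional<String> oldestToRetain(List<Optional<String>> candidates) {
    return candidates.stream()
        .filter(Optional::isPresent)
        .map(Optional::get)
        .min(Comparator.naturalOrder());
  }
}
```

Each rule (compaction retention, clustering retention, metadata-table compaction, etc.) simply appends its candidate; an absent candidate (`Optional.empty()`) imposes no bound, and the archiver then archives only instants strictly before the minimum.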
[jira] [Updated] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6596: -- Epic Link: HUDI-1456 Reviewers: Sagar Sumit > Propose rollback implementation changes to guard against concurrent jobs > - > > Key: HUDI-6596 > URL: https://issues.apache.org/jira/browse/HUDI-6596 > Project: Apache Hudi > Issue Type: Wish >Reporter: Krishen Bhan >Priority: Trivial > Fix For: 1.0.0 > > > h1. Issue > The existing rollback API in 0.14 > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877] > executes a rollback plan, either taking in an existing rollback plan > provided by the caller for a previous rollback or attempt, or scheduling a > new rollback instant if none is provided. Currently it is not safe for two > concurrent jobs to call this API (when skipLocking=False and the callers > aren't already holding a lock), as this can lead to an issue where multiple > rollback requested plans are created or two jobs are executing the same > rollback instant at the same time. > h1. Proposed change > One way to resolve this issue is to refactor this rollback function such that > if skipLocking=false, the following steps are followed > # Acquire the table lock > # Reload the active timeline > # Look at the active timeline to see if there is a inflight rollback instant > from a previous rollback attempt, if it exists then assign this is as the > rollback plan to execute. Also, check if a pending rollback plan was passed > in by caller. Then it executes the following steps depending on whether the > caller passed a pending rollback instant plan. > ## [a] If a pending inflight rollback plan was passed in by caller, then > check that there is a previous attempted rollback instant on timeline (and > that the instant times match) and continue to use this rollback plan. 
If that > isn't the case, then raise a rollback exception since this means another job > has concurrently already executed this plan. Note that in a valid HUDI > dataset there can be at most one rollback instant for a corresponding commit > instant, which is why if we no longer see a pending rollback in timeline in > this phase we can safely assume that it had already been executed to > completion. > ## [b] If no pending inflight rollback plan was passed in by caller and no > pending rollback instant was found in timeline earlier, then schedule a new > rollback plan > # Now that a rollback plan and requested rollback instant time has been > assigned, check for an active heartbeat for the rollback instant time. If > there is one, then abort the rollback as that means there is a concurrent job > executing that rollback. If not, then start a heartbeat for that rollback > instant time. > # Release the table lock > # Execute the rollback plan and complete the rollback instant. Regardless of > whether this succeeds or fails with an exception, close the heartbeat. This > increases the chance that the next job that tries to call this rollback API > will follow through with the rollback and not abort due to an active previous > heartbeat > > * These steps will only be enforced for skipLocking=false, since if > skipLocking=true then that means the caller may already be explicitly holding > a table lock. In this case, acquiring the lock again in step (1) will fail. > * Acquiring a lock and reloading timeline for (1-3) will guard against data > race conditions where another job calls this rollback API at same time and > schedules its own rollback plan and instant. This is since if no rollback has > been attempted before for this instant, then before step (1), there is a > window of time where another concurrent rollback job could have scheduled a > rollback plan, failed execution, and cleaned up heartbeat, all while the > current rollback job is running. 
As a result, even if the current job was > passed in an empty pending rollback plan, it still needs to check the active > timeline to ensure that no new rollback pending instant has been created. > * Using a heartbeat will signal to other callers in other jobs that there is > another job already executing this rollback. Checking for expired heartbeat > and (re)-starting the heartbeat has to be done under a lock, so that multiple > jobs don't each start it at the same time and assume that they are the only > ones that are heartbeating. > * The table lock is no longer needed after (5), since it can now be safely > assumed that no other job (calling this rollback API) will execute this > rollback instant. > One example implementation to achieve this: > > {code:java} > @Deprecated > public
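The JIRA's own example implementation is truncated above; the proposed steps can be condensed into the following sketch (illustrative names only, not the actual `BaseHoodieTableServiceClient` API): claim the rollback instant under the table lock via a heartbeat, release the lock, execute, then close the heartbeat whether execution succeeds or fails:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

public class HeartbeatRollback {
  private final ReentrantLock tableLock = new ReentrantLock();
  // Rollback instants with an active heartbeat (stand-in for heartbeat files).
  private final Set<String> activeHeartbeats = new HashSet<>();

  /** Returns false if another job is already executing this rollback. */
  public boolean rollback(String rollbackInstant, Runnable executePlan) {
    tableLock.lock();
    try {
      // Steps 1-4: under the lock, abort if another job holds the heartbeat,
      // otherwise start our own heartbeat for this rollback instant.
      if (!activeHeartbeats.add(rollbackInstant)) {
        return false;
      }
    } finally {
      tableLock.unlock(); // step 5: release before the (possibly long) execution
    }
    try {
      executePlan.run();  // step 6: execute the rollback plan
      return true;
    } finally {
      tableLock.lock();
      try {
        activeHeartbeats.remove(rollbackInstant); // close heartbeat regardless of outcome
      } finally {
        tableLock.unlock();
      }
    }
  }
}
```

Closing the heartbeat in a `finally` block mirrors the proposal's point that a failed execution must not leave a live heartbeat behind, so the next job can retry instead of aborting.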
[jira] [Updated] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-6596: -- Fix Version/s: 1.0.0 > Propose rollback implementation changes to guard against concurrent jobs > - > > Key: HUDI-6596 > URL: https://issues.apache.org/jira/browse/HUDI-6596 > Project: Apache Hudi > Issue Type: Wish >Reporter: Krishen Bhan >Priority: Trivial > Fix For: 1.0.0 > > > h1. Issue > The existing rollback API in 0.14 > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877] > executes a rollback plan, either taking in an existing rollback plan > provided by the caller for a previous rollback or attempt, or scheduling a > new rollback instant if none is provided. Currently it is not safe for two > concurrent jobs to call this API (when skipLocking=False and the callers > aren't already holding a lock), as this can lead to an issue where multiple > rollback requested plans are created or two jobs are executing the same > rollback instant at the same time. > h1. Proposed change > One way to resolve this issue is to refactor this rollback function such that > if skipLocking=false, the following steps are followed > # Acquire the table lock > # Reload the active timeline > # Look at the active timeline to see if there is a inflight rollback instant > from a previous rollback attempt, if it exists then assign this is as the > rollback plan to execute. Also, check if a pending rollback plan was passed > in by caller. Then it executes the following steps depending on whether the > caller passed a pending rollback instant plan. > ## [a] If a pending inflight rollback plan was passed in by caller, then > check that there is a previous attempted rollback instant on timeline (and > that the instant times match) and continue to use this rollback plan. 
If that > isn't the case, then raise a rollback exception, since this means another job > has concurrently already executed this plan. Note that in a valid HUDI > dataset there can be at most one rollback instant for a corresponding commit > instant, which is why, if we no longer see a pending rollback in the timeline at > this phase, we can safely assume that it has already been executed to > completion. > ## [b] If no pending inflight rollback plan was passed in by the caller and no > pending rollback instant was found in the timeline earlier, then schedule a new > rollback plan > # Now that a rollback plan and requested rollback instant time have been > assigned, check for an active heartbeat for the rollback instant time. If > there is one, then abort the rollback, as that means there is a concurrent job > executing that rollback. If not, then start a heartbeat for that rollback > instant time. > # Release the table lock > # Execute the rollback plan and complete the rollback instant. Regardless of > whether this succeeds or fails with an exception, close the heartbeat. This > increases the chance that the next job that tries to call this rollback API > will follow through with the rollback and not abort due to an active previous > heartbeat > > * These steps will only be enforced for skipLocking=false, since if > skipLocking=true then that means the caller may already be explicitly holding > a table lock. In this case, acquiring the lock again in step (1) will fail. > * Acquiring a lock and reloading the timeline for (1-3) will guard against data > race conditions where another job calls this rollback API at the same time and > schedules its own rollback plan and instant. This is because, if no rollback has > been attempted before for this instant, then before step (1) there is a > window of time where another concurrent rollback job could have scheduled a > rollback plan, failed execution, and cleaned up its heartbeat, all while the > current rollback job is running. 
As a result, even if the current job was > passed in an empty pending rollback plan, it still needs to check the active > timeline to ensure that no new rollback pending instant has been created. > * Using a heartbeat will signal to other callers in other jobs that there is > another job already executing this rollback. Checking for expired heartbeat > and (re)-starting the heartbeat has to be done under a lock, so that multiple > jobs don't each start it at the same time and assume that they are the only > ones that are heartbeating. > * The table lock is no longer needed after (5), since it can now be safely > assumed that no other job (calling this rollback API) will execute this > rollback instant. > One example implementation to achieve this: > > {code:java} > @Deprecated > public boolean rollback(final Str
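The locking-plus-heartbeat flow described in steps 1-6 above can be sketched as a dependency-free simulation. All names below are hypothetical stand-ins, not Hudi APIs, and the in-memory sets stand in for the active timeline and the heartbeat files on storage:

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the proposed skipLocking=false flow.
class RollbackCoordinator {
    private final ReentrantLock tableLock = new ReentrantLock();
    private final Set<String> pendingRollbacks = new HashSet<>(); // active-timeline stand-in
    private final Set<String> activeHeartbeats = new HashSet<>(); // heartbeat-file stand-in

    /** Returns true if this job executed the rollback, false if it aborted. */
    public boolean rollback(String instantToRollback, Optional<String> callerPlan) {
        String plan;
        tableLock.lock();                                          // step 1: acquire table lock
        try {
            // step 2 would reload the active timeline here
            boolean onTimeline = pendingRollbacks.contains(instantToRollback); // step 3
            if (callerPlan.isPresent()) {
                if (!onTimeline) {                                 // step 3a: another job already executed it
                    throw new IllegalStateException("rollback already executed by another job");
                }
                plan = callerPlan.get();
            } else if (onTimeline) {
                plan = instantToRollback;                          // reuse the pending plan
            } else {
                plan = instantToRollback;                          // step 3b: schedule a new plan
                pendingRollbacks.add(plan);
            }
            if (activeHeartbeats.contains(plan)) {                 // step 4: concurrent executor detected
                return false;                                      // abort
            }
            activeHeartbeats.add(plan);                            // start the heartbeat
        } finally {
            tableLock.unlock();                                    // step 5: release table lock
        }
        try {
            pendingRollbacks.remove(plan);                         // step 6: execute + complete the instant
            return true;
        } finally {
            activeHeartbeats.remove(plan);                         // always close the heartbeat
        }
    }
}
```

A real implementation would also need heartbeat expiry (a crashed executor must not block rollbacks forever), which this sketch omits.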
[GitHub] [hudi] danny0405 merged pull request #9464: [MINOR] StreamerUtil#getTableConfig should check whether hoodie.properties exists
danny0405 merged PR #9464: URL: https://github.com/apache/hudi/pull/9464 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [MINOR] StreamerUtil#getTableConfig should check whether hoodie.properties exists (#9464)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new ba5ab8ca468 [MINOR] StreamerUtil#getTableConfig should check whether hoodie.properties exists (#9464) ba5ab8ca468 is described below commit ba5ab8ca46863a67023e7172fb16a9a36d3b5acb Author: Nicholas Jiang AuthorDate: Fri Aug 18 10:03:12 2023 +0800 [MINOR] StreamerUtil#getTableConfig should check whether hoodie.properties exists (#9464) --- .../hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java index 4912c0abf03..842e732abd4 100644 --- a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java +++ b/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java @@ -312,7 +312,7 @@ public class StreamerUtil { FileSystem fs = FSUtils.getFs(basePath, hadoopConf); Path metaPath = new Path(basePath, HoodieTableMetaClient.METAFOLDER_NAME); try { - if (fs.exists(metaPath)) { + if (fs.exists(new Path(metaPath, HoodieTableConfig.HOODIE_PROPERTIES_FILE))) { return Option.of(new HoodieTableConfig(fs, metaPath.toString(), null, null)); } } catch (IOException e) {
[hudi] branch master updated: [HUDI-6476][FOLLOW-UP] Path filter by FileStatus to avoid additional fs request (#9366)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 7fbf7a36690 [HUDI-6476][FOLLOW-UP] Path filter by FileStatus to avoid additional fs request (#9366) 7fbf7a36690 is described below commit 7fbf7a366900536053c4333dc7d6f4d0ad9b06b4 Author: Wechar Yu AuthorDate: Fri Aug 18 09:43:48 2023 +0800 [HUDI-6476][FOLLOW-UP] Path filter by FileStatus to avoid additional fs request (#9366) --- .../metadata/FileSystemBackedTableMetadata.java| 95 ++ 1 file changed, 41 insertions(+), 54 deletions(-) diff --git a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java index b4a4da01977..8ea9861734a 100644 --- a/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java +++ b/hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java @@ -54,6 +54,7 @@ import java.util.List; import java.util.Map; import java.util.concurrent.CopyOnWriteArrayList; import java.util.stream.Collectors; +import java.util.stream.Stream; /** * Implementation of {@link HoodieTableMetadata} based file-system-backed table metadata. @@ -167,66 +168,52 @@ public class FileSystemBackedTableMetadata extends AbstractHoodieTableMetadata { // TODO: Get the parallelism from HoodieWriteConfig int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size()); - // List all directories in parallel + // List all directories in parallel: + // if current directory contains PartitionMetadata, add it to result + // if current directory does not contain PartitionMetadata, add its subdirectory to queue to be processed. 
engineContext.setJobStatus(this.getClass().getSimpleName(), "Listing all partitions with prefix " + relativePathPrefix); - List<FileStatus> dirToFileListing = engineContext.flatMap(pathsToList, path -> { + // result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. + // and second entry holds optionally a directory path to be processed further. + List<Pair<Option<String>, Option<Path>>> result = engineContext.flatMap(pathsToList, path -> { FileSystem fileSystem = path.getFileSystem(hadoopConf.get()); -return Arrays.stream(fileSystem.listStatus(path)); +if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) { + return Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), path)), Option.empty())); +} +return Arrays.stream(fileSystem.listStatus(path)) +.filter(status -> status.isDirectory() && !status.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) +.map(status -> Pair.of(Option.empty(), Option.of(status.getPath()))); }, listingParallelism); pathsToList.clear(); - // if current directory contains PartitionMetadata, add it to result - // if current directory does not contain PartitionMetadata, add it to queue to be processed. - int fileListingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, dirToFileListing.size()); - if (!dirToFileListing.isEmpty()) { -// result below holds a list of pair. first entry in the pair optionally holds the deduced list of partitions. -// and second entry holds optionally a directory path to be processed further. 
-engineContext.setJobStatus(this.getClass().getSimpleName(), "Processing listed partitions"); -List<Pair<Option<String>, Option<Path>>> result = engineContext.map(dirToFileListing, fileStatus -> { - FileSystem fileSystem = fileStatus.getPath().getFileSystem(hadoopConf.get()); - if (fileStatus.isDirectory()) { -if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, fileStatus.getPath())) { - return Pair.of(Option.of(FSUtils.getRelativePartitionPath(dataBasePath.get(), fileStatus.getPath())), Option.empty()); -} else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) { - return Pair.of(Option.empty(), Option.of(fileStatus.getPath())); -} - } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) { -String partitionName = FSUtils.getRelativePartitionPath(dataBasePath.get(), fileStatus.getPath().getParent()); -return Pair.of(Option.of(partitionName), Option.empty()); - } - return Pair.of(Option.empty(), Option.empty()); -}, fileListingParallelism); -partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()) -.map(entry -> entry.getKey().get())
[GitHub] [hudi] danny0405 commented on a diff in pull request #9455: [WIP] Connection release fixes for RLI metadata
danny0405 commented on code in PR #9455: URL: https://github.com/apache/hudi/pull/9455#discussion_r1297898775 ## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieAvroHFileReader.java: ## @@ -204,7 +212,9 @@ protected ClosableIterator getIndexedRecordIterator(Schema reader } catch (IOException e) { throw new HoodieIOException("Instantiation HfileScanner failed for " + reader.getHFileInfo().toString()); } -return new RecordIterator(scanner, getSchema(), readerSchema); +RecordIterator iterator = new RecordIterator(scanner, getSchema(), readerSchema); +recordIterators.add(iterator); Review Comment: Shouldn't these iterators be closed by the caller? We should check every invocation of the method and make sure the iterator gets closed. Keeping all the references of the iterators is not an elegant way; maybe here we want to support multiple iterators on one reader and to reuse the reader each time. Instead, we should instantiate a new reader for each iterator. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9455: [WIP] Connection release fixes for RLI metadata
danny0405 commented on code in PR #9455: URL: https://github.com/apache/hudi/pull/9455#discussion_r1297898113 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java: ## @@ -193,15 +212,33 @@ protected ClosableIterator> lookupRecords(List sorte blockContentLoc.getContentPositionInLogFile(), blockContentLoc.getBlockSize()); -final HoodieAvroHFileReader reader = +HoodieAvroHFileReader reader = new HoodieAvroHFileReader(inlineConf, inlinePath, new CacheConfig(inlineConf), inlinePath.getFileSystem(inlineConf), Option.of(getSchemaFromHeader())); // Get writer's schema from the header final ClosableIterator> recordIterator = fullKey ? reader.getRecordsByKeysIterator(sortedKeys, readerSchema) : reader.getRecordsByKeyPrefixIterator(sortedKeys, readerSchema); -return new CloseableMappingIterator<>(recordIterator, data -> (HoodieRecord) data); +ClosableIterator> iterator = new ClosableIterator>() { + @Override + public void close() { +recordIterator.close(); +reader.close(); Review Comment: ditto -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9455: [WIP] Connection release fixes for RLI metadata
danny0405 commented on code in PR #9455: URL: https://github.com/apache/hudi/pull/9455#discussion_r1297897760 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java: ## @@ -175,7 +175,26 @@ protected ClosableIterator> deserializeRecords(byte[] conten FileSystem fs = FSUtils.getFs(pathForReader.toString(), FSUtils.buildInlineConf(getBlockContentLocation().get().getHadoopConf())); // Read the content HoodieAvroHFileReader reader = new HoodieAvroHFileReader(fs, pathForReader, content, Option.of(getSchemaFromHeader())); -return unsafeCast(reader.getRecordIterator(readerSchema)); + +ClosableIterator> recordIterator = reader.getRecordIterator(readerSchema); +ClosableIterator> iterator = new ClosableIterator>() { + @Override + public void close() { +recordIterator.close(); +reader.close(); Review Comment: Isn't the `recordIterator.close()` just closing the reader? Why nest it in another iterator? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
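The pattern under review, an iterator that owns its reader so that a single close() releases both resources, can be sketched independently of Hudi. All names below are illustrative stand-ins, not the real HoodieAvroHFileReader classes:

```java
import java.util.Arrays;
import java.util.Iterator;

// Illustrative stand-in for a reader that holds an open file handle.
class StubReader implements AutoCloseable {
    boolean closed = false;
    @Override
    public void close() { closed = true; }
}

// An iterator that owns its reader: the caller's single close() releases
// both the iteration state and the underlying reader.
class ReaderOwningIterator<T> implements Iterator<T>, AutoCloseable {
    private final Iterator<T> delegate;
    private final AutoCloseable reader;

    ReaderOwningIterator(Iterator<T> delegate, AutoCloseable reader) {
        this.delegate = delegate;
        this.reader = reader;
    }

    @Override public boolean hasNext() { return delegate.hasNext(); }
    @Override public T next() { return delegate.next(); }

    @Override
    public void close() {
        try {
            reader.close(); // closing the iterator releases the reader too
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

With this shape the reader tracks nothing itself; each call that hands out an iterator would instantiate a fresh reader and transfer ownership to the iterator, which is the alternative the reviewer suggests over keeping a list of open iterators inside the reader.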
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1683204951 ## CI report: * 751b8aca531eb397d30fd95637bcf7a1e97a6c08 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19324) * a0db166250fe0220494b18b0c0d343d1a3adae7b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19342) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
danny0405 commented on code in PR #9459: URL: https://github.com/apache/hudi/pull/9459#discussion_r1297895255 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java: ## @@ -479,4 +479,19 @@ private Map getGroupOffsets(KafkaConsumer consumer, Set
[GitHub] [hudi] danny0405 commented on a diff in pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
danny0405 commented on code in PR #9459: URL: https://github.com/apache/hudi/pull/9459#discussion_r1297894861 ## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java: ## @@ -479,4 +479,19 @@ private Map getGroupOffsets(KafkaConsumer consumer, Set
[GitHub] [hudi] hudi-bot commented on pull request #9422: [HUDI-6681] Ensure MOR Column Stats Index skips reading filegroups correctly
hudi-bot commented on PR #9422: URL: https://github.com/apache/hudi/pull/9422#issuecomment-1683197113 ## CI report: * 751b8aca531eb397d30fd95637bcf7a1e97a6c08 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19324) * a0db166250fe0220494b18b0c0d343d1a3adae7b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] BBency commented on issue #9094: Async Clustering failing with errors for MOR table
BBency commented on issue #9094: URL: https://github.com/apache/hudi/issues/9094#issuecomment-1683116028 I was able to make the clustering work on a test job, but it is failing when I apply the same clustering configs on the production table. It fails with the error: py4j.protocol.Py4JJavaError: An error occurred while calling o97.sql. : org.apache.hudi.exception.HoodieClusteringException: **Clustering failed to write to files:** 3b43f625-3095-4834-ab45-beade1dbbfa5-0 at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:381) at org.apache.hudi.client.SparkRDDWriteClient.completeTableService(SparkRDDWriteClient.java:468) What parameters should I consider while specifying the values for hoodie.clustering.plan.strategy.max.num.groups, hoodie.clustering.plan.strategy.small.file.limit, hoodie.clustering.plan.strategy.target.file.max.bytes and hoodie.clustering.plan.strategy.max.bytes.per.group? Can you provide some guidance on this, please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
hudi-bot commented on PR #9468: URL: https://github.com/apache/hudi/pull/9468#issuecomment-1683059173 ## CI report: * ac44e8c1ee6266c53a613ec96dbd89a7223da4c7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19341) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-4756) Clean up usages of "assume.date.partition" config within hudi
[ https://issues.apache.org/jira/browse/HUDI-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755708#comment-17755708 ] Lin Liu commented on HUDI-4756: --- Talked with [~shivnarayan] offline, who confirmed that this task has been there for a while, and it is a bit tricky to figure out if this configuration has been used in prod. Therefore, as a next step, we will check the impact of this configuration in a few cases: # Set it to false for a non-date-partitioned table, and then set it to true. # Set it to true for a non-date-partitioned table, and then set it to false. # Set it to false for a date-partitioned table, and then set it to true. # Set it to true for a date-partitioned table, and then set it to false. > Clean up usages of "assume.date.partition" config within hudi > - > > Key: HUDI-4756 > URL: https://issues.apache.org/jira/browse/HUDI-4756 > Project: Apache Hudi > Issue Type: Improvement > Components: configs >Reporter: sivabalan narayanan >Assignee: Lin Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Looks like "assume.date.partition" is not used anywhere within hudi. Let's > clean up the usages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-6712: -- Status: In Progress (was: Open) > Implement optimized keyed lookup on parquet files > - > > Key: HUDI-6712 > URL: https://issues.apache.org/jira/browse/HUDI-6712 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Vinoth Chandar >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > Parquet performs poorly when looking up specific records based > on a single key lookup column. > e.g: select * from parquet where key in ("a", "b", "c") (SQL) > e.g: List lookup(parquetFile, Set keys) (code) > Let's implement a reader that is optimized for this pattern, by scanning the > least amount of data. > Requirements: > 1. Need to support multiple values for the same key. > 2. Can assume the file is sorted by the key/lookup field. > 3. Should handle non-existence of keys. > 4. Should leverage parquet metadata (bloom filters, column index, ... ) to > minimize data read. > 5. Must do the minimum amount of RPC calls to cloud storage. -- This message was sent by Atlassian Jira (v8.20.10#820010)
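The shape of the lookup the requirements describe, sorted data plus per-row-group min/max statistics used to prune reads, can be sketched in dependency-free Java. RowGroup and SortedFileLookup below are hypothetical stand-ins for parquet row-group metadata, and the sketch assumes all duplicates of a key land in a single row group:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for one parquet row group: its min/max key statistics
// plus its key-sorted rows. Duplicates of a key are allowed (req 1) but are
// assumed not to straddle a row-group boundary in this sketch.
class RowGroup {
    final String minKey, maxKey;
    final List<String[]> rows; // each row is {key, value}, sorted by key

    RowGroup(String minKey, String maxKey, List<String[]> rows) {
        this.minKey = minKey;
        this.maxKey = maxKey;
        this.rows = rows;
    }
}

class SortedFileLookup {
    private final List<RowGroup> groups; // ordered, non-overlapping key ranges

    SortedFileLookup(List<RowGroup> groups) { this.groups = groups; }

    /** All values for the given keys; absent keys simply produce no entries (req 3). */
    public List<String> lookup(Set<String> keys) {
        List<String> out = new ArrayList<>();
        for (String key : new TreeSet<>(keys)) {      // visit keys in sorted order
            int lo = 0, hi = groups.size() - 1;
            while (lo <= hi) {                        // binary search over min/max ranges (req 4)
                int mid = (lo + hi) / 2;
                RowGroup g = groups.get(mid);
                if (key.compareTo(g.minKey) < 0) {
                    hi = mid - 1;
                } else if (key.compareTo(g.maxKey) > 0) {
                    lo = mid + 1;
                } else {
                    // only this row group is "read"; collect every match (req 1)
                    for (String[] row : g.rows) {
                        if (row[0].equals(key)) {
                            out.add(row[1]);
                        }
                    }
                    break;
                }
            }
        }
        return out;
    }
}
```

Against real parquet the min/max would come from row-group column statistics or the column index, with bloom filters and page-level stats shrinking the read further; batching the sorted keys per row group keeps the storage RPC count at roughly one read per touched row group (req 5).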
[GitHub] [hudi] kazdy closed pull request #7547: [DOCS] add DROP TABLE, TRUNCATE TABLE docs to spark quick start guide, minor syntax fixes to ALTER TABLE docs
kazdy closed pull request #7547: [DOCS] add DROP TABLE, TRUNCATE TABLE docs to spark quick start guide, minor syntax fixes to ALTER TABLE docs URL: https://github.com/apache/hudi/pull/7547 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] kazdy closed pull request #7548: [DOCS] fix when I click on Update or MergeInto link in spark quickstart it d…
kazdy closed pull request #7548: [DOCS] fix when I click on Update or MergeInto link in spark quickstart it d… URL: https://github.com/apache/hudi/pull/7548 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751866#comment-17751866 ] Krishen Bhan edited comment on HUDI-6596 at 8/17/23 7:47 PM: - I was going to create my PR [https://github.com/kbuci/hudi/pull/2] for this change on the hudi repo, but realized there was an issue since the assumptions made in the rollback implementation (both the existing one and my proposed change) where {{org.apache.hudi.client.BaseHoodieTableServiceClient#rollback(org.apache.hudi.table.HoodieTable, java.lang.String, org.apache.hudi.common.util.Option, java.lang.String)}} is inconsistent with the changes here in [https://github.com/apache/hudi/pull/8849] Specifically, {{org.apache.hudi.client.BaseHoodieTableServiceClient#rollback(org.apache.hudi.table.HoodieTable, java.lang.String, org.apache.hudi.common.util.Option, java.lang.String)}} seems to have been implemented (base on code and comments) under the assumption that a rollback operation will delete all instant files from {{commit instant to rollback}} before completing the rollback operation itself, which is what I had thought when I was working on my rollback fix(es). But it seems that after [https://github.com/apache/hudi/pull/8849] this is (retroactively) incorrect as now we are deleting instant files from {{commit instant to rollback}} after completing the rollback instant, leaving rollback operation as a special type of case where it is possible for rollback instant to be complete even if the actual rollback operation has not fully completed (due to failing after completing the rollback instant but before cleaning up instant files of `{{{}commit instant to rollback{}}} ). 
Although [https://github.com/apache/hudi/pull/8849] handles this by delegating the deleting of instant files from `{{{}commit instant to rollback{}}} to some clean rollbackFailedWrites operation, I think the intention/invariants/rules of how rollback operates is a bit ambiguous to me and something that should be reconciled. To further add complexity, it seems that based on [https://github.com/kbuci/hudi/blob/35be9bbbc7ef7ae6ad0a4955da78da4c0463074f/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L630] it is also currently legal to remove a request rollback plan, in other words "rolling back" a pending rollback plan. Also after taking another look at the reason for [https://github.com/apache/hudi/pull/8849] I think the fix there can be reverted and handled alternatively, since it seems to me that fixing/finding bugs with getPendingRollbackInfo and preventing concurrent rollback scheduling/execution might prevent the underlying issue/reason for PR in the first place was (Author: JIRAUSER301521): I was going to create my PR [https://github.com/kbuci/hudi/pull/2] for this change on the hudi repo, but realized there was an issue since the assumptions made in the rollback implementation (both the existing one and my proposed change) where {{org.apache.hudi.client.BaseHoodieTableServiceClient#rollback(org.apache.hudi.table.HoodieTable, java.lang.String, org.apache.hudi.common.util.Option, java.lang.String)}} is inconsistent with the changes here in [https://github.com/apache/hudi/pull/8849] Specifically, {{org.apache.hudi.client.BaseHoodieTableServiceClient#rollback(org.apache.hudi.table.HoodieTable, java.lang.String, org.apache.hudi.common.util.Option, java.lang.String)}} seems to have been implemented (base on code and comments) under the assumption that a rollback operation will delete all instant files from {{commit instant to rollback}} before completing the rollback operation itself, which is what I had thought 
when I was working on my rollback fix(es). But it seems that after [https://github.com/apache/hudi/pull/8849] this is (retroactively) incorrect as now we are deleting instant files from {{commit instant to rollback}} after completing the rollback instant, leaving rollback operation as a special type of case where it is possible for rollback instant to be complete even if the actual rollback operation has not fully completed (due to failing after completing the rollback instant but before cleaning up instant files of `{{{}commit instant to rollback{}}} ). Although [https://github.com/apache/hudi/pull/8849] handles this by delegating the deleting of instant files from `{{{}commit instant to rollback{}}} to some clean rollbackFailedWrites operation, I think the intention/invariants/rules of how rollback operates is a bit ambiguous to me and something that should be reconciled. Also after taking another look at the reason for [https://github.com/apache/hudi/pull/8849] I think the fix there can be reverted and handled alternatively, since it seems to me that fixing/finding bugs with getPendingRollbackInfo and prevent
[GitHub] [hudi] praneethh opened a new issue, #9469: [SUPPORT] Exception when using MERGE INTO
praneethh opened a new issue, #9469: URL: https://github.com/apache/hudi/issues/9469 I'm trying to use merge into and perform partial update on the target data but getting the following error: ``` java.lang.UnsupportedOperationException: MERGE INTO TABLE is not supported temporarily. at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:718) at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:67) at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78) ``` Steps to reproduce: 1) Load the target table ``` val df = Seq(("1","neo","2023-08-04 12:00:00","2023-08-04 12:00:00","2023-08-04")).toDF("emp_id", "emp_name", "log_ts", "load_ts", "log_dt") df.select(col("emp_id").cast("int"),col("emp_name").cast("string"),col("log_ts").cast("timestamp"),col("load_ts").cast("timestamp"),col("log_dt").cast("date")) res0.write.format("hudi") .option("hoodie.payload.ordering.field", "load_ts") .option("hoodie.datasource.write.recordkey.field", "emp_id") .option("hoodie.datasource.write.partitionpath.field", "log_dt") .option("hoodie.index.type","GLOBAL_SIMPLE") .option("hoodie.table.name", "hudi_test") .option("hoodie.simple.index.update.partition.path", "false") .option("hoodie.datasource.write.precombine.field", "load_ts") .option("hoodie.datasource.write.payload.class","org.apache.hudi.common.model.PartialUpdateAvroPayload") .option("hoodie.datasource.write.reconcile.schema","true") .option("hoodie.schema.on.read.enable","true") .option("hoodie.datasource.write.hive_style_partitioning", "true") 
.option("hoodie.datasource.write.row.writer.enable","false") .option("hoodie.datasource.hive_sync.enable","true") .option("hoodie.datasource.hive_sync.database","pharpan") .option("hoodie.datasource.hive_sync.table", "hudi_test") .option("hoodie.datasource.hive_sync.partition_fields", "partitionId") .option("hoodie.datasource.hive_sync.ignore_exceptions", "true") .option("hoodie.datasource.hive_sync.mode", "hms") .option("hoodie.datasource.hive_sync.use_jdbc", "false") .option("hoodie.datasource.write.operation","upsert") .mode("append") .save("gs://sample_bucket/hudi_sample_output_data") ``` 2) Load the incremental data ``` val df2 = Seq(("1","neo","2023-08-05 14:00:00","2023-08-04 12:00:00","2023-08-05"),("2","trinity","2023-08-05 14:00:00","2023-08-05 15:00:00","2023-08-05")).toDF("emp_id", "emp_name", "log_ts","load_ts","log_dt") df2.select(col("emp_id").cast("int"),col("emp_name").cast("string"),col("log_ts").cast("timestamp"),col("load_ts").cast("timestamp"),col("log_dt").cast("date")) res2.createOrReplaceTempView("incremental_data") ``` 3) Perform merge ``` val sqlPartialUpdate = s""" | merge into pharpan.hudi_test as target | using ( | select * from incremental_data | ) source | on target.emp_id = source.emp_id | when matched then | update set target.log_ts = source.log_ts, target.log_dt = source.log_dt | when not matched then insert * """.stripMargin spark.sql(sqlPartialUpdate) ``` Hudi verison: 0.13.1 Using "org.apache.hudi.common.model.PartialUpdateAvroPayload" for partial update. Can someone please help in resolving this error? Also, please share the documentation on using MERGE INTO if I'm using it in the wrong way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
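One known cause of `MERGE INTO TABLE is not supported temporarily` is launching Spark without Hudi's SQL extensions, so the plan falls through to Spark's default strategy instead of Hudi's MERGE INTO command. A hedged suggestion, exact settings depend on the Spark/Hudi versions in the deployment (the catalog setting applies to Spark 3.2+, and the bundle coordinates below are only an example):

```shell
# Hudi's SQL extension rewrites MERGE INTO into a Hudi command; without it,
# Spark's own planner raises "MERGE INTO TABLE is not supported temporarily".
spark-shell \
  --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
```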
[GitHub] [hudi] hudi-bot commented on pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
hudi-bot commented on PR #9459: URL: https://github.com/apache/hudi/pull/9459#issuecomment-1682849353 ## CI report: * 170678f0e7c429406a4565d85e77367908c1fb4b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19340) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Commented] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755672#comment-17755672 ] Krishen Bhan commented on HUDI-6596: {quote} I think we should use a different name to skipLocking, if acquiring the lock is skipped because we already acquired the lock, then we should use a different variable something like isLockAcquired or something. {quote} I think that was the convention I noticed, but sure, I can address that once I post the PR for review, thanks! {quote} Without complicating the rollback logic, let us see all the cases where we use rollback. 1. Rollback failed writes: Lock has to be acquired until scheduling the rollback plans for pending instantsToRollback and for execution it need not acquire a lock. 2. Rollback a specific instant: Only schedule step needs to be under a lock. 3. Restore operation: Entire operation needs to be under a lock. For rollbackFailedWrites method, break it down to two Stages Stage 1: Scheduling stage Step 1: Acquire lock and reload active timeline Step 2: getInstantsToRollback Step 3: removeInflightFilesAlreadyRolledBack Step 4: getPendingRollbackInfos Step 5: Use existing plan or schedule rollback Step 6: Release lock Stage 2: Execution stage Step 7: Check if heartbeat exist for pending rollback plan. If yes abort else start an heartbeat and proceed further for executing it. {quote} For now, the intention in this ticket is to focus just on (2) `Rollback a specific instant:` . Depending on how this implementation goes, I think we could follow your approach for (1) `rollbackFailedWrites` when I create a ticket to address that. Sorry, I should rename this JIRA ticket to clarify that. {quote} Rollback operation are not that common. We only do rollback if something fails. So, it is not like .clean or .commit operations. So, we should be ok in seeing some noise. 
{quote} The issue is that although the chance of an individual job transiently failing on an upsert is low, as we add more concurrent writers to our pool of upsert jobs on a dataset, the chance that at least one upsert job will fail increases. In addition, given underlying infrastructure (like Spark/YARN) service degradations, which we've seen internally in our organization, it's possible that all writers fail during an upsert/rollback in the same window of time. This means that we should try to gracefully/resiliently account for the chance that there is a concurrent rollback going on during a job's upsert operation, or even a concurrent rollback that has itself failed. Although locking the table during a rollback is out of the question, we can still go with an approach like the one I suggested in https://issues.apache.org/jira/browse/HUDI-6596?focusedCommentId=17751201&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17751201 , to greatly reduce the chance that sporadic rollbacks/failures will cause all concurrent upsert jobs to fail. > Propose rollback implementation changes to guard against concurrent jobs > - > > Key: HUDI-6596 > URL: https://issues.apache.org/jira/browse/HUDI-6596 > Project: Apache Hudi > Issue Type: Wish >Reporter: Krishen Bhan >Priority: Trivial > > h1. Issue > The existing rollback API in 0.14 > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877] > executes a rollback plan, either taking in an existing rollback plan > provided by the caller for a previous rollback attempt, or scheduling a > new rollback instant if none is provided. 
Currently it is not safe for two > concurrent jobs to call this API (when skipLocking=False and the callers > aren't already holding a lock), as this can lead to an issue where multiple > rollback requested plans are created or two jobs are executing the same > rollback instant at the same time. > h1. Proposed change > One way to resolve this issue is to refactor this rollback function such that > if skipLocking=false, the following steps are followed > # Acquire the table lock > # Reload the active timeline > # Look at the active timeline to see if there is a inflight rollback instant > from a previous rollback attempt, if it exists then assign this is as the > rollback plan to execute. Also, check if a pending rollback plan was passed > in by caller. Then it executes the following steps depending on whether the > caller passed a pending rollback instant plan. > ## [a] If a pending inflight rollback plan was passed in by caller, then > check that there is a previous attempted rollback instant on timeline (and > that the instant times match) and continue to use this rollback plan.
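The lock-scoping proposal discussed above (schedule the rollback under the table lock, execute it outside the lock guarded by a heartbeat) can be sketched as a small self-contained model. All names here (`RollbackCoordinator`, `scheduleUnderLock`, the boolean heartbeat flag) are hypothetical illustrations, not Hudi's actual timeline/transaction API; the point is only the shape of the two stages:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the proposed two-stage rollback; stands in for
// Hudi's real timeline, transaction manager, and heartbeat machinery.
class RollbackCoordinator {
    private final ReentrantLock tableLock = new ReentrantLock();
    private final Set<String> requestedPlans = new HashSet<>();
    private final Set<String> completed = new HashSet<>();

    /** Stage 1: scheduling, done entirely under the table lock. */
    String scheduleUnderLock(String instantToRollback) {
        tableLock.lock();
        try {
            // "Reload the active timeline": re-check state under the lock so
            // two writers cannot create duplicate rollback plans.
            if (completed.contains(instantToRollback)) {
                return null; // already rolled back by a concurrent job
            }
            requestedPlans.add(instantToRollback); // reuse existing plan or create one
            return instantToRollback;
        } finally {
            tableLock.unlock();
        }
    }

    /** Stage 2: execution, outside the lock, guarded by a heartbeat check. */
    boolean execute(String plan, boolean heartbeatExists) {
        if (plan == null || heartbeatExists) {
            return false; // another writer is actively executing; abort
        }
        // (a real implementation would start its own heartbeat here,
        //  then run the rollback plan)
        requestedPlans.remove(plan);
        completed.add(plan);
        return true;
    }
}
```

Scheduling is idempotent (a second caller gets the same plan back rather than a duplicate requested instant), and execution aborts when a live heartbeat shows another writer already owns the plan, which is the failure mode described in this ticket.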
[jira] [Commented] (HUDI-4756) Clean up usages of "assume.date.partition" config within hudi
[ https://issues.apache.org/jira/browse/HUDI-4756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755664#comment-17755664 ] Lin Liu commented on HUDI-4756: --- [~shivnarayan], would you please shed some light on the background of this task, e.g., why isn't this configuration used, and how can we check that all usages of this configuration have been removed from our products? Thanks. > Clean up usages of "assume.date.partition" config within hudi > - > > Key: HUDI-4756 > URL: https://issues.apache.org/jira/browse/HUDI-4756 > Project: Apache Hudi > Issue Type: Improvement > Components: configs >Reporter: sivabalan narayanan >Assignee: Lin Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > looks like "assume.date.partition" is not used anywhere within hudi. let's > clean up the usages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
hudi-bot commented on PR #9468: URL: https://github.com/apache/hudi/pull/9468#issuecomment-1682773010 ## CI report: * ac44e8c1ee6266c53a613ec96dbd89a7223da4c7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19341) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
hudi-bot commented on PR #9468: URL: https://github.com/apache/hudi/pull/9468#issuecomment-1682762105 ## CI report: * ac44e8c1ee6266c53a613ec96dbd89a7223da4c7 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-6718) Concurrent cleaner commit same instance conflict
[ https://issues.apache.org/jira/browse/HUDI-6718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-6718: -- Status: Patch Available (was: In Progress) > Concurrent cleaner commit same instance conflict > - > > Key: HUDI-6718 > URL: https://issues.apache.org/jira/browse/HUDI-6718 > Project: Apache Hudi > Issue Type: Bug > Components: cleaning, multi-writer, table-service >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > Timeline > > {code:java} > -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 > 20230816195843234.commit.requested > -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 > 20230816195845557.commit.requested > -rw-r--r-- 1 jon wheel 2.2K Aug 16 19:58 20230816195843234.inflight > -rw-r--r-- 1 jon wheel 813B Aug 16 19:58 20230816195845557.inflight > -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195845557.commit > -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195843234.commit > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 > 20230816195855285.clean.requested > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.inflight > -rw-r--r-- 1 jon wheel 1.8K Aug 16 19:58 > 20230816195855389.clean.requested > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean {code} > requests: > {code:java} > avrocat hudi/output/.hoodie/20230816195855285.clean.requested > {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": > "20230816195654386", "action": "commit", "state": "COMPLETED"}}, > "lastCompletedCommitTimestamp": "20230816195845557", "policy": > "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, > "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": > {"1970/01/01": [{"filePath": {"string": > "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": > {"array": []}} {code} > {code:java} > 
avrocat hudi/output/.hoodie/20230816195855389.clean.requested > {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": > "20230816195704584", "action": "commit", "state": "COMPLETED"}}, > "lastCompletedCommitTimestamp": "20230816195845557", "policy": > "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, > "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": > {"1970/01/01": [{"filePath": {"string": > "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}], "1970/01/20": [{"filePath": > {"string": > "file:/tmp/hudi/output/1970/01/20/05942caf-2d53-4345-845c-5e42abaca797-0_0-1454-2121_20230816195635690.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": > {"array": []}} > {code} > Console output: > notice transaction starts twice for the same instance > {code:java} > 424775 [pool-75-thread-1] INFO > org.apache.hudi.table.action.clean.CleanActionExecutor [] - Finishing > previously unfinished cleaner > instant=[==>20230816195855285__clean__INFLIGHT__20230816195855525] > 424775 [pool-75-thread-1] INFO > org.apache.hudi.table.action.clean.CleanActionExecutor [] - Using > cleanerParallelism: 1 > 424779 [pool-91-thread-1] INFO > org.apache.hudi.common.table.timeline.HoodieActiveTimeline [] - Loaded > instants upto : > Option{val=[==>20230816195855389__clean__REQUESTED__20230816195855634]} > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.TransactionManager [] - Transaction > starting for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest > completed transaction instant Optional.empty > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.LockManager [] - LockProvider > org.apache.hudi.client.transaction.lock.InProcessLockProvider > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > 
file:/tmp/hudi/output, Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 0, > Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRING > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > file:/tmp/hudi/output, Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 1, > Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRED > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.TransactionManager [] - Transaction > started for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest > complet
[jira] [Updated] (HUDI-6718) Concurrent cleaner commit same instance conflict
[ https://issues.apache.org/jira/browse/HUDI-6718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-6718: -- Status: In Progress (was: Open) > Concurrent cleaner commit same instance conflict > - > > Key: HUDI-6718 > URL: https://issues.apache.org/jira/browse/HUDI-6718 > Project: Apache Hudi > Issue Type: Bug > Components: cleaning, multi-writer, table-service >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > Timeline > > {code:java} > -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 > 20230816195843234.commit.requested > -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 > 20230816195845557.commit.requested > -rw-r--r-- 1 jon wheel 2.2K Aug 16 19:58 20230816195843234.inflight > -rw-r--r-- 1 jon wheel 813B Aug 16 19:58 20230816195845557.inflight > -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195845557.commit > -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195843234.commit > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 > 20230816195855285.clean.requested > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.inflight > -rw-r--r-- 1 jon wheel 1.8K Aug 16 19:58 > 20230816195855389.clean.requested > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean {code} > requests: > {code:java} > avrocat hudi/output/.hoodie/20230816195855285.clean.requested > {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": > "20230816195654386", "action": "commit", "state": "COMPLETED"}}, > "lastCompletedCommitTimestamp": "20230816195845557", "policy": > "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, > "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": > {"1970/01/01": [{"filePath": {"string": > "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": > {"array": []}} {code} > {code:java} > avrocat 
hudi/output/.hoodie/20230816195855389.clean.requested > {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": > "20230816195704584", "action": "commit", "state": "COMPLETED"}}, > "lastCompletedCommitTimestamp": "20230816195845557", "policy": > "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, > "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": > {"1970/01/01": [{"filePath": {"string": > "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}], "1970/01/20": [{"filePath": > {"string": > "file:/tmp/hudi/output/1970/01/20/05942caf-2d53-4345-845c-5e42abaca797-0_0-1454-2121_20230816195635690.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": > {"array": []}} > {code} > Console output: > notice transaction starts twice for the same instance > {code:java} > 424775 [pool-75-thread-1] INFO > org.apache.hudi.table.action.clean.CleanActionExecutor [] - Finishing > previously unfinished cleaner > instant=[==>20230816195855285__clean__INFLIGHT__20230816195855525] > 424775 [pool-75-thread-1] INFO > org.apache.hudi.table.action.clean.CleanActionExecutor [] - Using > cleanerParallelism: 1 > 424779 [pool-91-thread-1] INFO > org.apache.hudi.common.table.timeline.HoodieActiveTimeline [] - Loaded > instants upto : > Option{val=[==>20230816195855389__clean__REQUESTED__20230816195855634]} > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.TransactionManager [] - Transaction > starting for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest > completed transaction instant Optional.empty > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.LockManager [] - LockProvider > org.apache.hudi.client.transaction.lock.InProcessLockProvider > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > 
file:/tmp/hudi/output, Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 0, > Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRING > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > file:/tmp/hudi/output, Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 1, > Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRED > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.TransactionManager [] - Transaction > started for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest > completed transact
[jira] [Updated] (HUDI-6718) Concurrent cleaner commit same instance conflict
[ https://issues.apache.org/jira/browse/HUDI-6718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6718: - Labels: pull-request-available (was: ) > Concurrent cleaner commit same instance conflict > - > > Key: HUDI-6718 > URL: https://issues.apache.org/jira/browse/HUDI-6718 > Project: Apache Hudi > Issue Type: Bug > Components: cleaning, multi-writer, table-service >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > > Timeline > > {code:java} > -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 > 20230816195843234.commit.requested > -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 > 20230816195845557.commit.requested > -rw-r--r-- 1 jon wheel 2.2K Aug 16 19:58 20230816195843234.inflight > -rw-r--r-- 1 jon wheel 813B Aug 16 19:58 20230816195845557.inflight > -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195845557.commit > -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195843234.commit > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 > 20230816195855285.clean.requested > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.inflight > -rw-r--r-- 1 jon wheel 1.8K Aug 16 19:58 > 20230816195855389.clean.requested > -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean {code} > requests: > {code:java} > avrocat hudi/output/.hoodie/20230816195855285.clean.requested > {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": > "20230816195654386", "action": "commit", "state": "COMPLETED"}}, > "lastCompletedCommitTimestamp": "20230816195845557", "policy": > "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, > "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": > {"1970/01/01": [{"filePath": {"string": > "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": > {"array": []}} {code} > {code:java} > avrocat 
hudi/output/.hoodie/20230816195855389.clean.requested > {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": > "20230816195704584", "action": "commit", "state": "COMPLETED"}}, > "lastCompletedCommitTimestamp": "20230816195845557", "policy": > "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, > "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": > {"1970/01/01": [{"filePath": {"string": > "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}], "1970/01/20": [{"filePath": > {"string": > "file:/tmp/hudi/output/1970/01/20/05942caf-2d53-4345-845c-5e42abaca797-0_0-1454-2121_20230816195635690.parquet"}, > "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": > {"array": []}} > {code} > Console output: > notice transaction starts twice for the same instance > {code:java} > 424775 [pool-75-thread-1] INFO > org.apache.hudi.table.action.clean.CleanActionExecutor [] - Finishing > previously unfinished cleaner > instant=[==>20230816195855285__clean__INFLIGHT__20230816195855525] > 424775 [pool-75-thread-1] INFO > org.apache.hudi.table.action.clean.CleanActionExecutor [] - Using > cleanerParallelism: 1 > 424779 [pool-91-thread-1] INFO > org.apache.hudi.common.table.timeline.HoodieActiveTimeline [] - Loaded > instants upto : > Option{val=[==>20230816195855389__clean__REQUESTED__20230816195855634]} > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.TransactionManager [] - Transaction > starting for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest > completed transaction instant Optional.empty > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.LockManager [] - LockProvider > org.apache.hudi.client.transaction.lock.InProcessLockProvider > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > 
file:/tmp/hudi/output, Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 0, > Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRING > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path > file:/tmp/hudi/output, Lock Instance > java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 1, > Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRED > 424779 [pool-91-thread-1] INFO > org.apache.hudi.client.transaction.TransactionManager [] - Transaction > started for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest > completed tra
[GitHub] [hudi] jonvex opened a new pull request, #9468: [HUDI-6718] Check Timeline Before Transitioning Inflight Clean in Multiwriter Scenario
jonvex opened a new pull request, #9468: URL: https://github.com/apache/hudi/pull/9468 ### Change Logs If two cleans start at nearly the same time, they will both attempt to execute the same clean instants. This does not cause any data corruption, but it will cause a writer to fail when it attempts to create the commit in the timeline, because the commit will have already been written by the first writer. Now, we check the timeline before transitioning state. ### Impact No writers will fail in this scenario now. ### Risk level (write none, low medium or high below) low ### Documentation Update N/A ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
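The fix described in this PR ("check the timeline before transitioning state") can be sketched as a minimal model. `CleanStateTransitioner` and its method are hypothetical names, not actual Hudi classes; the essential behavior is that the second concurrent cleaner detects the already-completed instant and skips instead of failing:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of an idempotent inflight -> complete transition for a
// clean instant; stands in for re-reading the Hudi timeline under a lock.
class CleanStateTransitioner {
    private final Set<String> completedCleans = ConcurrentHashMap.newKeySet();

    /** Returns true if this writer performed the transition, false if a
     *  concurrent cleaner already completed the instant (skip, don't fail). */
    boolean transitionInflightToComplete(String cleanInstant) {
        // add() is atomic: exactly one writer "creates the completed file",
        // every other writer observes it already exists and backs off
        return completedCleans.add(cleanInstant);
    }
}
```

In the failure mode from HUDI-6718, both cleaners reached the file-creation step and the loser threw; with the check, the loser simply observes the completed instant on the reloaded timeline and returns without error.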
[jira] [Updated] (HUDI-6718) Concurrent cleaner commit same instance conflict
[ https://issues.apache.org/jira/browse/HUDI-6718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-6718: -- Description: Timeline {code:java} -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 20230816195843234.commit.requested -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 20230816195845557.commit.requested -rw-r--r-- 1 jon wheel 2.2K Aug 16 19:58 20230816195843234.inflight -rw-r--r-- 1 jon wheel 813B Aug 16 19:58 20230816195845557.inflight -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195845557.commit -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195843234.commit -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.requested -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.inflight -rw-r--r-- 1 jon wheel 1.8K Aug 16 19:58 20230816195855389.clean.requested -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean {code} requests: {code:java} avrocat hudi/output/.hoodie/20230816195855285.clean.requested {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": "20230816195654386", "action": "commit", "state": "COMPLETED"}}, "lastCompletedCommitTimestamp": "20230816195845557", "policy": "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": {"1970/01/01": [{"filePath": {"string": "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": {"array": []}} {code} {code:java} avrocat hudi/output/.hoodie/20230816195855389.clean.requested {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": "20230816195704584", "action": "commit", "state": "COMPLETED"}}, "lastCompletedCommitTimestamp": "20230816195845557", "policy": "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": {"1970/01/01": [{"filePath": 
{"string": "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, "isBootstrapBaseFile": {"boolean": false}}], "1970/01/20": [{"filePath": {"string": "file:/tmp/hudi/output/1970/01/20/05942caf-2d53-4345-845c-5e42abaca797-0_0-1454-2121_20230816195635690.parquet"}, "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": {"array": []}} {code} Console output: notice transaction starts twice for the same instance {code:java} 424775 [pool-75-thread-1] INFO org.apache.hudi.table.action.clean.CleanActionExecutor [] - Finishing previously unfinished cleaner instant=[==>20230816195855285__clean__INFLIGHT__20230816195855525] 424775 [pool-75-thread-1] INFO org.apache.hudi.table.action.clean.CleanActionExecutor [] - Using cleanerParallelism: 1 424779 [pool-91-thread-1] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline [] - Loaded instants upto : Option{val=[==>20230816195855389__clean__REQUESTED__20230816195855634]} 424779 [pool-91-thread-1] INFO org.apache.hudi.client.transaction.TransactionManager [] - Transaction starting for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest completed transaction instant Optional.empty 424779 [pool-91-thread-1] INFO org.apache.hudi.client.transaction.lock.LockManager [] - LockProvider org.apache.hudi.client.transaction.lock.InProcessLockProvider 424779 [pool-91-thread-1] INFO org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path file:/tmp/hudi/output, Lock Instance java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 0, Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRING 424779 [pool-91-thread-1] INFO org.apache.hudi.client.transaction.lock.InProcessLockProvider [] - Base Path file:/tmp/hudi/output, Lock Instance java.util.concurrent.locks.ReentrantReadWriteLock@78f60539[Write locks = 1, Read locks = 0], Thread pool-91-thread-1, In-process lock state ACQUIRED 424779 
[pool-91-thread-1] INFO org.apache.hudi.client.transaction.TransactionManager [] - Transaction started for Option{val=[==>20230816195855285__clean__INFLIGHT]} with latest completed transaction instant Optional.empty {code} The following pr exposed the issue [https://github.com/apache/hudi/pull/8602] This does not cause data corruption. Writer needs to be restarted was: Timeline {code:java} -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 20230816195843234.commit.requested -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 20230816195845557.commit.requested -rw-r--r-- 1 jon wheel 2.2K Aug 16 19:58 20230816195843234.inflight -rw-r--r-- 1 jon wheel 813B Aug 16 19:58 20230816195845557.inflight -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195845557.commit -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195843234.commit -rw-r-
[jira] [Created] (HUDI-6718) Concurrent cleaner commit same instance conflict
Jonathan Vexler created HUDI-6718: - Summary: Concurrent cleaner commit same instance conflict Key: HUDI-6718 URL: https://issues.apache.org/jira/browse/HUDI-6718 Project: Apache Hudi Issue Type: Bug Components: cleaning, multi-writer, table-service Reporter: Jonathan Vexler Assignee: Jonathan Vexler Timeline {code:java} -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 20230816195843234.commit.requested -rw-r--r-- 1 jon wheel 0B Aug 16 19:58 20230816195845557.commit.requested -rw-r--r-- 1 jon wheel 2.2K Aug 16 19:58 20230816195843234.inflight -rw-r--r-- 1 jon wheel 813B Aug 16 19:58 20230816195845557.inflight -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195845557.commit -rw-r--r-- 1 jon wheel 2.6K Aug 16 19:58 20230816195843234.commit -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.requested -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean.inflight -rw-r--r-- 1 jon wheel 1.8K Aug 16 19:58 20230816195855389.clean.requested -rw-r--r-- 1 jon wheel 1.7K Aug 16 19:58 20230816195855285.clean {code} requests: {code:java} avrocat hudi/output/.hoodie/20230816195855285.clean.requested {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": "20230816195654386", "action": "commit", "state": "COMPLETED"}}, "lastCompletedCommitTimestamp": "20230816195845557", "policy": "KEEP_LATEST_COMMITS", "filesToBeDeletedPerPartition": {"map": {}}, "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": {"1970/01/01": [{"filePath": {"string": "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": {"array": []}} {code} {code:java} avrocat hudi/output/.hoodie/20230816195855389.clean.requested {"earliestInstantToRetain": {"HoodieActionInstant": {"timestamp": "20230816195704584", "action": "commit", "state": "COMPLETED"}}, "lastCompletedCommitTimestamp": "20230816195845557", "policy": "KEEP_LATEST_COMMITS", 
"filesToBeDeletedPerPartition": {"map": {}}, "version": {"int": 2}, "filePathsToBeDeletedPerPartition": {"map": {"1970/01/01": [{"filePath": {"string": "file:/tmp/hudi/output/1970/01/01/f66cf644-9e9f-477f-863c-eb62d1c6b14d-0_0-1391-2009_20230816195619275.parquet"}, "isBootstrapBaseFile": {"boolean": false}}], "1970/01/20": [{"filePath": {"string": "file:/tmp/hudi/output/1970/01/20/05942caf-2d53-4345-845c-5e42abaca797-0_0-1454-2121_20230816195635690.parquet"}, "isBootstrapBaseFile": {"boolean": false}}]}}, "partitionsToBeDeleted": {"array": []}} {code} The following pr exposed the issue [https://github.com/apache/hudi/pull/8602] This does not cause data corruption. Writer needs to be restarted
[GitHub] [hudi] hudi-bot commented on pull request #9467: [HUDI-6717] Fix downgrade handler for 0.14.0
hudi-bot commented on PR #9467: URL: https://github.com/apache/hudi/pull/9467#issuecomment-1682688492 ## CI report: * 2ade66c64355778bea62ef8ef81c80b929f50b3f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19339) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Comment Edited] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755645#comment-17755645 ] Surya Prasanna Yalla edited comment on HUDI-6596 at 8/17/23 5:20 PM: - I think we should use a different name than skipLocking; if acquiring the lock is skipped because we already acquired it, then we should use a different variable, something like isLockAcquired. Without complicating the rollback logic, let us look at all the cases where we use rollback. 1. Rollback failed writes: The lock has to be held while scheduling the rollback plans for pending instantsToRollback; execution need not acquire a lock. 2. Rollback a specific instant: Only the schedule step needs to be under a lock. 3. Restore operation: The entire operation needs to be under a lock. For the rollbackFailedWrites method, break it down into two stages. *Stage 1: Scheduling stage* Step 1: Acquire lock and reload active timeline Step 2: getInstantsToRollback Step 3: removeInflightFilesAlreadyRolledBack Step 4: getPendingRollbackInfos Step 5: Use existing plan or schedule rollback Step 6: Release lock *Stage 2: Execution stage* Step 7: Check if a heartbeat exists for the pending rollback plan. If yes, abort; else start a heartbeat and proceed to execute it. Rollback operations are not that common; we only roll back if something fails, so it is not like .clean or .commit operations, and we should be OK seeing some noise. > Propose rollback implementation changes to guard against concurrent jobs > - > > Key: HUDI-6596 > URL: https://issues.apache.org/jira/browse/HUDI-6596 > Project: Apache Hudi > Issue Type: Wish >Reporter: Krishen Bhan >Priority: Trivial > > h1. Issue > The existing rollback API in 0.14 > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877] > executes a rollback plan, either taking in an existing rollback plan > provided by the caller for a previous rollback or attempt, or scheduling a > new rollback instant if none is provided. Currently it is not safe for two > concurrent jobs to call this API (when skipLocking=False and the callers > aren't already holding a lock), as this can lead to an issue where multiple > rollback requested plans are created or two jobs are executing the same > rollback instant at the same time. > h1.
Proposed change > One way to resolve this issue is to refactor this rollback function such that > if skipLocking=false, the following steps are followed > # Acquire the table lock > # Reload the active timeline > # Look at the active timeline to see if there is a inflight rollback instant > from a previous rollback attempt, if it exists then assign this is as the > rollback plan to execute. Also, check if a pending rollback plan was passed > in by caller. Then it executes the following steps depending on whether the > caller passed a pending rollback instant plan. > ## [a] If a pending inflight rollback plan was passed in by caller, then > check that there is a previous attempted rollback instant on timeline (and > that the instant times match) and continue to use this rollback plan. If that > isn't the case, then raise a rollback exception since this means another job > has concurrently already executed this plan. Note that in a valid HUDI > dataset there can be at most one
[jira] [Commented] (HUDI-6596) Propose rollback implementation changes to guard against concurrent jobs
[ https://issues.apache.org/jira/browse/HUDI-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755645#comment-17755645 ] Surya Prasanna Yalla commented on HUDI-6596: I think we should use a different name than skipLocking; if acquiring the lock is skipped because we already acquired it, then we should use a different variable, something like isLockAcquired. Without complicating the rollback logic, let us look at all the cases where we use rollback. 1. Rollback failed writes: The lock has to be held while scheduling the rollback plans for pending instantsToRollback; execution need not acquire a lock. 2. Rollback a specific instant: Only the schedule step needs to be under a lock. 3. Restore operation: The entire operation needs to be under a lock. For the rollbackFailedWrites method, break it down into two stages. *Stage 1: Scheduling stage* Step 1: Acquire lock and reload active timeline Step 2: getInstantsToRollback Step 3: removeInflightFilesAlreadyRolledBack Step 4: getPendingRollbackInfos Step 5: Use existing plan or schedule rollback Step 6: Release lock *Stage 2: Execution stage* Step 7: Check if a heartbeat exists for the pending rollback plan. If yes, abort; else start a heartbeat and proceed to execute it. Rollback operations are not that common; we only roll back if something fails, so it is not like .clean or .commit operations, and we should be OK seeing some noise. > Propose rollback implementation changes to guard against concurrent jobs > - > > Key: HUDI-6596 > URL: https://issues.apache.org/jira/browse/HUDI-6596 > Project: Apache Hudi > Issue Type: Wish >Reporter: Krishen Bhan >Priority: Trivial > > h1.
Issue > The existing rollback API in 0.14 > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java#L877] > executes a rollback plan, either taking in an existing rollback plan > provided by the caller for a previous rollback or attempt, or scheduling a > new rollback instant if none is provided. Currently it is not safe for two > concurrent jobs to call this API (when skipLocking=False and the callers > aren't already holding a lock), as this can lead to an issue where multiple > rollback requested plans are created or two jobs are executing the same > rollback instant at the same time. > h1. Proposed change > One way to resolve this issue is to refactor this rollback function such that > if skipLocking=false, the following steps are followed > # Acquire the table lock > # Reload the active timeline > # Look at the active timeline to see if there is a inflight rollback instant > from a previous rollback attempt, if it exists then assign this is as the > rollback plan to execute. Also, check if a pending rollback plan was passed > in by caller. Then it executes the following steps depending on whether the > caller passed a pending rollback instant plan. > ## [a] If a pending inflight rollback plan was passed in by caller, then > check that there is a previous attempted rollback instant on timeline (and > that the instant times match) and continue to use this rollback plan. If that > isn't the case, then raise a rollback exception since this means another job > has concurrently already executed this plan. Note that in a valid HUDI > dataset there can be at most one rollback instant for a corresponding commit > instant, which is why if we no longer see a pending rollback in timeline in > this phase we can safely assume that it had already been executed to > completion. 
> ## [b] If no pending inflight rollback plan was passed in by caller and no > pending rollback instant was found in timeline earlier, then schedule a new > rollback plan > # Now that a rollback plan and requested rollback instant time has been > assigned, check for an active heartbeat for the rollback instant time. If > there is one, then abort the rollback as that means there is a concurrent job > executing that rollback. If not, then start a heartbeat for that rollback > instant time. > # Release the table lock > # Execute the rollback plan and complete the rollback instant. Regardless of > whether this succeeds or fails with an exception, close the heartbeat. This > increases the chance that the next job that tries to call this rollback API > will follow through with the rollback and not abort due to an active previous > heartbeat > > * These steps will only be enforced for skipLocking=false, since if > skipLocking=true then that means the caller may already be explicitly holding > a table lock. In this case, acquiring the lock again in step (1) will fail. > * Acquiring a lock and reloading timelin
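The two-stage scheme discussed above (scheduling under the table lock, execution guarded by a heartbeat) can be sketched as follows. The lock, heartbeat map, and method names are illustrative stand-ins, not Hudi's actual client API:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class RollbackGuardDemo {
    private final ReentrantLock tableLock = new ReentrantLock();
    // rollback instant time -> id of the job holding an active heartbeat
    private final Map<String, String> heartbeats = new ConcurrentHashMap<>();

    // Stage 1: reuse an existing plan or schedule a new one, under the table lock.
    public String scheduleRollback(Optional<String> pendingInstant) {
        tableLock.lock();
        try {
            // A real client would reload the timeline here and look for an
            // inflight rollback instant from a previous attempt.
            return pendingInstant.orElse("rollback-" + System.nanoTime());
        } finally {
            tableLock.unlock();
        }
    }

    // Stage 2: execute only if no other job holds a heartbeat for this instant.
    public boolean executeRollback(String instant, String jobId) {
        if (heartbeats.putIfAbsent(instant, jobId) != null) {
            return false; // another job is already executing this rollback; abort
        }
        try {
            // ... execute the rollback plan and complete the instant ...
            return true;
        } finally {
            // Close the heartbeat on success or failure so the next caller
            // is not blocked by a stale heartbeat.
            heartbeats.remove(instant);
        }
    }

    public static void main(String[] args) {
        RollbackGuardDemo demo = new RollbackGuardDemo();
        String instant = demo.scheduleRollback(Optional.empty());
        System.out.println(instant + " executed: " + demo.executeRollback(instant, "job-a"));
    }
}
```

The key property is that only the short scheduling phase holds the table lock, while the long-running execution phase relies on the atomic `putIfAbsent` heartbeat claim.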
[jira] [Updated] (HUDI-6701) Explore use of UUID-6/7 as a replacement for current auto generated keys
[ https://issues.apache.org/jira/browse/HUDI-6701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lin Liu updated HUDI-6701: -- Status: In Progress (was: Open) > Explore use of UUID-6/7 as a replacement for current auto generated keys > > > Key: HUDI-6701 > URL: https://issues.apache.org/jira/browse/HUDI-6701 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Assignee: Lin Liu >Priority: Major > Fix For: 1.0.0 > > > Today, we auto generate string keys of the form > (HoodieRecord#generateSequenceId), which are highly compressible, especially compared > to UUIDv1, when stored as a string column inside a parquet file. > {code:java} > public static String generateSequenceId(String instantTime, int > partitionId, long recordIndex) { > return instantTime + "_" + partitionId + "_" + recordIndex; > } > {code} > As part of this task, we'd love to understand: > - Can UUIDv6 or v7 provide a similar compressed storage footprint when written > as a column in a parquet file? > - Can the current format be represented as a 160-bit number, i.e. 2 longs and 1 > int, in storage? Would that save us further in storage costs? > (An orthogonal consideration is the memory needed to hold the key string, which > can be higher than 160 bits. We can discuss this later, once we understand the > storage footprint.) > > Resources: > * https://datatracker.ietf.org/doc/draft-ietf-uuidrev-rfc4122bis/09/ > * https://github.com/uuid6/uuid6-ietf-draft > * https://github.com/uuid6/prototypes
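To make the ticket's second question concrete, here is a rough sketch of packing the current `instantTime_partitionId_recordIndex` key into fixed-width integers. The layout is purely hypothetical: it assumes instant times are numeric `yyyyMMddHHmmssSSS` strings (17 decimal digits, which fit in a signed 64-bit long), and it stores the three fields in a `long[]` for brevity where a real 160-bit layout would use 2 longs plus 1 int:

```java
public class KeyPacking {
    // Current format, mirroring HoodieRecord#generateSequenceId from the ticket.
    public static String generateSequenceId(String instantTime, int partitionId, long recordIndex) {
        return instantTime + "_" + partitionId + "_" + recordIndex;
    }

    // Hypothetical packing: the 17-digit instant time fits in a signed long
    // (10^17 < 2^63), plus a long record index and an int partition id.
    public static long[] pack(String instantTime, int partitionId, long recordIndex) {
        return new long[] {Long.parseLong(instantTime), recordIndex, partitionId};
    }

    // Lossless round trip back to the string form.
    public static String unpack(long[] packed) {
        return generateSequenceId(Long.toString(packed[0]), (int) packed[2], packed[1]);
    }

    public static void main(String[] args) {
        long[] packed = pack("20230816195843234", 0, 42);
        System.out.println(unpack(packed)); // 20230816195843234_0_42
    }
}
```

Whether such a fixed-width encoding actually beats dictionary/RLE-compressed strings in parquet is exactly what the ticket proposes to measure.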
[GitHub] [hudi] hudi-bot commented on pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
hudi-bot commented on PR #9459: URL: https://github.com/apache/hudi/pull/9459#issuecomment-1682500472 ## CI report: * 768e40ce1d035a021d88e5409f92bab846e4e4c0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19333) * 170678f0e7c429406a4565d85e77367908c1fb4b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19340)
[GitHub] [hudi] hudi-bot commented on pull request #9459: [HUDI-6683][FOLLOW-UP] Json & Avro Kafka Source Minor Refactor & Added null Kafka Key test cases
hudi-bot commented on PR #9459: URL: https://github.com/apache/hudi/pull/9459#issuecomment-1682486781 ## CI report: * 768e40ce1d035a021d88e5409f92bab846e4e4c0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19333) * 170678f0e7c429406a4565d85e77367908c1fb4b UNKNOWN
[GitHub] [hudi] linliu-code commented on pull request #9466: [HUDI-4756] Remove unused config "hoodie.assume.date.partitioning"
linliu-code commented on PR #9466: URL: https://github.com/apache/hudi/pull/9466#issuecomment-1682464922 > What do you mean for unused? This is from the task. I don't have enough context to confirm this. @nsivabalan Can you explain this?
[GitHub] [hudi] haitham-eltaweel commented on issue #9460: Not valid month error when pulling new data from Oracle DB using HoodieDeltaStreamer
haitham-eltaweel commented on issue #9460: URL: https://github.com/apache/hudi/issues/9460#issuecomment-1682454495 > @haitham-eltaweel What date format is `MODIFID_DT` present in oracle? what is the datatype? It is timestamp type.
[GitHub] [hudi] hudi-bot commented on pull request #9408: [HUDI-6671] Support 'alter table add partition' sql
hudi-bot commented on PR #9408: URL: https://github.com/apache/hudi/pull/9408#issuecomment-1682372687 ## CI report: * fadda82b0444d09d8718bc9002fbd1964e18bbf2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19332) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19338)
[jira] [Updated] (HUDI-6242) Format changes for Hudi 1.X release line
[ https://issues.apache.org/jira/browse/HUDI-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6242: - Description: This EPIC tracks changes to the Hudi storage format. Format change is anything that changes any bits related to - *Timeline* : active or archived timeline contents, file names. - {*}Base Files{*}: file format versions, any changes to any data types, file footers, file names. - {*}Log Files{*}: Block structure, content, names. - {*}Metadata Table{*}: (should we call this index table instead?) partition names, number of file groups, key/value schema and metadata to MDT row mappings. - {*}Table properties{*}: What's written to hoodie.properties. - *Marker files* : Can be left to the writer implementation. h2. Change summary: The following functionality should be supportable by the new format tech specs (at a minimum) Flexibility : - Ability to mix different types of base files within a single table or even a single file group (e.g images, json, vectors ...) - Easy integration of metadata for JVM and non-jvm clients Metafields : - Should _recordkey be uuid special handling? Additional Info: - Support encoding of watermarks/event time fields as first class citizen, for handling late arriving data. - Position based skipping of base file - Additional metadata to avoid more RPCs to scan base file/log blocks. - ML/Column family use-case? - Support having changeset of columns in each write, other headers Log : - Support writing updates as deletes and inserts, instead of logging as update to base file. - CDC format is GA. Table organization: - Support different logical partitions on the same data - Storage of table spread across buckets/root folders - Decouple table location from timeline, metadata. They can all be in different places Concurrency/Timeline: - Ability to support general purpose multi-table transactions, esp between data and metadata tables. 
- Support lockless/non-blocking transactions, where writers don't block each other even in face of conflicts. - Support for long lived instants in timeline, break down distinction between active/archived - Support checking of uniqueness constraints, even in face of two concurrent insert transactions. - Support precise time-travel queries - Support time-travel writes. - Support schema history tracking and aid in schema evol impl. - TrueTime store/support for instant times - No more separate rollback action. make it a new state. Metadata table : - Encode filegroup ID and commit time along with file metadata Table Properties: - Partitioning information/indexing info
[GitHub] [hudi] hudi-bot commented on pull request #9467: [HUDI-6717] Fix downgrade handler for 0.14.0
hudi-bot commented on PR #9467: URL: https://github.com/apache/hudi/pull/9467#issuecomment-1682297929 ## CI report: * 2ade66c64355778bea62ef8ef81c80b929f50b3f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19339)
[GitHub] [hudi] hudi-bot commented on pull request #9466: [HUDI-4756] Remove unused config "hoodie.assume.date.partitioning"
hudi-bot commented on PR #9466: URL: https://github.com/apache/hudi/pull/9466#issuecomment-1682297848 ## CI report: * d61eae7b243d92629914d2b95637922db6be3b08 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=19337)
[GitHub] [hudi] ad1happy2go commented on issue #9319: [SUPPORT] how to use HiveSyncConfig instead of hive configs in DataSourceWriteOptions object
ad1happy2go commented on issue #9319: URL: https://github.com/apache/hudi/issues/9319#issuecomment-1682287275 @zlinsc Yes, we were standardising configs for the new release, and HoodieSyncConfig should be used for all meta-sync-related configuration.
[GitHub] [hudi] hudi-bot commented on pull request #9467: [HUDI-6717] Fix downgrade handler for 0.14.0
hudi-bot commented on PR #9467: URL: https://github.com/apache/hudi/pull/9467#issuecomment-1682284936 ## CI report: * 2ade66c64355778bea62ef8ef81c80b929f50b3f UNKNOWN
[GitHub] [hudi] lokeshj1703 commented on pull request #9467: [HUDI-6717] Fix downgrade handler for 0.14.0
lokeshj1703 commented on PR #9467: URL: https://github.com/apache/hudi/pull/9467#issuecomment-1682243995 @nsivabalan @codope Please review.
[jira] [Updated] (HUDI-6717) Fix downgrade handler for 0.14.0
[ https://issues.apache.org/jira/browse/HUDI-6717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-6717: - Labels: pull-request-available (was: ) > Fix downgrade handler for 0.14.0 > > > Key: HUDI-6717 > URL: https://issues.apache.org/jira/browse/HUDI-6717 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > > Since the log block version (due to the delete block change) has been upgraded in > 0.14.0, delete blocks cannot be read by 0.13.0 or earlier. > Similarly, the addition of the record level index field in the metadata table leads to > a column-drop error on downgrade. This Jira aims to fix the downgrade handler to > trigger compaction and delete the metadata table if the user wishes to downgrade from > version 6 (0.14.0) to version 5 (0.13.0).
[jira] [Created] (HUDI-6717) Fix downgrade handler for 0.14.0
Lokesh Jain created HUDI-6717: - Summary: Fix downgrade handler for 0.14.0 Key: HUDI-6717 URL: https://issues.apache.org/jira/browse/HUDI-6717 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain Assignee: Lokesh Jain Since the log block version (due to the delete block change) has been upgraded in 0.14.0, delete blocks cannot be read by 0.13.0 or earlier. Similarly, the addition of the record level index field in the metadata table leads to a column-drop error on downgrade. This Jira aims to fix the downgrade handler to trigger compaction and delete the metadata table if the user wishes to downgrade from version 6 (0.14.0) to version 5 (0.13.0).
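The downgrade flow described in the ticket (compact away log blocks written in the new format, then drop the metadata table, before lowering the table version) can be sketched as below. The `Table` interface and all method names are illustrative only, not Hudi's actual downgrade handler API:

```java
import java.util.ArrayList;
import java.util.List;

public class SixToFiveDowngradeDemo {
    // Minimal stand-in for the operations a downgrade handler needs.
    interface Table {
        boolean hasPendingLogFiles();
        void runFullCompaction();   // rewrites new-format log blocks into base files
        void deleteMetadataTable(); // removes the MDT with the record-level index field
        void setTableVersion(int version);
    }

    // Order matters: compaction must happen before the version is lowered so
    // that a 0.13.0 reader never sees a delete block it cannot parse.
    public static void downgrade(Table table) {
        if (table.hasPendingLogFiles()) {
            table.runFullCompaction();
        }
        table.deleteMetadataTable();
        table.setTableVersion(5);
    }

    public static void main(String[] args) {
        List<String> ops = new ArrayList<>();
        downgrade(new Table() {
            public boolean hasPendingLogFiles() { return true; }
            public void runFullCompaction() { ops.add("compact"); }
            public void deleteMetadataTable() { ops.add("delete-mdt"); }
            public void setTableVersion(int v) { ops.add("version-" + v); }
        });
        System.out.println(ops);
    }
}
```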
[jira] [Assigned] (HUDI-4631) Enhance retries for failed writes w/ write conflicts in a multi writer scenarios
[ https://issues.apache.org/jira/browse/HUDI-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit reassigned HUDI-4631: - Assignee: Sagar Sumit (was: sivabalan narayanan) > Enhance retries for failed writes w/ write conflicts in a multi writer > scenarios > > > Key: HUDI-4631 > URL: https://issues.apache.org/jira/browse/HUDI-4631 > Project: Apache Hudi > Issue Type: Improvement > Components: multi-writer >Reporter: sivabalan narayanan >Assignee: Sagar Sumit >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > > Let's say there are two writers from t0 to t5, and Hudi fails w2 and succeeds with > w1. The user restarts w2, and for the next 5 minutes there are no other > overlapping writers, so the same write from w2 will now succeed. So, whenever > there is a write conflict and the pipeline fails, all the user needs to do is > restart the pipeline or retry ingesting the same batch. > > Ask: can we add retries within Hudi during such failures? In most > cases, users just restart the pipeline anyway. >
[jira] [Updated] (HUDI-4631) Enhance retries for failed writes w/ write conflicts in a multi writer scenarios
[ https://issues.apache.org/jira/browse/HUDI-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-4631: -- Fix Version/s: 1.0.0 > Enhance retries for failed writes w/ write conflicts in a multi writer > scenarios > > > Key: HUDI-4631 > URL: https://issues.apache.org/jira/browse/HUDI-4631 > Project: Apache Hudi > Issue Type: Improvement > Components: multi-writer >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > > Let's say there are two writers from t0 to t5, and Hudi fails w2 and succeeds with > w1. The user restarts w2, and for the next 5 minutes there are no other > overlapping writers, so the same write from w2 will now succeed. So, whenever > there is a write conflict and the pipeline fails, all the user needs to do is > restart the pipeline or retry ingesting the same batch. > > Ask: can we add retries within Hudi during such failures? In most > cases, users just restart the pipeline anyway. >
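The retry the ticket asks for could look roughly like the bounded retry loop below, which automates what users do manually today (restart and re-ingest the same batch). `WriteConflictException`, `Commit`, and `commitWithRetries` are hypothetical names, not Hudi's API:

```java
public class ConflictRetryDemo {
    // Stand-in for the error a writer sees when it loses the conflict
    // resolution check; the name is illustrative.
    static class WriteConflictException extends RuntimeException {}

    @FunctionalInterface
    interface Commit {
        void run();
    }

    // Retry the same batch a bounded number of times with exponential backoff.
    public static boolean commitWithRetries(Commit commit, int maxRetries, long baseBackoffMs) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                commit.run();
                return true; // committed
            } catch (WriteConflictException e) {
                if (attempt == maxRetries) {
                    return false; // give up after the last retry
                }
                try {
                    // Backoff doubles per attempt so retries spread out past
                    // short-lived overlapping writers.
                    Thread.sleep(baseBackoffMs * (1L << attempt));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] conflicts = {2}; // fail the first two attempts, succeed on the third
        boolean ok = commitWithRetries(() -> {
            if (conflicts[0]-- > 0) {
                throw new WriteConflictException();
            }
        }, 3, 10L);
        System.out.println("committed: " + ok);
    }
}
```

Since the ticket notes the conflicting writer usually succeeds on a plain restart, even a small retry budget like this would absorb most transient conflicts.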