[ 
https://issues.apache.org/jira/browse/HUDI-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3355:
-----------------------------
    Status: In Progress  (was: Open)

> Issue with out of order commits in the timeline when ingestion writers using 
> SparkAllowUpdateStrategy
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-3355
>                 URL: https://issues.apache.org/jira/browse/HUDI-3355
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Surya Prasanna Yalla
>            Assignee: tao meng
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>
> Out-of-order commits can happen between two commits C1 and C2 if C2 has a 
> greater timestamp than C1 but completes before C1.
> In our use case, we run clustering asynchronously and want ingestion 
> writers to be given preference over clustering.
> The ingestion writer uses the following configs:
> {noformat}
> "hoodie.clustering.updates.strategy": "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy"
> "hoodie.clustering.rollback.pending.replacecommit.on.conflict": false
> {noformat}
> This allows ingestion writers to ignore pending replacecommits on the 
> timeline and continue writing.
>  
> Consider the following scenario:
> {code:java}
> At instant1
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.inflight -> Started 
> {code}
> {code:java}
> At instant2
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.commit -> Completed {code}
> {code:java}
> At instant3
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.inflight -> Started{code}
> {code:java}
> At instant4
> C1.commit
> C2.commit
> C3.replacecommit -> Completed
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.inflight(continuing) {code}
> {code:java}
> At instant5
> C1.commit
> C2.commit
> C3.replacecommit
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.commit -> Completed (C5 conflicts with C3, but since C3 has a lower 
> timestamp than C4, C3 is not considered during conflict resolution){code}
>  
> Here, the lastSuccessfulCommit value seen by C5 is C4, even though 
> C3 is the one that was committed last.
> Ideally, sorting the timeline should take the transition times into 
> account, i.e. when each instant completed. The timeline would then look 
> something like:
> {code:java}
> C1.commit
> C2.commit
> C4.commit(lastSuccessfulCommit seen by C5)
> C3.replacecommit 
> C5.inflight{code}
> In this case, when C5 is about to complete, it will consider all the 
> commits that completed after C4, which here is C3.
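
The difference between the two orderings can be illustrated with a small, self-contained Python sketch. It is a simplified model of the scenario above, not Hudi's implementation: the `Instant` class, both function names, and the numeric timestamps are hypothetical and chosen only to match the C1..C5 example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instant:
    """Hypothetical model of a timeline instant (not Hudi's HoodieInstant)."""
    name: str
    start_ts: int                 # instant timestamp assigned when the write starts
    completed_ts: Optional[int]   # transition time to completed; None while inflight


def candidates_by_start_time(timeline: List[Instant], last_successful: Instant) -> List[str]:
    """Completed instants whose start timestamp is greater than that of
    last_successful. Models the behaviour described in the report."""
    return [
        i.name
        for i in sorted(timeline, key=lambda i: i.start_ts)
        if i.completed_ts is not None and i.start_ts > last_successful.start_ts
    ]


def candidates_by_completion_time(timeline: List[Instant], last_successful: Instant) -> List[str]:
    """Completed instants that finished after last_successful finished.
    Models the ordering proposed in the report."""
    completed = [i for i in timeline if i.completed_ts is not None]
    return [
        i.name
        for i in sorted(completed, key=lambda i: i.completed_ts)
        if i.completed_ts > last_successful.completed_ts
    ]


# Assumed timestamps: C3 starts before C4 but completes after it,
# while C5 is still inflight.
C1 = Instant("C1", start_ts=1, completed_ts=2)
C2 = Instant("C2", start_ts=3, completed_ts=4)
C3 = Instant("C3", start_ts=5, completed_ts=9)   # replacecommit (clustering)
C4 = Instant("C4", start_ts=6, completed_ts=7)   # lastSuccessfulCommit seen by C5
C5 = Instant("C5", start_ts=8, completed_ts=None)
TIMELINE = [C1, C2, C3, C4, C5]

print(candidates_by_start_time(TIMELINE, C4))       # [] : C3 is missed
print(candidates_by_completion_time(TIMELINE, C4))  # ['C3'] : C3 is checked
```

Under these assumptions, filtering by start timestamp gives C5 nothing to check against, while filtering by completion time surfaces C3 as a conflict candidate.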



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
