[ https://issues.apache.org/jira/browse/HUDI-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raymond Xu updated HUDI-3355:
-----------------------------
    Status: In Progress  (was: Open)

> Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-3355
>                 URL: https://issues.apache.org/jira/browse/HUDI-3355
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Surya Prasanna Yalla
>            Assignee: tao meng
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>
> Out-of-order commits can happen between two commits C1 and C2 when C2 has a
> greater timestamp than C1 but completes before C1.
> In our use case, we run clustering asynchronously and want ingestion
> writers to be given preference over clustering.
> The ingestion writer uses the following configs:
> {noformat}
> "hoodie.clustering.updates.strategy": "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy"
> "hoodie.clustering.rollback.pending.replacecommit.on.conflict": false
> {noformat}
> These settings allow ingestion writers to ignore pending replacecommits on the
> timeline and continue writing.
>
> Consider the following scenario:
> {code:java}
> At instant1
> C1.commit
> C2.commit
> C3.replacecommit.inflight
> C4.inflight -> Started
> {code}
> {code:java}
> At instant2
> C1.commit
> C2.commit
> C3.replacecommit.inflight
> C4.commit -> Completed
> {code}
> {code:java}
> At instant3
> C1.commit
> C2.commit
> C3.replacecommit.inflight
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.inflight -> Started
> {code}
> {code:java}
> At instant4
> C1.commit
> C2.commit
> C3.replacecommit -> Completed
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.inflight (continuing)
> {code}
> {code:java}
> At instant5
> C1.commit
> C2.commit
> C3.replacecommit
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.commit -> Completed (C5 conflicts with C3, but because C3 has a lower
> timestamp than C4, C3 is not considered during conflict resolution)
> {code}
>
> Here, the lastSuccessfulCommit value seen by C5 is C4, even though C3 is the
> one that was committed last.
> Ideally, sorting the timeline should take the transition times into account,
> so the timeline would look something like this:
> {code:java}
> C1.commit
> C2.commit
> C4.commit (lastSuccessfulCommit seen by C5)
> C3.replacecommit
> C5.inflight
> {code}
> In this case, when C5 is about to complete, it will consider all commits
> that completed after C4, which here is C3.
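
The ordering argument in the description can be illustrated with a small toy model. This is only a sketch, not Hudi's implementation: the `Instant` class, its `start_ts` and `completed_at` fields, the three helper functions, and all numeric clock values are hypothetical and chosen solely to reproduce the C1..C5 scenario. It compares picking conflict candidates by instant timestamp with picking them by completion (transition) time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instant:
    name: str                    # label from the scenario, e.g. "C3"
    start_ts: int                # instant timestamp (hypothetical clock value)
    completed_at: Optional[int]  # completion time; None while still inflight


def candidates_by_start_ts(timeline: List[Instant], last_successful: Instant) -> List[str]:
    """Completed instants whose instant timestamp is greater than last_successful's."""
    return [
        i.name
        for i in timeline
        if i.completed_at is not None and i.start_ts > last_successful.start_ts
    ]


def candidates_by_completion(timeline: List[Instant], last_successful: Instant) -> List[str]:
    """Completed instants that finished after last_successful finished."""
    done = [i for i in timeline if i.completed_at is not None]
    done.sort(key=lambda i: i.completed_at)
    return [i.name for i in done if i.completed_at > last_successful.completed_at]


def order_by_completion(timeline: List[Instant]) -> List[Instant]:
    """Completed instants ordered by completion time, inflight instants last."""
    return sorted(
        timeline,
        key=lambda i: (
            i.completed_at is None,
            i.completed_at if i.completed_at is not None else i.start_ts,
        ),
    )


# State of the timeline just before C5 commits (instant5 in the description).
c1 = Instant("C1", start_ts=1, completed_at=2)
c2 = Instant("C2", start_ts=3, completed_at=4)
c3 = Instant("C3", start_ts=5, completed_at=40)   # replacecommit, finished at instant4
c4 = Instant("C4", start_ts=6, completed_at=20)   # finished at instant2
c5 = Instant("C5", start_ts=30, completed_at=None)  # still inflight
timeline = [c1, c2, c3, c4, c5]

# C5 recorded C4 as its last successful commit when it started.
missed = candidates_by_start_ts(timeline, c4)
seen = candidates_by_completion(timeline, c4)
ordered = [i.name for i in order_by_completion(timeline)]

print("by instant timestamp:", missed)
print("by completion time:  ", seen)
print("completion-ordered:  ", ordered)
```

In this model, filtering by instant timestamp leaves C5 with no candidates to check, while filtering by completion time surfaces C3, which matches the behaviour the reporter asks for.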