[ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7779:
----------------------------
    Fix Version/s: 0.16.0
                   1.0.0

> Guarding archival to not archive unintended commits
> ---------------------------------------------------
>
>                 Key: HUDI-7779
>                 URL: https://issues.apache.org/jira/browse/HUDI-7779
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: archiving
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>             Fix For: 0.16.0, 1.0.0
>
>
> Archiving commits from active timeline could lead to data consistency issues 
> on some rarest of occasions. We should come up with proper guards to ensure 
> we do not make such unintended archival. 
>  
> Major gap which we wanted to guard is:
> if someone disabled cleaner, archival should account for data consistency 
> issues and ensure it bails out.
> We have a base guarding condition, where archival will stop at the earliest 
> commit to retain based on latest clean commit metadata. But there are few 
> other scenarios that needs to be accounted for. 
>  
> a. Keeping aside replace commits, lets dive into specifics for regular 
> commits and delta commits.
> Say user configured clean commits to 4 and archival configs to 5 and 6. after 
> t10, cleaner is supposed to clean up all file versions created at or before 
> t6. Say cleaner did not run(for whatever reason for next 5 commits). 
>     Archival will certainly be guarded until earliest commit to retain based 
> on latest clean commits. 
> Corner case to consider: 
> A savepoint was added to say t3 and later removed. and still the cleaner was 
> never re-enabled. Even though archival would have been stopped at t3 (until 
> savepoint is present),but once savepoint is removed, if archival is executed, 
> it could archive commit t3. Which means, file versions tracked at t3 is still 
> not yet cleaned by the cleaner. 
> Reasoning: 
> We are good here wrt data consistency. Up until cleaner runs next time, this 
> older file versions might be exposed to the end-user. But time travel query 
> is not intended for already cleaned up commits and hence this is not an 
> issue. None of snapshot, time travel query or incremental query will run into 
> issues as they are not supposed to poll for t3. 
> At any later point, if cleaner is re-enabled, it will take care of cleaning 
> up file versions tracked at t3 commit. Just that for interim period, some 
> older file versions might still be exposed to readers. 
>  
> b. The more tricky part is when replace commits are involved. Since replace 
> commit metadata in active timeline is what ensures the replaced file groups 
> are ignored for reads, before archiving the same, cleaner is expected to 
> clean them up fully. But are there chances that this could go wrong? 
> Corner case to consider. Lets add onto above scenario, where t3 has a 
> savepoint, and t4 is a replace commit which replaced file groups tracked in 
> t3. 
> Cleaner will skip cleaning up files tracked by t3(due to the presence of 
> savepoint), but will clean up t4, t5 and t6. So, earliest commit to retain 
> will be pointing to t6. And say savepoint for t3 is removed, but cleaner was 
> disabled. In this state of the timeline, if archival is executed, (since 
> t3.savepoint is removed), archival might archive t3 and t4.rc.  This could 
> lead to data duplicates as both replaced file groups and new file groups from 
> t4.rc would be exposed as valid file groups. 
>  
> In other words, if we were to summarize the different scenarios: 
> i. replaced file group is never cleaned up. 
>     - ECTR(Earliest commit to retain) is less than this.rc and we are good. 
> ii. replaced file group is cleaned up. 
>     - ECTR is > this.rc and is good to archive.
> iii. tricky: ECTR moved ahead compared to this.rc, but due to savepoint, full 
> clean up did not happen.  After savepoint is removed, and when archival is 
> executed, we should avoid archiving the rc of interest. This is the gap we 
> don't account for as of now.
>  
> We have 3 options to go about to solve this.
> Option A: 
> Let Savepoint deletion flow take care of cleaning up the files its tracking. 
> cons:
> Savepoint's responsibility is not removing any data files. So, from a single 
> user responsibility rule, this may not be right. Also, this clean up might 
> need to do what a clean planner might actually be doing. ie. build file 
> system view, understand if its supposed to be cleaned up already, and then 
> only clean up the files which are supposed to be cleaned up. For eg, if a 
> file group has only one file slice, it should not be cleaned up and scenarios 
> like this. 
>  
> Option B:
> Since archival is the one which might cause data consistency issues, why not 
> archival do the clean up. 
> We need to account for concurrent cleans, failure and retry scenarios etc. 
> Also, we might need to build the file system view and then take a call 
> whether something needs to be cleaned up before archiving something. 
> Cons:
> Again, the single user responsibility rule might be broken. Would be neat if 
> cleaner takes care of deleting data files and archival only takes care of 
> deleting/archiving timeline files. 
>  
> Option C:
> Similar to how cleaner maintain EarliestCommitToRetain, let cleaner track 
> another metadata named "EarliestCommitToArchive". Strictly speaking, earliest 
> commit to retain is meant for incremental cleaner. Archival polls earliest 
> commit to retain and additionally savepointed instants to add guard rails. 
> So, why not cleaner do that work and track "EarliestCommitToArchive". 
> Essentially the value will be either "EarliestCommitToRetain" or earliest 
> savepointed commit earlier than EarliestCommitToRetain. 
> By this way, archival does not need to do any additional polling (savepoint 
> timeline for instance). It can just guard itself against 
> "EarliestCommitToArchive". 
>  
> Impl nuances: 
> Cleaner has to track savepointed instants as well to assist in deducing the 
> "EarliestCommitToArchive". 
>  
> Pros:
> Archival does not need to build FSV and instead rely on clean commit 
> metadata. 
>  
> Irrespective of any approach we take, also thinking if we should unify the 
> enablement of archival and cleaner together. 
> For eg, one cannot run just archival after disabling cleaner. Either both 
> cleaner and archival are enabled or both are disabled. By this way, we do not 
> let archival run w/o a clean. This atleast reduces the chances of such data 
> consistencies, but still one of the above approach has to be fixed. Logically 
> speaking, cleaner and archival should go hand in hand. There is no work for 
> archival if cleaner has not run. Only if cleaner moved the "earliest commit 
> to retain" (and earliest commit to archival), archival will have something to 
> do. If not, very likely its a no-op. so, why not have one config to enable to 
> disable both. 
>  
> More details on how cleaner can track or deduce "EarliestCommitToArchive" 
> (Option C):
> Clean planner will be tracking the current savepointed timestamps as part of 
> every plan. Also, it will find the difference b/w previously tracked 
> savepointed timestamp list and ensure it triggers clean and updates 
> "EarliestCommitToArchive" accordingly. 
>  
> Example illustration:
> Say this is the timeline. 
>    t20(starting of active timeline) ----- t30 (earliest commit to retain) 
> ---- t35(latest)
>  
> User added savepoint to t32. and say there are 5 more commits. 
> next time when cleaner runs: 
>  t20(starting of active timeline) –- t32 (sp) – t35 (earliest commit to 
> retain) ---- t40(latest)
> But cleaner also tracks that "t32" as savepointed timestamp list. 
> And sets "EarliestCommitToArchive" to "t32". 
> After archival 
> t32 (sp) – t35 (earliest commit to retain) ---- t40(latest)
>  
> And if the user deletes the savepoint of interest:
> t32(first entry in active timeline) – t35 (earliest commit to retain) ---- 
> t40(latest)
> and 4 new commits are added. 
> t32(first entry in active timeline) – t35 (earliest commit to retain) ---- 
> t44(latest)
> Cleaner runs:
> Clean planner will poll current timeline for savepointed commits and compare 
> it w/ previously tracked list. 
> And so it deduces that t32 savepoint has been removed. If archival has not be 
> run, t32.deltacommit will be activetimeline for sure. And so clean planner 
> can deduce the partitions for incremental cleaner. 
> Computes the "EarliestCommittoRetain" to say "t39". 
> Updates "EarliestCommitToArchive" to also "t39" since there are no savepoints 
> and archival is good to archive every commit up until t39. 
>  
>  
> In addition to above proposed changes, we also propose to add 
> savepoint.delete to timeline when savepoint is deleted so that we have a 
> audit log of savepoint being deleted. and it will help cleaner to track and 
> update all the resp metadata (earliest commit to archive, etc) 
>  
> Consensus:
> We will go w/ Option C + add savepoint.delete to timeline. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to