[ 
https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-7779:
-----------------------------------------

    Assignee: sivabalan narayanan

> Guarding archival to not archive unintended commits
> ---------------------------------------------------
>
>                 Key: HUDI-7779
>                 URL: https://issues.apache.org/jira/browse/HUDI-7779
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: archiving
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>
> Archiving commits from active timeline could lead to data consistency issues 
> on some rarest of occasions. We should come up with proper guards to ensure 
> we do not make such unintended archival. 
>  
> Major gap which we wanted to guard is:
> if someone disabled cleaner, archival should account for data consistency 
> issues and ensure it bails out.
> We have a base guarding condition, where archival will stop at the earliest 
> commit to retain based on latest clean commit metadata. But there are few 
> other scenarios that needs to be accounted for. 
>  
> a. Keeping aside replace commits, lets dive into specifics for regular 
> commits and delta commits.
> Say user configured clean commits to 4 and archival configs to 5 and 6. after 
> t10, cleaner is supposed to clean up all file versions created at or before 
> t6. Say cleaner did not run(for whatever reason for next 5 commits). 
>     Archival will certainly be guarded until earliest commit to retain based 
> on latest clean commits. 
> Corner case to consider: 
> A savepoint was added to say t3 and later removed. and still the cleaner was 
> never re-enabled. Even though archival would have been stopped at t3 (until 
> savepoint is present),but once savepoint is removed, if archival is executed, 
> it could archive commit t3. Which means, file versions tracked at t3 is still 
> not yet cleaned by the cleaner. 
> Reasoning: 
> We are good here wrt data consistency. Up until cleaner runs next time, this 
> older file versions might be exposed to the end-user. But time travel query 
> is not intended for already cleaned up commits and hence this is not an 
> issue. None of snapshot, time travel query or incremental query will run into 
> issues as they are not supposed to poll for t3. 
> At any later point, if cleaner is re-enabled, it will take care of cleaning 
> up file versions tracked at t3 commit. Just that for interim period, some 
> older file versions might still be exposed to readers. 
>  
> b. The more tricky part is when replace commits are involved. Since replace 
> commit metadata in active timeline is what ensures the replaced file groups 
> are ignored for reads, before archiving the same, cleaner is expected to 
> clean them up fully. But are there chances that this could go wrong? 
> Corner case to consider. Lets add onto above scenario, where t3 has a 
> savepoint, and t4 is a replace commit which replaced file groups tracked in 
> t3. 
> Cleaner will skip cleaning up files tracked by t3(due to the presence of 
> savepoint), but will clean up t4, t5 and t6. So, earliest commit to retain 
> will be pointing to t6. And say savepoint for t3 is removed, but cleaner was 
> disabled. In this state of the timeline, if archival is executed, (since 
> t3.savepoint is removed), archival might archive t3 and t4.rc.  This could 
> lead to data duplicates as both replaced file groups and new file groups from 
> t4.rc would be exposed as valid file groups. 
>  
> In other words, if we were to summarize the different scenarios: 
> i. replaced file group is never cleaned up. 
>     - ECTR(Earliest commit to retain) is less than this.rc and we are good. 
> ii. replaced file group is cleaned up. 
>     - ECTR is > this.rc and is good to archive.
> iii. tricky: ECTR moved ahead compared to this.rc, but due to savepoint, full 
> clean up did not happen.  After savepoint is removed, and when archival is 
> executed, we should avoid archiving the rc of interest. This is the gap we 
> don't account for as of now.
>  
> We have 3 options to go about to solve this.
> Option A: 
> Let Savepoint deletion flow take care of cleaning up the files its tracking. 
> cons:
> Savepoint's responsibility is not removing any data files. So, from a single 
> user responsibility rule, this may not be right. Also, this clean up might 
> need to do what a clean planner might actually be doing. ie. build file 
> system view, understand if its supposed to be cleaned up already, and then 
> only clean up the files which are supposed to be cleaned up. For eg, if a 
> file group has only one file slice, it should not be cleaned up and scenarios 
> like this. 
>  
> Option B:
> Since archival is the one which might cause data consistency issues, why not 
> archival do the clean up. 
> We need to account for concurrent cleans, failure and retry scenarios etc. 
> Also, we might need to build the file system view and then take a call 
> whether something needs to be cleaned up before archiving something. 
> Cons:
> Again, the single user responsibility rule might be broken. Would be neat if 
> cleaner takes care of deleting data files and archival only takes care of 
> deleting/archiving timeline files. 
>  
> Option C:
> Similar to how cleaner maintain EarliestCommitToRetain, let cleaner track 
> another metadata named "EarliestCommitToArchive". Strictly speaking, earliest 
> commit to retain is meant for incremental cleaner. Archival polls earliest 
> commit to retain and additionally savepointed instants to add guard rails. 
> So, why not cleaner do that work and track "EarliestCommitToArchive". 
> Essentially the value will be either "EarliestCommitToRetain" or earliest 
> savepointed commit earlier than EarliestCommitToRetain. 
> By this way, archival does not need to do any additional polling (savepoint 
> timeline for instance). It can just guard itself against 
> "EarliestCommitToArchive". 
>  
> Impl nuances: 
> Cleaner has to track savepointed instants as well to assist in deducing the 
> "EarliestCommitToArchive". 
>  
> Pros:
> Archival does not need to build FSV and instead rely on clean commit metadata
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to