[ https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan reassigned HUDI-7779:
-----------------------------------------

    Assignee: sivabalan narayanan

> Guarding archival to not archive unintended commits
> ---------------------------------------------------
>
>                 Key: HUDI-7779
>                 URL: https://issues.apache.org/jira/browse/HUDI-7779
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: archiving
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>
> Archiving commits from the active timeline could lead to data consistency issues on rare occasions. We should come up with proper guards to ensure we do not make such unintended archivals.
>
> The major gap we want to guard against is: if someone disables the cleaner, archival should account for data consistency issues and ensure it bails out.
> We have a base guarding condition, where archival will stop at the earliest commit to retain based on the latest clean commit metadata. But there are a few other scenarios that need to be accounted for.
>
> a. Keeping replace commits aside, let's dive into specifics for regular commits and delta commits.
> Say the user configured clean commits retained to 4 and archival configs to 5 and 6. After t10, the cleaner is supposed to clean up all file versions created at or before t6. Say the cleaner did not run (for whatever reason) for the next 5 commits. Archival will certainly be guarded up to the earliest commit to retain based on the latest clean commit.
> Corner case to consider:
> A savepoint was added at, say, t3 and later removed, and the cleaner was never re-enabled. Even though archival would have stopped at t3 while the savepoint was present, once the savepoint is removed, if archival is executed, it could archive commit t3. This means the file versions tracked at t3 are still not yet cleaned by the cleaner.
> Reasoning:
> We are good here w.r.t. data consistency. Until the cleaner runs next, these older file versions might be exposed to the end-user.
> But time travel queries are not intended for already cleaned up commits, and hence this is not an issue. None of the snapshot, time travel, or incremental queries will run into issues, as they are not supposed to poll for t3.
> At any later point, if the cleaner is re-enabled, it will take care of cleaning up the file versions tracked at commit t3. Just that, for the interim period, some older file versions might still be exposed to readers.
>
> b. The trickier part is when replace commits are involved. Since the replace commit metadata in the active timeline is what ensures the replaced file groups are ignored for reads, before archiving it, the cleaner is expected to clean them up fully. But are there chances this could go wrong?
> Corner case to consider:
> Let's add onto the above scenario, where t3 has a savepoint, and t4 is a replace commit which replaced file groups tracked in t3.
> The cleaner will skip cleaning up files tracked by t3 (due to the presence of the savepoint), but will clean up t4, t5 and t6. So, the earliest commit to retain will be pointing to t6. And say the savepoint for t3 is removed, but the cleaner is still disabled. In this state of the timeline, if archival is executed (since t3's savepoint is removed), archival might archive t3 and t4.rc. This could lead to data duplicates, as both the replaced file groups and the new file groups from t4.rc would be exposed as valid file groups.
>
> In other words, to summarize the different scenarios:
> i. The replaced file group is never cleaned up.
>    - ECTR (earliest commit to retain) is less than this.rc and we are good.
> ii. The replaced file group is cleaned up.
>    - ECTR is greater than this.rc and it is good to archive.
> iii. Tricky: ECTR moved ahead compared to this.rc, but due to a savepoint, full clean up did not happen. After the savepoint is removed, when archival is executed, we should avoid archiving the rc of interest. This is the gap we do not account for as of now.
>
> We have 3 options to solve this.
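> The three scenarios above can be summarized as a guard check before archiving a replace commit. The sketch below is purely illustrative and does not use Hudi's actual classes; the class and method names (ArchivalGuard, canArchiveReplaceCommit) and the idea of passing in the earliest savepoint the cleaner had to skip are all hypothetical, and instants are modeled as lexicographically comparable timestamp strings.

```java
import java.util.Optional;

// Hypothetical sketch of the archival guard for replace commits.
// A replace commit ("this.rc") may be archived only if the earliest commit
// to retain (ECTR) has moved past it AND no savepoint (possibly since
// removed) prevented the cleaner from cleaning its replaced file groups.
public class ArchivalGuard {

    static boolean canArchiveReplaceCommit(String replaceCommitTs,
                                           String earliestCommitToRetain,
                                           Optional<String> earliestUncleanedSavepointTs) {
        // Scenario (i): ECTR <= this.rc, so replaced file groups are not
        // cleaned yet; archival must not proceed.
        if (earliestCommitToRetain.compareTo(replaceCommitTs) <= 0) {
            return false;
        }
        // Scenario (iii): ECTR moved past this.rc, but a savepoint at or
        // before this.rc stopped the cleaner from fully cleaning the
        // replaced file groups; archiving now could expose duplicates.
        if (earliestUncleanedSavepointTs.isPresent()
                && earliestUncleanedSavepointTs.get().compareTo(replaceCommitTs) <= 0) {
            return false;
        }
        // Scenario (ii): ECTR > this.rc and no uncleaned savepoint gap.
        return true;
    }

    public static void main(String[] args) {
        // t4 is the replace commit; ECTR is t6; cleaner skipped t3's files
        // because of a savepoint: archival must hold back (scenario iii).
        System.out.println(canArchiveReplaceCommit("t4", "t6", Optional.of("t3"))); // false
        // No savepoint gap and ECTR is past t4: safe to archive (scenario ii).
        System.out.println(canArchiveReplaceCommit("t4", "t6", Optional.empty())); // true
        // ECTR has not moved past t4 yet (scenario i).
        System.out.println(canArchiveReplaceCommit("t4", "t3", Optional.empty())); // false
    }
}
```

> Note that the base guard today only implements the first check; the second check is the missing piece described in scenario iii.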
>
> Option A:
> Let the savepoint deletion flow take care of cleaning up the files it is tracking.
> Cons:
> The savepoint's responsibility is not removing data files. So, from a single responsibility standpoint, this may not be right. Also, this clean up might need to do what the clean planner actually does, i.e. build the file system view, understand whether a file is supposed to be cleaned up already, and only then clean up those files. For example, if a file group has only one file slice, it should not be cleaned up, and there are other scenarios like this.
>
> Option B:
> Since archival is the one which might cause data consistency issues, why not have archival do the clean up?
> We need to account for concurrent cleans, failure and retry scenarios, etc. Also, we might need to build the file system view and then decide whether something needs to be cleaned up before archiving it.
> Cons:
> Again, the single responsibility rule might be broken. It would be neat if the cleaner takes care of deleting data files and archival only takes care of deleting/archiving timeline files.
>
> Option C:
> Similar to how the cleaner maintains EarliestCommitToRetain, let the cleaner track another metadata field named "EarliestCommitToArchive". Strictly speaking, earliest commit to retain is meant for the incremental cleaner. Archival polls earliest commit to retain and, additionally, savepointed instants to add guard rails. So, why not have the cleaner do that work and track "EarliestCommitToArchive"? Essentially the value will be either "EarliestCommitToRetain" or the earliest savepointed commit earlier than EarliestCommitToRetain.
> This way, archival does not need to do any additional polling (of the savepoint timeline, for instance). It can just guard itself with "EarliestCommitToArchive".
>
> Impl nuances:
> The cleaner has to track savepointed instants as well, to assist in deducing the "EarliestCommitToArchive".
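> The Option C computation reduces to taking the minimum of ECTR and the earliest savepointed instant before it. A minimal sketch, assuming instants are lexicographically comparable timestamp strings; the class and method names here are hypothetical, not Hudi's actual API:

```java
import java.util.List;

// Hypothetical sketch of how the cleaner could derive
// "EarliestCommitToArchive" under Option C: either ECTR itself, or the
// earliest savepointed instant strictly before ECTR, whichever is earlier.
public class EarliestCommitToArchive {

    static String compute(String earliestCommitToRetain, List<String> savepointedInstants) {
        // Pick the earliest savepointed instant before ECTR, if any;
        // otherwise fall back to ECTR itself.
        return savepointedInstants.stream()
                .filter(sp -> sp.compareTo(earliestCommitToRetain) < 0)
                .min(String::compareTo)
                .orElse(earliestCommitToRetain);
    }

    public static void main(String[] args) {
        // Savepoint at t3 holds archival back even though ECTR is t6.
        System.out.println(compute("t6", List.of("t3"))); // t3
        // No savepoints: archival may proceed up to ECTR.
        System.out.println(compute("t6", List.of()));     // t6
        // A savepoint after ECTR does not constrain archival further.
        System.out.println(compute("t6", List.of("t8"))); // t6
    }
}
```

> Because the cleaner persists this value in the clean commit metadata, it stays valid even after the savepoint itself is deleted, which is exactly the gap in scenario iii.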
>
> Pros:
> Archival does not need to build the file system view and can instead rely on clean commit metadata.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)