[ https://issues.apache.org/jira/browse/HUDI-7779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo reassigned HUDI-7779: ------------------------------- Assignee: Lokesh Jain (was: sivabalan narayanan) > Guarding archival to not archive unintended commits > --------------------------------------------------- > > Key: HUDI-7779 > URL: https://issues.apache.org/jira/browse/HUDI-7779 > Project: Apache Hudi > Issue Type: Bug > Components: archiving > Reporter: sivabalan narayanan > Assignee: Lokesh Jain > Priority: Major > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > > Archiving commits from active timeline could lead to data consistency issues > on some rarest of occasions. We should come up with proper guards to ensure > we do not make such unintended archival. > > Major gap which we wanted to guard is: > if someone disabled cleaner, archival should account for data consistency > issues and ensure it bails out. > We have a base guarding condition, where archival will stop at the earliest > commit to retain based on latest clean commit metadata. But there are few > other scenarios that needs to be accounted for. > > a. Keeping aside replace commits, lets dive into specifics for regular > commits and delta commits. > Say user configured clean commits to 4 and archival configs to 5 and 6. after > t10, cleaner is supposed to clean up all file versions created at or before > t6. Say cleaner did not run(for whatever reason for next 5 commits). > Archival will certainly be guarded until earliest commit to retain based > on latest clean commits. > Corner case to consider: > A savepoint was added to say t3 and later removed. and still the cleaner was > never re-enabled. Even though archival would have been stopped at t3 (until > savepoint is present),but once savepoint is removed, if archival is executed, > it could archive commit t3. Which means, file versions tracked at t3 is still > not yet cleaned by the cleaner. > Reasoning: > We are good here wrt data consistency. Up until cleaner runs next time, this > older file versions might be exposed to the end-user. But time travel query > is not intended for already cleaned up commits and hence this is not an > issue. None of snapshot, time travel query or incremental query will run into > issues as they are not supposed to poll for t3. > At any later point, if cleaner is re-enabled, it will take care of cleaning > up file versions tracked at t3 commit. Just that for interim period, some > older file versions might still be exposed to readers. > > b. The more tricky part is when replace commits are involved. Since replace > commit metadata in active timeline is what ensures the replaced file groups > are ignored for reads, before archiving the same, cleaner is expected to > clean them up fully. But are there chances that this could go wrong? > Corner case to consider. Lets add onto above scenario, where t3 has a > savepoint, and t4 is a replace commit which replaced file groups tracked in > t3. > Cleaner will skip cleaning up files tracked by t3(due to the presence of > savepoint), but will clean up t4, t5 and t6. So, earliest commit to retain > will be pointing to t6. And say savepoint for t3 is removed, but cleaner was > disabled. In this state of the timeline, if archival is executed, (since > t3.savepoint is removed), archival might archive t3 and t4.rc. This could > lead to data duplicates as both replaced file groups and new file groups from > t4.rc would be exposed as valid file groups. > > In other words, if we were to summarize the different scenarios: > i. replaced file group is never cleaned up. > - ECTR(Earliest commit to retain) is less than this.rc and we are good. > ii. replaced file group is cleaned up. > - ECTR is > this.rc and is good to archive. > iii. tricky: ECTR moved ahead compared to this.rc, but due to savepoint, full > clean up did not happen. After savepoint is removed, and when archival is > executed, we should avoid archiving the rc of interest. This is the gap we > don't account for as of now. > > We have 3 options to go about to solve this. > Option A: > Let Savepoint deletion flow take care of cleaning up the files its tracking. > cons: > Savepoint's responsibility is not removing any data files. So, from a single > user responsibility rule, this may not be right. Also, this clean up might > need to do what a clean planner might actually be doing. ie. build file > system view, understand if its supposed to be cleaned up already, and then > only clean up the files which are supposed to be cleaned up. For eg, if a > file group has only one file slice, it should not be cleaned up and scenarios > like this. > > Option B: > Since archival is the one which might cause data consistency issues, why not > archival do the clean up. > We need to account for concurrent cleans, failure and retry scenarios etc. > Also, we might need to build the file system view and then take a call > whether something needs to be cleaned up before archiving something. > Cons: > Again, the single user responsibility rule might be broken. Would be neat if > cleaner takes care of deleting data files and archival only takes care of > deleting/archiving timeline files. > > Option C: > Similar to how cleaner maintain EarliestCommitToRetain, let cleaner track > another metadata named "EarliestCommitToArchive". Strictly speaking, earliest > commit to retain is meant for incremental cleaner. Archival polls earliest > commit to retain and additionally savepointed instants to add guard rails. > So, why not cleaner do that work and track "EarliestCommitToArchive". > Essentially the value will be either "EarliestCommitToRetain" or earliest > savepointed commit earlier than EarliestCommitToRetain. > By this way, archival does not need to do any additional polling (savepoint > timeline for instance). It can just guard itself against > "EarliestCommitToArchive". > > Impl nuances: > Cleaner has to track savepointed instants as well to assist in deducing the > "EarliestCommitToArchive". > > Pros: > Archival does not need to build FSV and instead rely on clean commit > metadata. > > Irrespective of any approach we take, also thinking if we should unify the > enablement of archival and cleaner together. > For eg, one cannot run just archival after disabling cleaner. Either both > cleaner and archival are enabled or both are disabled. By this way, we do not > let archival run w/o a clean. This atleast reduces the chances of such data > consistencies, but still one of the above approach has to be fixed. Logically > speaking, cleaner and archival should go hand in hand. There is no work for > archival if cleaner has not run. Only if cleaner moved the "earliest commit > to retain" (and earliest commit to archival), archival will have something to > do. If not, very likely its a no-op. so, why not have one config to enable to > disable both. > > More details on how cleaner can track or deduce "EarliestCommitToArchive" > (Option C): > Clean planner will be tracking the current savepointed timestamps as part of > every plan. Also, it will find the difference b/w previously tracked > savepointed timestamp list and ensure it triggers clean and updates > "EarliestCommitToArchive" accordingly. > > Example illustration: > Say this is the timeline. > t20(starting of active timeline) ----- t30 (earliest commit to retain) > ---- t35(latest) > > User added savepoint to t32. and say there are 5 more commits. > next time when cleaner runs: > t20(starting of active timeline) –- t32 (sp) – t35 (earliest commit to > retain) ---- t40(latest) > But cleaner also tracks that "t32" as savepointed timestamp list. > And sets "EarliestCommitToArchive" to "t32". > After archival > t32 (sp) – t35 (earliest commit to retain) ---- t40(latest) > > And if the user deletes the savepoint of interest: > t32(first entry in active timeline) – t35 (earliest commit to retain) ---- > t40(latest) > and 4 new commits are added. > t32(first entry in active timeline) – t35 (earliest commit to retain) ---- > t44(latest) > Cleaner runs: > Clean planner will poll current timeline for savepointed commits and compare > it w/ previously tracked list. > And so it deduces that t32 savepoint has been removed. If archival has not be > run, t32.deltacommit will be activetimeline for sure. And so clean planner > can deduce the partitions for incremental cleaner. > Computes the "EarliestCommittoRetain" to say "t39". > Updates "EarliestCommitToArchive" to also "t39" since there are no savepoints > and archival is good to archive every commit up until t39. > > *Consensus:* > We will go w/ Option C. > > Re-iterating the proposed fix: > *Cleaner:* > a. Cleaner will track savepointed timestamps during every plan. Compared it > with previous clean plan and deduce if any savepoint is removed. If any > removal is deduced, clean planning will happen for all partitions. > b. Cleaner will also track additional metadata called > "EarliestCommitToNotArchive". Whose value will either refer to first > savepoint if first savepoint < earliest commit to retain. Or it will refer to > "earliestCommitToRetain". > *Archival:* > Archival will guard itself against "EarliestCommitToNotArchive" from latest > completed clean. > > > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)