[ https://issues.apache.org/jira/browse/HUDI-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rajesh Mahindra updated HUDI-2432:
----------------------------------
    Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Fix restore by adding a requested instant and restore plan
> ----------------------------------------------------------
>
>                 Key: HUDI-2432
>                 URL: https://issues.apache.org/jira/browse/HUDI-2432
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>
> Fix restore by adding a requested instant and restore plan
>
> Trying to see if we really need a plan. Dumping my thoughts here.
> Restore internally converts to N rollbacks. We fetch active instants in
> reverse order from the timeline and trigger rollbacks one by one. We already
> have a patch fixing rollback to add a rollback plan in the rollback.requested
> meta file. So, walking through failure scenarios.
>
> With restore, individual rollbacks are not published to the timeline. So, if
> restore fails midway, then in the 2nd attempt only a subset of the rollbacks
> (those executed during the 2nd attempt) will be applied to the metadata
> table. So, we need a plan for restore as well.
> But with our enhancement to rollback to publish a plan, rollback.requested
> can't be skipped and we have to publish it to the timeline. So, here is what
> will happen w/o a restore plan.
>
> start restore
>     rollback commit N
>         rollback.requested for commit N // plan
>         execute rollback, but do not publish to timeline. So this will not
>         get applied to the metadata table.
>     rollback commit N-1
>         rollback.requested for commit N-1 // plan
>         execute rollback, but do not publish to timeline. Again, this will
>         not get applied to the metadata table.
>     .
>     commit restore and publish. This will get applied to the metadata table.
> Once we are done committing the restore, we can remove all
> rollback.requested files if needed.
>
> Failure scenarios:
> Suppose we fail after 2 rollbacks. On re-attempt, we will process only the
> remaining commits, since the active timeline may not report commit N and
> commit N-1 as active. So, we can do something like below w/ a restore plan.
>
> 1. Start restore.
> 2. Schedule rollback for all of them. Serialize all commit instants that
>    need to be rolled back along with the rollback plan. // by now, we would
>    have created a rollback.requested meta file for every commit that needs
>    to be rolled back.
> 3. Now execute the rollbacks one by one. // do not publish to the timeline
>    once done. Also, changes should not be applied to the metadata table.
> 4. Collect rollback commit metadata from all individual rollbacks and create
>    the restore commit metadata. There could be some commits which were
>    already rolled back, and for those, we need to manually create rollback
>    metadata based on the rollback plan. More details in the next para.
>    Commit the restore and publish. Only this will get applied to the
>    metadata table (which in turn will unwrap the individual rollback
>    metadata and apply it to the metadata table).
>
> Failures:
> If we fail after the 2nd rollback:
> On the 2nd attempt, we will look at the restore plan for all commits that
> need to be rolled back. We can't really roll back the first 2 since they are
> already rolled back, so for those we will manually create rollback metadata
> from the rollback.requested meta file. For the rest, we will follow the
> regular flow of executing the actual rollback and collecting rollback
> metadata. Once complete, we will serialize all this info in the restore
> metadata, which gets applied to the metadata table.
>
> Alternatives: Since restore is anyway a destructive operation and users are
> advised to stop all processes, we do have the option to clean up the
> metadata table and re-bootstrap it completely once restore is complete.
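
The restore-plan flow and the retry after a midway failure described above can be sketched roughly as follows. This is a minimal in-memory Python model, not Hudi code: `Table`, `schedule_restore`, `execute_restore`, `SimulatedCrash`, and the `source` field are all hypothetical names chosen for illustration, and the "rollback metadata" is assumed to be just the list of files named in the rollback plan.

```python
from dataclasses import dataclass, field


class SimulatedCrash(Exception):
    """Raised to mimic the restore process dying midway."""


@dataclass
class Table:
    # Hypothetical stand-in for a table's timeline and meta files.
    active_commits: list            # instants currently active, oldest first
    files: dict                     # commit instant -> data files it wrote
    rollback_requested: dict = field(default_factory=dict)  # commit -> rollback plan
    restore_plan: list = None       # serialized restore plan, if one exists
    metadata_table_rollbacks: list = field(default_factory=list)


def schedule_restore(table, restore_to):
    """Steps 1-2: serialize the restore plan and one rollback plan per commit."""
    if table.restore_plan is not None:
        return  # re-attempt: reuse the plan instead of re-reading the timeline
    to_roll_back = [c for c in reversed(table.active_commits) if c > restore_to]
    for commit in to_roll_back:
        table.rollback_requested[commit] = list(table.files[commit])
    table.restore_plan = to_roll_back


def execute_restore(table, crash_after=None):
    """Steps 3-4: run rollbacks one by one, then publish a single restore commit."""
    collected = {}
    executed = 0
    for commit in table.restore_plan:
        plan = table.rollback_requested[commit]
        if commit in table.active_commits:
            if crash_after is not None and executed == crash_after:
                raise SimulatedCrash(commit)
            # Regular flow: execute the rollback, but publish nothing yet.
            table.files.pop(commit)
            table.active_commits.remove(commit)
            executed += 1
            collected[commit] = {"deleted_files": plan, "source": "executed"}
        else:
            # Already rolled back in an earlier attempt: rebuild the rollback
            # metadata from the rollback.requested plan.
            collected[commit] = {"deleted_files": plan, "source": "from_plan"}
    # Only the restore commit is applied to the metadata table; it carries
    # the rollback metadata of every commit in the plan.
    table.metadata_table_rollbacks = [collected[c] for c in table.restore_plan]
    table.rollback_requested.clear()  # optional cleanup after the restore commit
    return collected
```

In this sketch, a first call to `execute_restore` with `crash_after=2` fails after two rollbacks. A second call then walks the same serialized plan, so the two commits that are no longer active still contribute rollback metadata to the restore commit.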