[ https://issues.apache.org/jira/browse/HBASE-25784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mallikarjun updated HBASE-25784:
--------------------------------
    Summary: Support for Parallel Backups enabling multi tenancy with rsgroups  (was: Support for Parallel Backups enabling multi tenancy)

> Support for Parallel Backups enabling multi tenancy with rsgroups
> -----------------------------------------------------------------
>
>                 Key: HBASE-25784
>                 URL: https://issues.apache.org/jira/browse/HBASE-25784
>             Project: HBase
>          Issue Type: Umbrella
>          Components: backup&restore
>            Reporter: Mallikarjun
>            Assignee: Mallikarjun
>            Priority: Major
>              Labels: backup
>
> *Problem 1:*
> With the current design, incremental and full backups cannot run in
> parallel, which degrades RPOs whenever a full backup takes a long time,
> especially for large tables.
>
> Example:
> Expectation: Say you have a big 10 TB table, your RPO is 60 minutes, and
> you are allowed to ship the remote backup at 800 Mbps. You are allowed to
> take a full backup once a week, and all other backups must be incremental.
>
> Shortcoming: With the current design, backups cannot run in parallel.
> While a full backup is running (which takes roughly 25 hours), no
> incremental backup can be taken, and that breaches your RPO.
>
> *Proposed Solution:* Barring some critical sections, such as modifying the
> state of the backup in the meta tables, the remaining steps can run in
> parallel. Incremental backups should be able to run based on older
> successful full / incremental backups, and the completion time of a backup
> should be used instead of its start time for ordering. I have not worked
> on the full redesign, and will do so if this proposal seems acceptable to
> the community.
>
> *Problem 2:*
> Allowing only one backup at a time breaks down quickly in a multi-tenant
> system. This poses the following problems:
> * Admins will not be able to achieve the required RPOs for their tables,
> because each backup depends on the other tenants present in the system:
> one tenant has no control over other tenants' table sizes, and hence over
> the duration of their backups.
> * The management overhead of setting up the right sequence of backups to
> achieve the required RPOs for different tenants could be very high.
> *Proposed Solution:* Same as the previous proposal.
>
> *Problem 3:*
> Incremental backup works on WALs, and
> org.apache.hadoop.hbase.backup.master.BackupLogCleaner ensures that WALs
> are never cleaned up until the next backup (Full / Incremental) is taken.
> This poses the following problem:
> * WALs can grow unbounded when there are transient problems, such as the
> backup site facing issues, until the next scheduled backup succeeds.
> *Proposed Solution:* I can't think of anything better, but I see this as a
> potential problem. Also, one can force a full backup if the required WAL
> files are missing for any other reason, not necessarily those mentioned
> above.
>
> Proposed Design.
> !https://i.ibb.co/vVV1BTs/Backup-Activity-Diagram.png|width=322,height=414!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
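
The Problem 1 example can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation and rests on assumptions that are not in the issue: decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits/s), a fully utilized link, and one incremental backup due per RPO interval. The function names are illustrative. Under these assumptions the full backup takes about 27.8 hours, the same order as the "roughly 25 hours" quoted above, and 27 consecutive 60-minute RPO windows pass with no incremental backup.

```python
# Back-of-the-envelope check of the Problem 1 example.
# Assumptions (not stated in the issue): decimal units, a fully used link,
# and one incremental backup due per RPO interval.

def full_backup_hours(table_bytes: int, link_bits_per_sec: int) -> float:
    """Time needed to ship table_bytes over the link, in hours."""
    seconds = table_bytes * 8 / link_bits_per_sec
    return seconds / 3600


def blocked_rpo_windows(backup_hours: float, rpo_minutes: int) -> int:
    """Whole RPO intervals that elapse while the full backup is running."""
    return int(backup_hours * 60 // rpo_minutes)


TABLE_BYTES = 10 * 10**12   # 10 TB
LINK_BPS = 800 * 10**6      # 800 Mbps
RPO_MINUTES = 60

hours = full_backup_hours(TABLE_BYTES, LINK_BPS)
blocked = blocked_rpo_windows(hours, RPO_MINUTES)
print(f"full backup: ~{hours:.1f} h, RPO windows blocked: {blocked}")
```

Changing `TABLE_BYTES` or `LINK_BPS` shows how quickly the blocked interval grows with table size, which is the multi-tenant concern raised in Problem 2.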
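
The ordering rule in the first proposed solution (use completion time rather than start time, and let incremental backups build on older successful backups) can be sketched as follows. This is a minimal illustration, not HBase code: `BackupRecord`, its fields, and `latest_completed` are hypothetical names, and the sketch leaves out WAL boundaries, failure states, and the critical sections that update the backup meta tables.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BackupRecord:
    # Hypothetical record; the field names are illustrative, not HBase's schema.
    backup_id: str
    kind: str                    # "FULL" or "INCREMENTAL"
    start_ts: int                # when the backup started
    complete_ts: Optional[int]   # when it finished; None while running or if failed


def latest_completed(history: Sequence[BackupRecord], now_ts: int) -> Optional[BackupRecord]:
    """Return the backup that most recently completed at or before now_ts.

    Backups that are still running are skipped, so a long full backup does
    not stop a new incremental backup from choosing an older base.
    """
    done = [b for b in history if b.complete_ts is not None and b.complete_ts <= now_ts]
    return max(done, key=lambda b: b.complete_ts, default=None)


history = [
    BackupRecord("full_1", "FULL", start_ts=0, complete_ts=100),
    BackupRecord("full_2", "FULL", start_ts=200, complete_ts=None),   # still running
    BackupRecord("inc_1", "INCREMENTAL", start_ts=260, complete_ts=265),
]

base = latest_completed(history, now_ts=300)
print(base.backup_id if base else "no completed backup yet")
```

Here `full_2` is still in flight at time 300, so the next incremental backup is based on `inc_1` rather than waiting for the full backup to finish.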