[
https://issues.apache.org/jira/browse/HBASE-27659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nick Dimiduk resolved HBASE-27659.
----------------------------------
Resolution: Fixed
Pushed to branch-2.6+. Thanks a lot for the contribution [~hgromer]
> Incremental backups should re-use splits from last full backup
> --------------------------------------------------------------
>
> Key: HBASE-27659
> URL: https://issues.apache.org/jira/browse/HBASE-27659
> Project: HBase
> Issue Type: Improvement
> Reporter: Bryan Beaudreault
> Assignee: Hernan Gelaf-Romer
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0-beta-2, 2.6.2
>
>
> All incremental backups require a previous full backup. Full backups use
> snapshots + ExportSnapshot, which includes exporting the SnapshotManifest.
> The SnapshotManifest includes all of the regions in the table during the
> snapshot.
> Incremental backups use WALPlayer to turn new HLogs since last backup into
> HFiles. This uses HFileOutputFormat2, which writes HFiles along the split
> boundaries of the tables at the time that it runs.
> Active clusters may have regions split and merge over time, so the split
> boundaries of incremental backup hfiles may not align to the original full
> backup. This means we need to use MapReduceHFileSplitterJob during restore in
> order to read all of the hfiles for all of the incremental backups and
> re-split them based on the restored table.
> * So let's say a cluster with regions A, B, C does a full backup. Data in
> that backup will be segmented into those 3 regions.
> * Over time the cluster splits and merges and we end up with totally
> different regions D, E, F. An incremental backup occurs, and the data will be
> segmented into those 3 regions.
> * Later the cluster splits those 3 regions so we end up with new regions
> G, H, I, J, K, L. The next incremental backup is segmented into those 6
> regions.
> When we go to restore this cluster, it'll pull the full backup and the 2
> incrementals. The full backup will get restored first, so the new table will
> have regions A, B, C. Then all of the hfiles from the incrementals will be
> combined together and run through MapReduceHFileSplitterJob. This will cause
> all of those data files to get re-partitioned based on the A, B, C regions of
> the newly restored table (based on the full backup).
> This splitting process is expensive on a large cluster. We could skip it
> entirely if incremental backups used the RegionInfos from the original full
> backup SnapshotManifest as the splits for WALPlayer. That way, all
> incremental backups would use the same splits as the original full backup. The
> resulting hfiles could be directly bulkloaded without any split process,
> reducing cost and time of restore.
> One other benefit is that one could use the combination of a full backup +
> all incremental backups as an input to their own mapreduce job. This is
> impossible now because all of the backups will have HFiles with different
> start/end keys which don't align to a common set of splits for combining into
> ClientSideRegionScanner.
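The alignment problem described above can be illustrated with a small, self-contained sketch. It is not HBase code: the class and method names are hypothetical, row keys are modeled as plain strings rather than byte arrays, and the split points "g", "p", "k", "t" are made up for the example. It only shows why a file written along a later set of split boundaries may straddle regions of the full backup, while a file written along the full backup's own boundaries does not.

```java
import java.util.Arrays;

// Hypothetical illustration only; real HBase compares byte[] row keys.
class SplitAlignmentSketch {

    // Index of the region that contains rowKey, given sorted split points.
    // Region 0 covers everything below splits[0]; region i covers
    // [splits[i-1], splits[i]); the last region is unbounded above.
    static int regionIndex(String[] sortedSplits, String rowKey) {
        int pos = Arrays.binarySearch(sortedSplits, rowKey);
        // Exact hit on a split point: the key is the start key of the next region.
        // Miss: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // True if a file whose keys span [firstKey, lastKey] lies inside one region,
    // so it could be loaded into that region without being re-split.
    static boolean fitsOneRegion(String[] sortedSplits, String firstKey, String lastKey) {
        return regionIndex(sortedSplits, firstKey) == regionIndex(sortedSplits, lastKey);
    }

    public static void main(String[] args) {
        // Assumed split points of the full backup (regions A, B, C).
        String[] fullBackupSplits = {"g", "p"};

        // A file written for later region D = [start, "k"), e.g. keys "a".."j".
        // It crosses the "g" boundary of the full backup.
        System.out.println(fitsOneRegion(fullBackupSplits, "a", "j"));

        // A file written along the full backup's own splits, e.g. keys "a".."f".
        System.out.println(fitsOneRegion(fullBackupSplits, "a", "f"));
    }
}
```

Under these assumptions, the first file would have to go through a re-split step before it could be loaded into the restored A, B, C table, while the second would not, which is the saving the description is after.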
--
This message was sent by Atlassian Jira
(v8.20.10#820010)