kgeisz opened a new pull request, #7891:
URL: https://github.com/apache/hbase/pull/7891
https://issues.apache.org/jira/browse/HBASE-29891

# Key Changes

- For continuous incremental backups, the bulk load output directory for WALs-to-HFiles conversions is now a separate directory for each table.
  - Before: `backupRoot/.tmp/backup_X` -> After: `backupRoot/.tmp/backup_X/namespace/table`
- `walToHFiles()` in `IncrementalTableBackupClient.java` now sets `hbase.mapreduce.use.multi.table.hfileoutputformat` to `false` when configuring `WALPlayer`.
- The same `hbase.mapreduce.use.multi.table.hfileoutputformat` config is also set to `false` when replaying WALs for continuous backups.
- Added logic to `WALPlayer` so it does not always use a multi-table HFile output format. Previously it used the multi-table format regardless of the value of `hbase.mapreduce.use.multi.table.hfileoutputformat`.
- Added a unit test for multi-table incremental backup and restore. The test also verifies the integrity of the data after the restore.

# Background

This pull request fixes an issue where running an incremental backup on multiple tables at once fails.

When continuous backup is enabled, an incremental backup first converts the WALs to HFiles. These HFiles are written to a `.tmp/backup_X` directory (where `X` is the backup ID), known as the "bulk load output directory". Afterwards, a `distcp` copies the temporary backup directory to the actual backup directory. Here is an example file system layout after the WALs-to-HFiles conversion and before the `distcp`.
The `distcp` is supposed to copy the contents of `backupRoot/.tmp/backup_INCR02` into `backupRoot/backup_INCR02`:

```
backupRoot
├── .tmp
│   └── backup_INCR02
│       ├── default
│       │   ├── table1
│       │   │   └── cf
│       │   └── table2
│       │       └── cf
│       └── namespace1
│           ├── table3
│           │   └── cf
│           └── table4
│               └── cf
├── backup_FULL01
│   ├── .backup.manifest
│   ├── default
│   │   ├── table1
│   │   │   └── .hbase-snapshot
│   │   └── table2
│   │       └── .hbase-snapshot
│   └── namespace1
│       ├── table3
│       │   └── .hbase-snapshot
│       └── table4
│           └── .hbase-snapshot
└── backup_INCR02
    ├── default
    │   ├── table1
    │   │   ├── .tabledesc
    │   │   └── 8d01b
    │   └── table2
    │       ├── .tabledesc
    │       └── 5g03w
    └── namespace1
        ├── table3
        │   ├── .tabledesc
        │   └── 1d42g
        └── table4
            ├── .tabledesc
            └── g49j7
```

Incremental backups convert WALs to HFiles one table at a time, even if a backup set contains more than one table. For each conversion, `WALPlayer` runs a map-reduce job, and the HFiles are sent to a newly created `backupRoot/.tmp/backup_X` directory. The MR job for the first table runs without any issues. The problem occurs during the second MR job: `backupRoot/.tmp/backup_X` already exists, which causes the job to fail with something like:

```
2026-02-11T13:54:17,945 ERROR [Time-limited test {}] impl.TableBackupClient(232): Unexpected exception in incremental-backup: incremental copy backup_1770846846624Output directory hdfs://localhost:64120/backupUT/.tmp/backup_1770846846624 already exists
```

# Solution

## Summary

This fix changes the bulk load output directory for continuous incremental backups. Since `WALPlayer` is run individually for each table, each WALs-to-HFiles conversion can be sent to a directory for that specific table. For example, the bulk load output directory for `table1` in the `default` namespace would be `backupRoot/.tmp/backup_X/default/table1`. Then `table2` gets its own bulk load output directory, and so on.
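The per-table directory scheme above can be sketched as follows. This is a minimal illustration only; `bulkLoadDir` is a hypothetical helper, not the actual `IncrementalTableBackupClient` code.

```java
// Illustrative sketch only: shows the per-table bulk load output directory
// scheme described in the Summary. Not the real HBase API.
public class PerTableBulkLoadDirs {
    /** Builds backupRoot/.tmp/backup_X/namespace/table for one table. */
    public static String bulkLoadDir(String backupRoot, String backupId,
                                     String namespace, String table) {
        return backupRoot + "/.tmp/" + backupId + "/" + namespace + "/" + table;
    }

    public static void main(String[] args) {
        // One WALPlayer run per table, each with its own output directory,
        // so the second job no longer fails on an already-existing directory.
        String[][] tables = { { "default", "table1" }, { "default", "table2" },
                              { "namespace1", "table3" } };
        for (String[] t : tables) {
            System.out.println(bulkLoadDir("/backupRoot", "backup_INCR02", t[0], t[1]));
        }
    }
}
```

Because every table gets a distinct directory, the "output directory already exists" failure from the Background section cannot recur for subsequent tables in the same backup.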
## Issues

Getting the proper bulk load output and getting the `distcp` to run successfully took more effort than expected. Changing the bulk load output directory for each table was simple; the real challenge was getting the HFiles output in the proper format.

If we set the output directory for `table1` to `backupRoot/.tmp/backup_X/default/table1`, the HFiles would instead be output to `backupRoot/.tmp/backup_X/default/table1/default/table1`, with the namespace and table name directories repeated. This caused the `.tmp` directory structure to look like the following:

```
backupRootDir
├── .tmp
│   └── backup_02INCR
│       └── default
│           ├── table1
│           │   ├── _SUCCESS
│           │   └── default
│           │       └── table1
│           │           └── cf
│           └── table2
│               ├── _SUCCESS
│               └── default
│                   └── table2
│                       └── cf
├── backup_01FULL
│   ├── .backup.manifest
│   └── default
│       ├── table1
│       │   └── .hbase-snapshot
│       └── table2
│           └── .hbase-snapshot
└── backup_02INCR
    └── default
        ├── table1
        │   ├── .tabledesc
        │   └── 8d01b
        └── table2
            ├── .tabledesc
            └── 5g03w
```

Telling `distcp` to copy `backupRoot/.tmp/backup_X` means these "double" `default/table1/default/table1` directories are incorrectly copied into the backup directory, like so:

```
backup_INCR02
├── default
│   ├── table1
│   │   ├── .tabledesc
│   │   ├── 8d01b
│   │   └── default
│   │       └── table1
│   │           └── cf
```

Telling `distcp` to copy only the deeper `default/table` directories resulted in a `distcp` failure due to conflicting source directory names. This works if there is only one table in each namespace, but not if a namespace contains multiple tables.
This is because the `distcp` looks as follows:

```
distcp backupRoot/.tmp/backup_X/default/table1/default \
       backupRoot/.tmp/backup_X/default/table2/default \
       <destination>
```

Resulting in an error like:

```
2026-03-03T09:20:01,847 ERROR [Time-limited test {}] mapreduce.MapReduceBackupCopyJob$BackupDistCp(235): org.apache.hadoop.tools.CopyListing$DuplicateFileException: File hdfs://localhost:60356/backupUT/.tmp/backup_1772558388312/default/table1/default and hdfs://localhost:60356/backupUT/.tmp/backup_1772558388312/default/table2/default would cause duplicates. Aborting
```

Copying just the deeper table name directories results in an improper directory structure in the destination:

```
backup_INCR02
├── default
│   ├── table1
│   │   ├── .tabledesc
│   │   └── 8d01b
│   └── table2
│       ├── .tabledesc
│       └── 5g03w
├── table1
└── table2
```

Using `-update` in the `distcp` command did not get the desired result either.

## Potential Workaround

A workaround for the issues above would be to run the `distcp` once per namespace, so the source directories would be uniquely named table directories. However, this means a separate `distcp` would need to be performed for each namespace in the backup set.

## The Actual Solution

We want the WALs-to-HFiles conversion to output to something like this in `.tmp`:

```
backupRootDir
├── .tmp
│   └── backup_02INCR
│       ├── default
│       │   ├── table1
│       │   │   └── cf
│       │   └── table2
│       │       └── cf
│       ├── namespace1
│       │   ├── table3
│       │   │   └── cf
│       │   └── table4
│       │       └── cf
│       └── namespace2
│           ├── table5
│           │   └── cf
│           └── table6
│               └── cf
```

To get rid of the "double `namespace/tableName`" directory structure, we have to change how the HFiles are output. We want to keep the bulk load output directory as `backupRoot/.tmp/backup_X/namespace/table` and have just the `cf` column family directory written there, not `namespace/table/cf`. This is done by setting the `hbase.mapreduce.use.multi.table.hfileoutputformat` config key to `false` for continuous incremental backups.
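The effect of this flag can be sketched as follows. The path-building below is a deliberately simplified stand-in for what a multi-table HFile output format does, not the actual HBase implementation.

```java
// Simplified illustration, not real HBase code: a multi-table output format
// appends namespace/table under the job output directory itself, which is
// what produced the "double" default/table1/default/table1 paths above.
public class MultiTableFlagEffect {
    public static String hfileDir(String jobOutputDir, boolean multiTableFormat,
                                  String namespace, String table, String family) {
        String base = jobOutputDir;
        if (multiTableFormat) {
            // The format adds namespace/table on its own.
            base = base + "/" + namespace + "/" + table;
        }
        return base + "/" + family;
    }

    public static void main(String[] args) {
        String perTableDir = "/backupRoot/.tmp/backup_X/default/table1";
        // Flag effectively true (old behavior): doubled namespace/table.
        System.out.println(hfileDir(perTableDir, true, "default", "table1", "cf"));
        // Flag false (this PR, for continuous incremental backups): cf lands
        // directly under the per-table bulk load output directory.
        System.out.println(hfileDir(perTableDir, false, "default", "table1", "cf"));
    }
}
```

With the flag off, the per-table bulk load output directory and the directory the format writes into coincide, so `distcp` can copy `backupRoot/.tmp/backup_X` into `backupRoot/backup_X` as one operation.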
The problem here is that `WALPlayer.java` always used `MultiTableHFileOutputFormat`, which implicitly sets `hbase.mapreduce.use.multi.table.hfileoutputformat` to `true`. That is why the logic in `WALPlayer.java` had to change.

This `hfileoutputformat` config key also needs to be `false` when replaying the WALs during a restore. Otherwise, a failure occurs like the following:

```
2026-03-05T18:32:55,042 WARN [Thread-1018 {}] mapred.LocalJobRunner$Job(590): job_local1580221296_0005
java.lang.Exception: java.lang.IllegalArgumentException: Invalid format for composite key [rowLoad0]. Cannot extract tablename and suffix from key
	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492) ~[hadoop-mapreduce-client-common-3.4.2.jar:?]
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:559) ~[hadoop-mapreduce-client-common-3.4.2.jar:?]
Caused by: java.lang.IllegalArgumentException: Invalid format for composite key [rowLoad0]. Cannot extract tablename and suffix from key
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
