kgeisz opened a new pull request, #7891:
URL: https://github.com/apache/hbase/pull/7891

   https://issues.apache.org/jira/browse/HBASE-29891
   
   # Key Changes
   
   - For continuous incremental backups, the bulk load output directory for 
WALs-to-HFiles conversions is now a separate directory for each table.
     - Before: `backupRoot/.tmp/backup_X` -> After: 
`backupRoot/.tmp/backup_X/namespace/table`
   - `walToHFiles()` in `IncrementalTableBackupClient.java` now sets 
`hbase.mapreduce.use.multi.table.hfileoutputformat` to `false` when configuring 
`WALPlayer`
   - This same `hbase.mapreduce.use.multi.table.hfileoutputformat` config is 
also set to `false` when replaying WALs for continuous backups.
   - Added logic to `WALPlayer` so it no longer unconditionally uses a multi-table HFile output format, and instead respects the value of `hbase.mapreduce.use.multi.table.hfileoutputformat`
   - Added a unit test for multi-table incremental backup and restore.  The 
test also verifies the integrity of the data after the restore.
   
   # Background
   
   This pull request fixes an issue where running an incremental backup on multiple tables at once results in a failure.  When continuous backup is enabled, an incremental backup first converts the WALs to HFiles.  These HFiles are output to a `.tmp/backup_X` directory (where `X` is the backup ID).  This is known as the "bulk load output directory".  Afterwards, a `distcp` is performed to copy the temporary backup directory to the actual backup directory.
   
   Here is an example file system after the WALs to HFiles conversion and 
before the `distcp`. The `distcp` is supposed to copy the contents of 
`backupRoot/.tmp/backup_INCR02` into `backupRoot/backup_INCR02`:
   
   ```
   backupRoot
   ├── .tmp
   │   └── backup_INCR02
   │       ├── default
   │       │   ├── table1
   │       │   │   └── cf
   │       │   └── table2
   │       │       └── cf
   │       └── namespace1
   │           ├── table3
   │           │   └── cf
   │           └── table4
   │               └── cf
   ├── backup_FULL01
   │   ├── .backup.manifest
   │   ├── default
   │   │   ├── table1
   │   │   │   └── .hbase-snapshot
   │   │   └── table2
   │   │       └── .hbase-snapshot
   │   └── namespace1
   │       ├── table3
   │       │   └── .hbase-snapshot
   │       └── table4
   │           └── .hbase-snapshot
   └── backup_INCR02
       ├── default
       │   ├── table1
       │   │   ├── .tabledesc
       │   │   └── 8d01b
       │   └── table2
       │       ├── .tabledesc
       │       └── 5g03w
       └── namespace1
           ├── table3
           │   ├── .tabledesc
           │   └── 1d42g
           └── table4
               ├── .tabledesc
               └── g49j7
   ```
   
   Incremental backups convert WALs to HFiles one table at a time, even if a backup set contains more than one table.  For each table, `WALPlayer` runs a map-reduce job, and the HFiles are sent to a newly created `backupRoot/.tmp/backup_X` directory.  The MR job for the first table runs without any issues.  The problem occurs during the second MR job: `backupRoot/.tmp/backup_X` already exists at that point, which causes the job to fail with something like:
   
   ```
   2026-02-11T13:54:17,945 ERROR [Time-limited test {}] 
impl.TableBackupClient(232): Unexpected exception in incremental-backup: 
incremental copy backup_1770846846624Output directory 
hdfs://localhost:64120/backupUT/.tmp/backup_1770846846624 already exists
   ```
   
   # Solution
   
   ## Summary
   
   This fix changes the bulk load output directory for continuous incremental 
backups.  Since the `WALPlayer` is run individually for each table, each 
WALs-to-HFiles conversion can be sent to a directory for that specific table.  
An example bulk load output directory for `table1` in the `default` namespace 
would be `backupRoot/.tmp/backup_X/default/table1`.  Then, `table2` would get 
its own bulk load output directory, etc.
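   The new layout can be sketched as a small path helper (a hypothetical illustration with made-up class and method names, not the actual code from this PR):

   ```java
   // Sketch of building a per-table bulk load output directory.
   // Hypothetical helper: the real HBase code derives this from TableName
   // and the backup root; only the resulting layout matches this PR.
   public class BulkLoadPaths {

       /** Returns backupRoot/.tmp/backup_X/namespace/table */
       public static String bulkOutputDir(String backupRoot, String backupId,
                                          String namespace, String table) {
           return String.join("/", backupRoot, ".tmp", backupId, namespace, table);
       }

       public static void main(String[] args) {
           // prints backupRoot/.tmp/backup_X/default/table1
           System.out.println(bulkOutputDir("backupRoot", "backup_X", "default", "table1"));
       }
   }
   ```

   Each `WALPlayer` run then gets a directory that cannot collide with the one created for the previous table.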
   
   ## Issues
   
   Getting the proper bulk load output structure and getting the `distcp` to run successfully took more effort than expected.  Changing the bulk load output directory for each table was simple; the real challenge was getting the HFiles output in the proper format.  If we set the output directory for `table1` to `backupRoot/.tmp/backup_X/default/table1`, the HFiles would instead be output to `backupRoot/.tmp/backup_X/default/table1/default/table1`, with the namespace and table name directories repeated.  This caused the `.tmp` directory structure to look like the following:
   
   ```
   backupRootDir
   ├── .tmp
   │   └── backup_02INCR
   │       └── default
   │           ├── table1
   │           │   ├── _SUCCESS
   │           │   └── default
   │           │       └── table1
   │           │           └── cf
   │           └── table2
   │               ├── _SUCCESS
   │               └── default
   │                   └── table2
   │                       └── cf
   ├── backup_01FULL
   │   ├── .backup.manifest
   │   └── default
   │       ├── table1
   │       │   └── .hbase-snapshot
   │       └── table2
   │           └── .hbase-snapshot
   └── backup_02INCR
       └── default
           ├── table1
           │   ├── .tabledesc
           │   └── 8d01b
           └── table2
               ├── .tabledesc
               └── 5g03w
   ```
   
   Telling `distcp` to copy `backupRoot/.tmp/backup_X` means these "double" 
`default/table1/default/table1` directories will be incorrectly copied into the 
backup directory, like so:
   
   ```
   backup_INCR02
   ├── default
   │   ├── table1
   │   │   ├── .tabledesc
   │   │   ├── 8d01b
   │   │   └── default
   │   │       └── table1
   │   │           └── cf
   ```
   
   Telling `distcp` to just copy the deeper `default/table` directories resulted in a failure from `distcp` due to conflicting source directory names.  This works if there is only one table in each namespace, but fails when a namespace contains multiple tables.  This is because the `distcp` command looks like:
   
   ```
   distcp backupRoot/.tmp/backup_X/default/table1/default backupRoot/.tmp/backup_X/default/table2/default <destination>
   ```
   
   Resulting in an error like:
   
   ```
   2026-03-03T09:20:01,847 ERROR [Time-limited test {}] 
mapreduce.MapReduceBackupCopyJob$BackupDistCp(235): 
org.apache.hadoop.tools.CopyListing$DuplicateFileException: File 
hdfs://localhost:60356/backupUT/.tmp/backup_1772558388312/default/table1/default
 and 
hdfs://localhost:60356/backupUT/.tmp/backup_1772558388312/default/table2/default
 would cause duplicates. Aborting
   ```
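   The failure makes sense once you notice that `distcp` places each source into the destination under its last path component, so two sources that end in the same name collide.  A dependency-free sketch of that duplicate check (a simplified stand-in, not Hadoop's actual `CopyListing` code):

   ```java
   import java.util.*;

   // Illustrates why distcp rejects the per-table sources: each source is
   // placed into the destination under its last path component, so
   // ".../table1/default" and ".../table2/default" both map to "default".
   // This is a simplified stand-in for Hadoop's duplicate-listing check.
   public class DistcpDuplicateCheck {

       static String baseName(String path) {
           return path.substring(path.lastIndexOf('/') + 1);
       }

       /** Returns the basenames that more than one source path maps to. */
       public static Set<String> duplicateBaseNames(List<String> sources) {
           Set<String> seen = new HashSet<>();
           Set<String> dupes = new TreeSet<>();
           for (String src : sources) {
               if (!seen.add(baseName(src))) {
                   dupes.add(baseName(src));
               }
           }
           return dupes;
       }

       public static void main(String[] args) {
           List<String> sources = List.of(
               "backupRoot/.tmp/backup_X/default/table1/default",
               "backupRoot/.tmp/backup_X/default/table2/default");
           System.out.println(duplicateBaseNames(sources)); // prints [default]
       }
   }
   ```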
   
   Copying just the deeper table name directories results in an improper 
directory structure in the destination:
   
   ```
   backup_INCR02
   ├── default
   │   ├── table1
   │   │   ├── .tabledesc
   │   │   └── 8d01b
   │   └── table2
   │       ├── .tabledesc
   │       └── 5g03w
   ├── table1
   └── table2
   ```
   
   Using `-update` in the `distcp` command did not get the desired result 
either.
   
   ## Potential Workaround
   
   A workaround for the issues mentioned above would be to run a separate `distcp` for each namespace.  The source directories within each copy would then be unique table names.  However, this means one `distcp` invocation per namespace in the backup set.
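   Roughly, that workaround would group the backup set's fully qualified table names by namespace and issue one copy per group (a sketch with made-up names, not code from this PR):

   ```java
   import java.util.*;

   // Sketch of the rejected workaround: group fully qualified table names
   // by namespace so each distcp invocation has unique source basenames.
   // Table names here use the "namespace:table" form; tables in the
   // default namespace may omit the prefix.
   public class PerNamespaceCopy {

       public static Map<String, List<String>> groupByNamespace(List<String> tables) {
           Map<String, List<String>> byNs = new TreeMap<>();
           for (String t : tables) {
               int i = t.indexOf(':');
               String ns = (i < 0) ? "default" : t.substring(0, i);
               byNs.computeIfAbsent(ns, k -> new ArrayList<>()).add(t);
           }
           return byNs;
       }

       public static void main(String[] args) {
           List<String> tables = List.of("table1", "table2", "namespace1:table3", "namespace1:table4");
           // One distcp per map entry: {default=[table1, table2], namespace1=[...]}
           System.out.println(groupByNamespace(tables));
       }
   }
   ```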
   
   ## The Actual Solution
   
   We want the WALs-to-HFiles conversion to output something like this in `.tmp`:
   
   ```
   backupRootDir
   ├── .tmp
   │   └── backup_02INCR
   │       ├── default
   │       │   ├── table1
   │       │   │   └── cf
   │       │   └── table2
   │       │       └── cf
   │       ├── namespace1
   │       │   ├── table3
   │       │   │   └── cf
   │       │   └── table4
   │       │       └── cf
   │       └── namespace2
   │           ├── table5
   │           │   └── cf
   │           └── table6
   │               └── cf
   ```
   
   In order to get rid of the "double `namespace/tableName`" directory 
structure, we have to change how the HFiles are output.  We want to keep our 
bulk load output directory as `backupRoot/.tmp/backup_X/namespace/table` and 
have just the `cf` column family directory sent there, not `namespace/table/cf`.
   
   This is done by setting the `hbase.mapreduce.use.multi.table.hfileoutputformat` config key to `false` for continuous incremental backups.  The problem here is that `WALPlayer.java` always used `MultiTableHFileOutputFormat`, which implicitly sets `hbase.mapreduce.use.multi.table.hfileoutputformat` to `true`.  That is why the logic in `WALPlayer.java` had to change.
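   The `WALPlayer` change can be pictured as choosing the output format from the flag rather than hard-coding it (a simplified, dependency-free stand-in; the real code wires `HFileOutputFormat2` or `MultiTableHFileOutputFormat` onto the MapReduce job, while this sketch only returns the class name as a string):

   ```java
   import java.util.*;

   // Simplified stand-in for the WALPlayer change: pick the HFile output
   // format from the config flag instead of always using the multi-table
   // one. Defaulting to the multi-table format mirrors the old behavior.
   public class OutputFormatChoice {

       static final String MULTI_TABLE_CONF = "hbase.mapreduce.use.multi.table.hfileoutputformat";

       public static String chooseOutputFormat(Map<String, String> conf) {
           boolean multi = Boolean.parseBoolean(conf.getOrDefault(MULTI_TABLE_CONF, "true"));
           return multi ? "MultiTableHFileOutputFormat" : "HFileOutputFormat2";
       }

       public static void main(String[] args) {
           Map<String, String> conf = new HashMap<>();
           conf.put(MULTI_TABLE_CONF, "false"); // continuous incremental backups
           System.out.println(chooseOutputFormat(conf)); // prints HFileOutputFormat2
       }
   }
   ```

   With the single-table format, each job writes only the `cf` column family directory into its per-table output directory, which is exactly the layout shown above.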
   
   Also, this `hfileoutputformat` config key needs to be `false` when replaying 
the WALs during a restore.  Otherwise, a failure occurs like the following:
   
   ```
   2026-03-05T18:32:55,042 WARN  [Thread-1018 {}] 
mapred.LocalJobRunner$Job(590): job_local1580221296_0005
   java.lang.Exception: java.lang.IllegalArgumentException: Invalid format for 
composite key [rowLoad0]. Cannot extract tablename and suffix from key
       at 
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492) 
~[hadoop-mapreduce-client-common-3.4.2.jar:?]
       at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:559) 
~[hadoop-mapreduce-client-common-3.4.2.jar:?]
   Caused by: java.lang.IllegalArgumentException: Invalid format for composite 
key [rowLoad0]. Cannot extract tablename and suffix from key
   ```
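   This error comes from the multi-table code path rejecting a plain row key: it expects every key to be a composite of a table name plus a row-key suffix, so a bare key like `rowLoad0` cannot be split.  A hedged illustration of that failure mode (the separator and method names here are assumptions for illustration, not the real implementation):

   ```java
   // Illustrates the failure mode only: when the multi-table output format
   // is in effect, every key must split into (table, suffix) at a
   // separator. A plain row key has no separator, so the split fails.
   public class CompositeKeySketch {

       static final char SEPARATOR = ';'; // assumed separator, for illustration only

       /** Returns {table, suffix}, or throws if the key is not composite. */
       public static String[] splitCompositeKey(String key) {
           int i = key.indexOf(SEPARATOR);
           if (i < 0) {
               throw new IllegalArgumentException(
                   "Invalid format for composite key [" + key + "]."
                   + " Cannot extract tablename and suffix from key");
           }
           return new String[] { key.substring(0, i), key.substring(i + 1) };
       }

       public static void main(String[] args) {
           System.out.println(java.util.Arrays.toString(splitCompositeKey("table1;row0")));
           try {
               splitCompositeKey("rowLoad0"); // plain row key, as in the log above
           } catch (IllegalArgumentException e) {
               System.out.println(e.getMessage());
           }
       }
   }
   ```

   Setting the config key to `false` during restore keeps the replay on the single-table path, where plain row keys are expected.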

