[ 
https://issues.apache.org/jira/browse/HBASE-29744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hernan Gelaf-Romer updated HBASE-29744:
---------------------------------------
    Description: 
Incremental backups can fail with a FileNotFoundException when trying to 
process Write-Ahead Log (WAL) files from RegionServers that were added to the 
cluster after the last successful backup.

The issue occurs in {{BackupLogCleaner.canDeleteFile()}}, which checks timestamp 
boundaries (stored in the backup system table) to determine if WAL files are 
safe to delete. When no boundary exists for a RegionServer address, the cleaner 
incorrectly assumes that the WALs can safely be deleted and returns true. This 
situation arises when a new RegionServer is added between backups. The new 
server generates WAL files for tables, but since a backup has not yet 
completed, no timestamp boundary for this server is recorded. As a result, the 
cleaner may delete these WAL files before the next backup can process them, 
leading to a FileNotFoundException.
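
To make the boundary check above concrete, here is a minimal sketch of the decision with the missing-boundary case handled conservatively. It is not the actual {{BackupLogCleaner}} code: the class name, method signature, map-based boundary store, and the "at or before the boundary" comparison are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and signature are NOT taken from HBase.
final class WalCleanerSketch {

    /**
     * boundaries: RegionServer address -> newest WAL timestamp already
     * covered by a completed backup (assumed representation).
     */
    static boolean canDeleteFile(Map<String, Long> boundaries,
                                 String serverAddress,
                                 long walTimestamp) {
        Long boundary = boundaries.get(serverAddress);
        if (boundary == null) {
            // No boundary recorded: no backup has covered this server yet,
            // so keep the WAL instead of assuming it is safe to delete.
            return false;
        }
        // Assumption: a WAL at or before the boundary has been backed up.
        return walTimestamp <= boundary;
    }

    public static void main(String[] args) {
        Map<String, Long> boundaries = new HashMap<>();
        boundaries.put("rs1.example.com:16020", 1_000L);

        // Known server, WAL older than the boundary.
        System.out.println(canDeleteFile(boundaries, "rs1.example.com:16020", 900L));
        // Known server, WAL newer than the boundary.
        System.out.println(canDeleteFile(boundaries, "rs1.example.com:16020", 1_100L));
        // Server added after the last backup: no boundary exists.
        System.out.println(canDeleteFile(boundaries, "rs2.example.com:16020", 900L));
    }
}
```

The behavior described in this issue corresponds to returning true in the null branch; the sketch returns false there so that WALs from a newly added RegionServer are retained until a backup records a boundary for it.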

Additionally, I believe this can lead to data loss.

When an incremental backup runs, 
{{IncrementalBackupManager.getLogFilesForNewBackup()}} scans the filesystem and 
builds a list of WAL files to back up, including files from newly added 
RegionServers. Before the backup processes these files, {{BackupLogCleaner}} 
runs concurrently and checks timestamp boundaries to determine which files can 
be safely deleted. When it finds no timestamp boundary for a new server, it 
incorrectly assumes the WALs are safe to delete and removes them.

When the backup later tries to process the deleted files, 
{{IncrementalTableBackupClient.filterMissingFiles()}} is called to validate the 
file list. For each missing file, this method only logs a warning message and 
silently excludes it from the backup. The backup then continues and completes 
with a successful status, even though data from the deleted WAL files was never 
backed up.

This results in permanent data loss with no failure indication: the backup 
appears successful, the source WAL files are permanently deleted, and the only 
evidence is a warning message in the logs that may go unnoticed. The data from 
those WAL files cannot be recovered because both the backup and the source are 
missing that data.
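
The difference between silently skipping missing files and failing the backup can be sketched as follows. This is an illustration only, not the actual {{IncrementalTableBackupClient}} code and not a fix proposed in this issue: the class and method names are hypothetical, and it uses the local filesystem through java.nio rather than the Hadoop FileSystem API.

```java
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names are NOT taken from HBase.
final class MissingWalCheckSketch {

    /** Lenient variant: warn about each missing file and drop it from the list. */
    static List<Path> filterMissingLenient(List<Path> walFiles) {
        List<Path> present = new ArrayList<>();
        for (Path file : walFiles) {
            if (Files.exists(file)) {
                present.add(file);
            } else {
                // Only a log line records the loss; the caller still succeeds.
                System.err.println("WARN: WAL file missing, skipping: " + file);
            }
        }
        return present;
    }

    /** Strict variant: refuse to continue if any expected file is missing. */
    static List<Path> requireAllPresent(List<Path> walFiles)
            throws FileNotFoundException {
        List<Path> missing = new ArrayList<>();
        for (Path file : walFiles) {
            if (!Files.exists(file)) {
                missing.add(file);
            }
        }
        if (!missing.isEmpty()) {
            throw new FileNotFoundException("Missing WAL files: " + missing);
        }
        return walFiles;
    }
}
```

With the lenient variant the caller cannot distinguish a complete backup from one that skipped files, which matches the silent data loss described above; the strict variant surfaces the problem as a failure instead.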

  was:
Incremental backups can fail with a FileNotFoundException when trying to 
process Write-Ahead Log (WAL) files from RegionServers that were added to the 
cluster after the last successful backup.

The issue occurs in BackupLogCleaner.canDeleteFile(), which checks timestamp 
boundaries (stored in the backup system table) to determine if WAL files are 
safe to delete. When no boundary exists for a RegionServer address, the cleaner 
incorrectly assumes that the WALs can safely be deleted and returns true. This 
situation arises when a new RegionServer is added between backups. The new 
server generates WAL files for tables, but since a backup has not yet 
completed, no timestamp boundary for this server is recorded. As a result, the 
cleaner may delete these WAL files before the next backup can process them, 
leading to a FileNotFoundException.


>  Incremental backups fail with FileNotFoundException when trying to process 
> WAL files from RegionServers that were added to the cluster after the last 
> successful backup.
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29744
>                 URL: https://issues.apache.org/jira/browse/HBASE-29744
>             Project: HBase
>          Issue Type: Bug
>          Components: backup&restore
>            Reporter: Hernan Gelaf-Romer
>            Priority: Major
>
> Incremental backups can fail with a FileNotFoundException when trying to 
> process Write-Ahead Log (WAL) files from RegionServers that were added to the 
> cluster after the last successful backup.
> The issue occurs in {{BackupLogCleaner.canDeleteFile()}}, which checks timestamp 
> boundaries (stored in the backup system table) to determine if WAL files are 
> safe to delete. When no boundary exists for a RegionServer address, the 
> cleaner incorrectly assumes that the WALs can safely be deleted and returns 
> true. This situation arises when a new RegionServer is added between backups. 
> The new server generates WAL files for tables, but since a backup has not yet 
> completed, no timestamp boundary for this server is recorded. As a result, 
> the cleaner may delete these WAL files before the next backup can process 
> them, leading to a FileNotFoundException.
>
> Additionally, I believe this can lead to data loss.
>
> When an incremental backup runs, 
> {{IncrementalBackupManager.getLogFilesForNewBackup()}} scans the filesystem 
> and builds a list of WAL files to back up, including files from newly added 
> RegionServers. Before the backup processes these files, {{BackupLogCleaner}} 
> runs concurrently and checks timestamp boundaries to determine which files 
> can be safely deleted. When it finds no timestamp boundary for a new server, 
> it incorrectly assumes the WALs are safe to delete and removes them.
> When the backup later tries to process the deleted files, 
> {{IncrementalTableBackupClient.filterMissingFiles()}} is called to validate 
> the file list. For each missing file, this method only logs a warning message 
> and silently excludes it from the backup. The backup then continues and 
> completes with a successful status, even though data from the deleted WAL 
> files was never backed up.
> This results in permanent data loss with no failure indication: the backup 
> appears successful, the source WAL files are permanently deleted, and the 
> only evidence is a warning message in the logs that may go unnoticed. The 
> data from those WAL files cannot be recovered because both the backup and the 
> source are missing that data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
