[
https://issues.apache.org/jira/browse/HBASE-29744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hernan Gelaf-Romer updated HBASE-29744:
---------------------------------------
Description:
Incremental backups can fail with a FileNotFoundException when trying to
process Write-Ahead Log (WAL) files from RegionServers that were added to the
cluster after the last successful backup.
The issue occurs in BackupLogCleaner.canDeleteFile(), which checks timestamp
boundaries (stored in the backup system table) to determine if WAL files are
safe to delete. When no boundary exists for a RegionServer address, the cleaner
incorrectly assumes that the WALs can safely be deleted and returns true. This
situation arises when a new RegionServer is added between backups. The new
server generates WAL files for tables, but since a backup has not yet
completed, no timestamp boundary for this server is recorded. As a result, the
cleaner may delete these WAL files before the next backup can process them,
leading to a FileNotFoundException.
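
The check described above can be illustrated with a minimal, self-contained sketch. This is not the actual BackupLogCleaner code: the class name, method names, the map of per-server boundaries, and the `<=` comparison are all assumptions made for illustration. The conservative variant is only one possible remedy, not necessarily the fix adopted for this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and signatures are hypothetical and do not
// mirror the real BackupLogCleaner implementation.
public class WalCleanerSketch {

    // Behavior as described in this issue: a RegionServer with no recorded
    // timestamp boundary is treated as if its WALs were safe to delete.
    static boolean canDeleteBuggy(Map<String, Long> boundaries,
                                  String serverAddress, long walTimestamp) {
        Long boundary = boundaries.get(serverAddress);
        if (boundary == null) {
            return true; // no boundary recorded, WAL assumed deletable
        }
        return walTimestamp <= boundary; // comparison direction is an assumption
    }

    // Conservative alternative (an assumption, not the confirmed fix): with no
    // boundary, no backup has covered this server yet, so keep the WAL.
    static boolean canDeleteConservative(Map<String, Long> boundaries,
                                         String serverAddress, long walTimestamp) {
        Long boundary = boundaries.get(serverAddress);
        if (boundary == null) {
            return false; // unknown server, retain its WALs
        }
        return walTimestamp <= boundary;
    }

    public static void main(String[] args) {
        Map<String, Long> boundaries = new HashMap<>();
        boundaries.put("rs1.example.com:16020", 1_000L); // server seen by last backup
        String newServer = "rs2.example.com:16020";      // added after last backup

        System.out.println(canDeleteBuggy(boundaries, newServer, 1_500L));        // true
        System.out.println(canDeleteConservative(boundaries, newServer, 1_500L)); // false
    }
}
```

In the sketch, the only difference between the two variants is the return value when the boundary lookup comes back empty, which is the case a newly added RegionServer hits.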
Additionally, I believe this can lead to data loss.
When an incremental backup runs,
{{IncrementalBackupManager.getLogFilesForNewBackup()}} scans the filesystem and
builds a list of WAL files to back up, including files from newly added
RegionServers. If {{BackupLogCleaner}} runs concurrently, before the backup has
processed these files, it checks timestamp boundaries to determine which files
can be safely deleted. When it finds no timestamp boundary for a new server, it
incorrectly assumes the WALs are safe to delete and removes them.
When the backup later tries to process the deleted files,
{{IncrementalTableBackupClient.filterMissingFiles()}} is called to validate the
file list. For each missing file, this method only logs a warning message and
silently excludes it from the backup. The backup then continues and completes
with a successful status, even though data from the deleted WAL files was never
backed up.
This results in permanent data loss with no failure indication: the backup
appears successful, the source WAL files are permanently deleted, and the only
evidence is a warning message in the logs that may go unnoticed. The data from
those WAL files cannot be recovered because both the backup and the source are
missing that data.
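
The silent-exclusion behavior can be sketched as follows. This is a simplified model, not the real {{IncrementalTableBackupClient.filterMissingFiles()}}: the class name, method names, and the in-memory set standing in for the filesystem are hypothetical. The strict variant is an assumed alternative shown only for contrast.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; the set of existing paths stands in for a real
// filesystem existence check, and all names are hypothetical.
public class MissingWalSketch {

    // Behavior as described in this issue: a missing file is logged as a
    // warning and dropped, and the caller proceeds with the remaining files.
    static List<String> filterMissingLenient(List<String> walFiles, Set<String> existing) {
        List<String> present = new ArrayList<>();
        for (String file : walFiles) {
            if (existing.contains(file)) {
                present.add(file);
            } else {
                System.err.println("WARN: missing WAL file skipped: " + file);
            }
        }
        return present;
    }

    // Strict alternative (an assumption, not the confirmed fix): a missing
    // file aborts the operation instead of being silently excluded.
    static List<String> requireAllPresent(List<String> walFiles, Set<String> existing) {
        for (String file : walFiles) {
            if (!existing.contains(file)) {
                throw new IllegalStateException("WAL file missing before backup: " + file);
            }
        }
        return walFiles;
    }

    public static void main(String[] args) {
        List<String> planned = Arrays.asList("wal-rs1-001", "wal-rs2-001");
        Set<String> onDisk = new HashSet<>(Arrays.asList("wal-rs1-001")); // rs2 WAL already cleaned

        System.out.println(filterMissingLenient(planned, onDisk)); // backup proceeds without rs2 data
        try {
            requireAllPresent(planned, onDisk);
        } catch (IllegalStateException e) {
            System.out.println("backup failed: " + e.getMessage());
        }
    }
}
```

With the lenient variant the caller cannot distinguish a complete file list from a truncated one, which is why the backup can report success while omitting data.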
> Incremental backups fail with FileNotFoundException when trying to process
> WAL files from RegionServers that were added to the cluster after the last
> successful backup.
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-29744
> URL: https://issues.apache.org/jira/browse/HBASE-29744
> Project: HBase
> Issue Type: Bug
> Components: backup&restore
> Reporter: Hernan Gelaf-Romer
> Priority: Major