[
https://issues.apache.org/jira/browse/HBASE-29744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nick Dimiduk updated HBASE-29744:
---------------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)
Pushed to branch-2.6+. Thanks [~hgromer]!
> Data loss scenario for WAL files belonging to RS added between backups
> ----------------------------------------------------------------------
>
> Key: HBASE-29744
> URL: https://issues.apache.org/jira/browse/HBASE-29744
> Project: HBase
> Issue Type: Bug
> Components: backup&restore
> Reporter: Hernan Gelaf-Romer
> Assignee: Hernan Gelaf-Romer
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.5
>
>
> Incremental backups can fail with a FileNotFoundException when trying to
> process Write-Ahead Log (WAL) files from RegionServers that were added to the
> cluster after the last successful backup.
> The issue occurs in BackupLogCleaner.canDeleteFile(), which checks timestamp
> boundaries (stored in the backup system table) to determine if WAL files are
> safe to delete. When no boundary exists for a RegionServer address, the
> cleaner incorrectly assumes that the WALs can safely be deleted and returns
> true. This situation arises when a new RegionServer is added between backups.
> The new server generates WAL files for tables, but because no backup has
> completed since it joined, no timestamp boundary is recorded for it. As a
> result, the cleaner may delete these WAL files before the next backup can
> process them, leading to a FileNotFoundException.
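The decision described above can be modeled in a few lines. This is a minimal sketch, not the actual BackupLogCleaner code: the class name, the method names, the Map-based boundary lookup, and the "older than or equal to the boundary" comparison are all assumptions made for illustration. It contrasts the reported behavior (no boundary means deletable) with a conservative variant that retains WALs from servers the backup system has not seen yet.

```java
import java.util.Map;

/** Simplified, hypothetical model of the cleaner's decision; not HBase code. */
final class WalCleanerSketch {

    /**
     * Behavior described in the report: a server with no recorded boundary
     * is treated as having nothing to protect, so its WALs are deletable.
     */
    static boolean canDeleteReported(Map<String, Long> boundaries,
                                     String server, long walTimestamp) {
        Long boundary = boundaries.get(server);
        if (boundary == null) {
            // Unsafe: the server may simply have joined after the last backup.
            return true;
        }
        // Assumed rule: WALs at or before the boundary were already backed up.
        return walTimestamp <= boundary;
    }

    /** Conservative variant: WALs from an unknown server are retained. */
    static boolean canDeleteConservative(Map<String, Long> boundaries,
                                         String server, long walTimestamp) {
        Long boundary = boundaries.get(server);
        return boundary != null && walTimestamp <= boundary;
    }
}
```

With boundaries recorded only for "rs1:16020", a WAL from "rs2:16020" is deletable under the first method and retained under the second.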
>
> Additionally, I believe this can lead to data loss.
>
> When an incremental backup runs,
> {{IncrementalBackupManager.getLogFilesForNewBackup()}} scans the filesystem
> and builds a list of WAL files to back up, including files from newly added
> RegionServers. Before the backup processes these files, {{BackupLogCleaner}}
> runs concurrently and checks timestamp boundaries to determine which files
> can be safely deleted. When it finds no timestamp boundary for a new server,
> it incorrectly assumes the WALs are safe to delete and removes them.
> When the backup later tries to process the deleted files,
> {{IncrementalTableBackupClient.filterMissingFiles()}} is called to validate
> the file list. For each missing file, this method only logs a warning message
> and silently excludes it from the backup. The backup then continues and
> completes with a successful status, even though data from the deleted WAL
> files was never backed up.
> This results in permanent data loss with no failure indication: the backup
> appears successful, the source WAL files are permanently deleted, and the
> only evidence is a warning message in the logs that may go unnoticed. The
> data from those WAL files cannot be recovered because both the backup and the
> source are missing that data.
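The silent-skip behavior described above can be illustrated with a small model. This is a sketch under stated assumptions, not the code of IncrementalTableBackupClient.filterMissingFiles(): the class and method names are hypothetical, and file existence is modeled with a Predicate rather than a real filesystem call. The lenient method mirrors the reported behavior (warn and drop), while the strict method shows one way a backup could fail loudly instead of completing without the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Simplified, hypothetical model of missing-WAL handling; not HBase code. */
final class MissingWalSketch {

    /** Reported behavior: missing files are logged, dropped, and the backup proceeds. */
    static List<String> filterMissingLenient(List<String> walFiles,
                                             Predicate<String> exists) {
        List<String> present = new ArrayList<>();
        for (String file : walFiles) {
            if (exists.test(file)) {
                present.add(file);
            } else {
                // The only trace of the lost data is this warning.
                System.err.println("WARN: skipping missing WAL " + file);
            }
        }
        return present;
    }

    /** Strict variant: any missing file aborts the backup instead of being skipped. */
    static List<String> requireAllPresent(List<String> walFiles,
                                          Predicate<String> exists) {
        List<String> missing = new ArrayList<>();
        for (String file : walFiles) {
            if (!exists.test(file)) {
                missing.add(file);
            }
        }
        if (!missing.isEmpty()) {
            throw new IllegalStateException("WAL files missing before backup: " + missing);
        }
        return walFiles;
    }
}
```

Failing in the strict variant turns the scenario into a visible backup failure, so an operator learns of the gap while there may still be time to take a full backup.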