[jira] [Updated] (HBASE-29744) Data loss scenario for WAL files belonging to RS added between backups

Nick Dimiduk (Jira) Mon, 29 Dec 2025 07:29:07 -0800


     [ 
https://issues.apache.org/jira/browse/HBASE-29744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nick Dimiduk updated HBASE-29744:
---------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Pushed to branch-2.6+. Thanks [~hgromer]!

> Data loss scenario for WAL files belonging to RS added between backups
> ----------------------------------------------------------------------
>
>                 Key: HBASE-29744
>                 URL: https://issues.apache.org/jira/browse/HBASE-29744
>             Project: HBase
>          Issue Type: Bug
>          Components: backup&restore
>            Reporter: Hernan Gelaf-Romer
>            Assignee: Hernan Gelaf-Romer
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.5
>
>
> Incremental backups can fail with a FileNotFoundException when trying to 
> process Write-Ahead Log (WAL) files from RegionServers that were added to the 
> cluster after the last successful backup.
> The issue occurs in BackupLogCleaner.canDeleteFile(), which checks timestamp 
> boundaries (stored in the backup system table) to determine if WAL files are 
> safe to delete. When no boundary exists for a RegionServer address, the 
> cleaner incorrectly assumes that the WALs can safely be deleted and returns 
> true. This situation arises when a new RegionServer is added between backups. 
> The new server generates WAL files for tables, but since a backup has not yet 
> completed, no timestamp boundary for this server is recorded. As a result, 
> the cleaner may delete these WAL files before the next backup can process 
> them, leading to a FileNotFoundException.
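
The boundary check described above can be illustrated with a minimal sketch. This is not the HBase code or the committed patch: the class, the method names, the map of per-server boundaries, and the "timestamp at or below the boundary" comparison are all assumptions made for illustration. It contrasts the reported behavior (no boundary means deletable) with one conservative alternative (no boundary means keep).

```java
import java.util.Map;

// Hypothetical stand-in for the decision described in the report.
// None of these names come from HBase; they are illustrative only.
final class WalCleanerSketch {

    // Reported behavior: a server with no recorded boundary is treated
    // as if all of its WALs had already been backed up.
    static boolean canDeleteBuggy(Map<String, Long> boundaries, String server, long walTs) {
        Long boundary = boundaries.get(server);
        if (boundary == null) {
            return true; // new RegionServer: WAL is deleted before any backup sees it
        }
        return walTs <= boundary; // assumed comparison: covered by a completed backup
    }

    // One conservative alternative (an assumption, not necessarily the fix
    // that was committed): an unknown server's WALs are kept until a backup
    // has recorded a boundary for that server.
    static boolean canDeleteConservative(Map<String, Long> boundaries, String server, long walTs) {
        Long boundary = boundaries.get(server);
        if (boundary == null) {
            return false; // unknown server: keep the WAL
        }
        return walTs <= boundary;
    }
}
```

With a boundary recorded only for a hypothetical server "rs1:16020", a WAL from "rs2:16020" is deletable under the first method and retained under the second.
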
>  
> Additionally, I believe this can lead to data loss.
>  
> When an incremental backup runs, 
> {{IncrementalBackupManager.getLogFilesForNewBackup()}} scans the filesystem 
> and builds a list of WAL files to back up, including files from newly added 
> RegionServers. Before the backup processes these files, {{BackupLogCleaner}} 
> runs concurrently and checks timestamp boundaries to determine which files 
> can be safely deleted. When it finds no timestamp boundary for a new server, 
> it incorrectly assumes the WALs are safe to delete and removes them.
> When the backup later tries to process the deleted files, 
> {{IncrementalTableBackupClient.filterMissingFiles()}} is called to validate 
> the file list. For each missing file, this method only logs a warning message 
> and silently excludes it from the backup. The backup then continues and 
> completes with a successful status, even though data from the deleted WAL 
> files was never backed up.
> This results in permanent data loss with no failure indication: the backup 
> appears successful, the source WAL files are permanently deleted, and the 
> only evidence is a warning message in the logs that may go unnoticed. The 
> data from those WAL files cannot be recovered because both the backup and the 
> source are missing that data.
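
The difference between silently dropping missing WALs and failing the backup can be sketched as follows. This is a simplified model, not the code of {{IncrementalTableBackupClient.filterMissingFiles()}}: the class, both method names, and the use of plain strings and a set in place of filesystem paths are assumptions made for illustration.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model of the two ways a backup could treat WAL files that
// were planned for backup but no longer exist. Names are illustrative only.
final class MissingWalSketch {

    // Lenient handling, as described in the report: a missing file is
    // logged and dropped, and the caller proceeds as if nothing happened.
    static List<String> filterLenient(List<String> planned, Set<String> existing) {
        List<String> kept = new ArrayList<>();
        for (String wal : planned) {
            if (existing.contains(wal)) {
                kept.add(wal);
            } else {
                System.err.println("WARN: missing WAL skipped: " + wal);
            }
        }
        return kept;
    }

    // Strict handling (an assumption, shown for contrast): any missing
    // file aborts the backup so the gap cannot go unnoticed.
    static List<String> requireAll(List<String> planned, Set<String> existing)
            throws FileNotFoundException {
        for (String wal : planned) {
            if (!existing.contains(wal)) {
                throw new FileNotFoundException("WAL planned for backup is missing: " + wal);
            }
        }
        return planned;
    }
}
```

In the lenient variant the returned list is simply shorter than the planned list, which is why the backup can report success while data is absent from it.
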



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
