Bryan Beaudreault created HBASE-28462:
-----------------------------------------

             Summary: Incremental backup can fail if log gets archived while 
WALPlayer is starting up
                 Key: HBASE-28462
                 URL: https://issues.apache.org/jira/browse/HBASE-28462
             Project: HBase
          Issue Type: Bug
            Reporter: Bryan Beaudreault


We had an incremental backup fail with FileNotFoundException for a file in the 
WALs directory. Upon investigation, the log had been archived a few minutes 
earlier. WALInputFormat's record reader has support for falling back on an 
archived path:
{code:java}
} catch (IOException e) {
  Path archivedLog = AbstractFSWALProvider.findArchivedLog(logFile, conf);
  // archivedLog can be null if unable to locate in archiveDir.
  if (archivedLog != null) {
    openReader(archivedLog);
    // Try call again in recursion
    return nextKeyValue();
  } else {
    throw e;
  }
} {code}
But the getSplits method has different handling:
{code:java}
try {
  List<FileStatus> files = getFiles(fs, inputPath, startTime, endTime);
  allFiles.addAll(files);
} catch (FileNotFoundException e) {
  if (ignoreMissing) {
    LOG.warn("File " + inputPath + " is missing. Skipping it.");
    continue;
  }
  throw e;
} {code}
This ignoreMissing variable was added in HBASE-14141 and is enabled via 
wal.input.ignore.missing.files, which defaults to false and is never set. 
Looking at the comments and Review Board history of HBASE-14141, I think there 
might have been some confusion about where to handle these missing files, and 
this got lost in the shuffle.
 
I would prefer not to ignore missing files. I think that could result in some 
weird behavior:
 * RegionServer has 10 archived and 30 not-yet-archived WALs needing to be 
backed up
 * The process starts, and while it's running 1 of those 30 WALs gets archived. 
That would get skipped due to FileNotFoundException
 * But the remaining 29 would be backed up

This scenario could cause data consistency issues if this incremental 
backup is restored: we would have missed some edits in the middle of applied 
edits from other WALs.

So I do think failing rather than skipping, as we do today, is necessary for 
consistency, but failing on every archived log is unrealistic in a live 
cluster. The solution is to try finding the missing file in the archive 
directory. The backup system has a coprocessor which will not allow the 
archived file to be cleaned up until it's backed up, so I think it's safe to 
say that a WAL is always in either WALs or oldWALs.
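
To make the proposed fallback concrete, here is a minimal sketch of the lookup order. It is not the actual patch and does not use the HBase or Hadoop APIs: it uses java.nio.file so that it runs standalone. The class ArchivedWalFallback and the methods resolveWal and resolveAll are hypothetical names, and matching the archived copy by file name alone is an assumption about the directory layout.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the getSplits fallback; not HBase code.
public class ArchivedWalFallback {

  /**
   * Returns walFile if it still exists, otherwise the file with the same
   * name under archiveDir. Fails if the file is in neither place, so a
   * missing WAL is never silently skipped.
   */
  public static Path resolveWal(Path walFile, Path archiveDir)
      throws FileNotFoundException {
    if (Files.exists(walFile)) {
      return walFile;
    }
    // Assumption: the archived copy keeps the original file name.
    Path archived = archiveDir.resolve(walFile.getFileName());
    if (Files.exists(archived)) {
      return archived;
    }
    throw new FileNotFoundException(
        walFile + " not found, and no archived copy in " + archiveDir);
  }

  /** Resolves every input path, preserving order; fails on the first miss. */
  public static List<Path> resolveAll(List<Path> walFiles, Path archiveDir)
      throws FileNotFoundException {
    List<Path> resolved = new ArrayList<>();
    for (Path walFile : walFiles) {
      resolved.add(resolveWal(walFile, archiveDir));
    }
    return resolved;
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("wal-demo");
    Path wals = Files.createDirectories(root.resolve("WALs"));
    Path oldWals = Files.createDirectories(root.resolve("oldWALs"));

    Path live = Files.createFile(wals.resolve("wal.1"));
    // wal.2 was "archived" after the input list was built.
    Files.createFile(oldWals.resolve("wal.2"));
    Path moved = wals.resolve("wal.2");

    System.out.println(resolveAll(List.of(live, moved), oldWals));
  }
}
```

The point of the sketch is the ordering: try the original location, then the archive, and only then fail. That keeps today's fail-fast behavior for a WAL that is truly gone while tolerating one that was archived during startup.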



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
