[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672059#comment-16672059 ]
Ted Yu commented on HBASE-21387: -------------------------------- Thanks for the review, Josh. The cache was introduced by HBASE-6865. Let me dig some more in order to better assess the relationship between the callers of refreshCache(). Meanwhile, I was looking at another aspect - in progress snapshot(s). Note the existing check: {code} if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { {code} which only excludes the temp dir, but not in progress snapshot(s). I think something such as the following would be more appropriate : {code} diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/ index 358b4ea..c303667 100644 --- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java @@ -232,7 +232,8 @@ public class SnapshotFileCache implements Stoppable { Map<String, SnapshotDirectoryInfo> known = new HashMap<>(); // 3. check each of the snapshot directories - FileStatus[] snapshots = FSUtils.listStatus(fs, snapshotDir); + FileStatus[] snapshots = fs.listStatus(snapshotDir, + new SnapshotDescriptionUtils.CompletedSnaphotDirectoriesFilter(fs)); if (snapshots == null) { // remove all the remembered snapshots because we don't have any left if (LOG.isDebugEnabled() && this.snapshots.size() > 0) { {code} > Race condition in snapshot cache refreshing leads to loss of snapshot files > --------------------------------------------------------------------------- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu > Assignee: Ted Yu > Priority: Major > Labels: snapshot > Attachments: 21387.v1.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding SnapshotFileCache#refreshCache(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > // if the snapshot directory wasn't modified since we last check, we are > done > if (dirStatus.getModificationTime() <= this.lastModifiedTime) return; > // 1. update the modified time > this.lastModifiedTime = dirStatus.getModificationTime(); > // 2.clear the cache > this.cache.clear(); > {code} > Suppose the RefreshCacheTask runs past the if check and sets > this.lastModifiedTime > The cleaner executes refreshCache and returns immediately since > this.lastModifiedTime matches the modification time of the directory. > Now RefreshCacheTask clears the cache. By the time the cleaner performs cache > lookup, the cache is empty. > Therefore cleaner puts the file into unReferencedFiles - leading to data loss. -- This message was sent by Atlassian JIRA (v7.6.3#76005)