[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671033#comment-16671033 ]
Ted Yu commented on HBASE-21387: -------------------------------- In the old code, to simulate the race condition, we can use CountDownLatch. Here is a sketch: {code} diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/ index 358b4ea..2941400 100644 --- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java @@ -27,6 +27,7 @@ import java.util.Map; import java.util.Set; import java.util.Timer; import java.util.TimerTask; +import java.util.concurrent.CountDownLatch; import java.util.concurrent.locks.ReentrantLock; import org.apache.hadoop.conf.Configuration; @@ -92,6 +93,8 @@ public class SnapshotFileCache implements Stoppable { private final SnapshotFileInspector fileInspector; private final Path snapshotDir; private final Set<String> cache = new HashSet<>(); + private final CountDownLatch latchRefresh = new CountDownLatch(1); + private final CountDownLatch latchContains = new CountDownLatch(1); /** * This is a helper map of information about the snapshot directories so we don't need to rescan * them if they haven't changed since the last time we looked. @@ -180,16 +183,18 @@ public class SnapshotFileCache implements Stoppable { // cache, but that seems overkill at the moment and isn't necessarily a bottleneck. public synchronized Iterable<FileStatus> getUnreferencedFiles(Iterable<FileStatus> files, final SnapshotManager snapshotManager) - throws IOException { + throws IOException, InterruptedException { List<FileStatus> unReferencedFiles = Lists.newArrayList(); List<String> snapshotsInProgress = null; boolean refreshed = false; for (FileStatus file : files) { String fileName = file.getPath().getName(); if (!refreshed && !cache.contains(fileName)) { + latchRefresh.await(); refreshCache(); refreshed = true; } + latchContains.await(); if (cache.contains(fileName)) { continue; } @@ -226,9 +231,11 @@ public class SnapshotFileCache implements Stoppable { // 1. update the modified time this.lastModifiedTime = dirStatus.getModificationTime(); + latchRefresh.countDown(); // 2.clear the cache this.cache.clear(); + latchContains.countDown(); Map<String, SnapshotDirectoryInfo> known = new HashMap<>(); // 3. check each of the snapshot directories {code} With the fix, the race condition is gone. > Race condition in snapshot cache refreshing leads to loss of snapshot files > --------------------------------------------------------------------------- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu > Assignee: Ted Yu > Priority: Major > Labels: snapshot > Attachments: 21387.v1.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding SnapshotFileCache#refreshCache(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > // if the snapshot directory wasn't modified since we last check, we are > done > if (dirStatus.getModificationTime() <= this.lastModifiedTime) return; > // 1. update the modified time > this.lastModifiedTime = dirStatus.getModificationTime(); > // 2.clear the cache > this.cache.clear(); > {code} > Suppose the RefreshCacheTask runs past the if check and sets > this.lastModifiedTime > The cleaner executes refreshCache and returns immediately since > this.lastModifiedTime matches the modification time of the directory. > Now RefreshCacheTask clears the cache. By the time the cleaner performs cache > lookup, the cache is empty. > Therefore cleaner puts the file into unReferencedFiles - leading to data loss. -- This message was sent by Atlassian JIRA (v7.6.3#76005)