[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672059#comment-16672059
 ] 

Ted Yu commented on HBASE-21387:
--------------------------------

Thanks for the review, Josh.
The cache was introduced by HBASE-6865.

Let me dig some more in order to better assess the relationship between the 
callers of refreshCache().

Meanwhile, I was looking at another aspect - in progress snapshot(s).
Note the existing check:
{code}
      if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
{code}
which only excludes the temp dir, but not in progress snapshot(s).
I think something such as the following would be more appropriate :
{code}
diff --git 
a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java
 b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/
index 358b4ea..c303667 100644
--- 
a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java
+++ 
b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java
@@ -232,7 +232,8 @@ public class SnapshotFileCache implements Stoppable {
     Map<String, SnapshotDirectoryInfo> known = new HashMap<>();

     // 3. check each of the snapshot directories
-    FileStatus[] snapshots = FSUtils.listStatus(fs, snapshotDir);
+    FileStatus[] snapshots = fs.listStatus(snapshotDir,
+        new SnapshotDescriptionUtils.CompletedSnaphotDirectoriesFilter(fs));
     if (snapshots == null) {
       // remove all the remembered snapshots because we don't have any left
       if (LOG.isDebugEnabled() && this.snapshots.size() > 0) {
{code}

> Race condition in snapshot cache refreshing leads to loss of snapshot files
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-21387
>                 URL: https://issues.apache.org/jira/browse/HBASE-21387
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Major
>              Labels: snapshot
>         Attachments: 21387.v1.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding SnapshotFileCache#refreshCache().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>     // if the snapshot directory wasn't modified since we last check, we are 
> done
>     if (dirStatus.getModificationTime() <= this.lastModifiedTime) return;
>     // 1. update the modified time
>     this.lastModifiedTime = dirStatus.getModificationTime();
>     // 2.clear the cache
>     this.cache.clear();
> {code}
> Suppose the RefreshCacheTask runs past the if check and sets 
> this.lastModifiedTime
> The cleaner executes refreshCache and returns immediately since 
> this.lastModifiedTime matches the modification time of the directory.
> Now RefreshCacheTask clears the cache. By the time the cleaner performs cache 
> lookup, the cache is empty.
> Therefore cleaner puts the file into unReferencedFiles - leading to data loss.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to