Ted Yu created HBASE-21387:
------------------------------

             Summary: Race condition in snapshot cache refreshing leads to loss 
of snapshot files
                 Key: HBASE-21387
                 URL: https://issues.apache.org/jira/browse/HBASE-21387
             Project: HBase
          Issue Type: Bug
            Reporter: Ted Yu
            Assignee: Ted Yu


During recent report from customer where ExportSnapshot failed:
{code}
2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
snapshot.SnapshotReferenceUtil: Can't find hfile: 
44f6c3c646e84de6a63fe30da4fcb3aa in the real 
(hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
 or archive 
(hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
 directory for the primary table. 
{code}
We found the following in log:
{code}
2018-10-09 18:54:23,675 DEBUG 
[00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
cleaner.HFileCleaner: Removing: 
hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
from archive
{code}
The root cause is race condition surrounding SnapshotFileCache#refreshCache().
There are two callers of refreshCache: one from RefreshCacheTask#run and the 
other from SnapshotHFileCleaner.
Let's look at the code of refreshCache:
{code}
    // if the snapshot directory wasn't modified since we last check, we are 
done
    if (dirStatus.getModificationTime() <= this.lastModifiedTime) return;

    // 1. update the modified time
    this.lastModifiedTime = dirStatus.getModificationTime();

    // 2.clear the cache
    this.cache.clear();
{code}
Suppose the RefreshCacheTask runs past the if check and sets 
this.lastModifiedTime
The cleaner executes refreshCache and returns immediately since 
this.lastModifiedTime matches the modification time of the directory.
Now RefreshCacheTask clears the cache. By the time the cleaner performs cache 
lookup, the cache is empty.
Therefore cleaner puts the file into unReferencedFiles - leading to data loss.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to