Ted Yu created HBASE-21387:
------------------------------

             Summary: Race condition in snapshot cache refreshing leads to loss of snapshot files
                 Key: HBASE-21387
                 URL: https://issues.apache.org/jira/browse/HBASE-21387
             Project: HBase
          Issue Type: Bug
            Reporter: Ted Yu
            Assignee: Ted Yu

A customer recently reported an ExportSnapshot failure:

{code}
2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table.
{code}

We found the following in the log, about nine seconds earlier:

{code}
2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive
{code}

The root cause is a race condition surrounding SnapshotFileCache#refreshCache().
There are two callers of refreshCache: one from RefreshCacheTask#run and the other from SnapshotHFileCleaner.

Let's look at the code of refreshCache:

{code}
    // if the snapshot directory wasn't modified since we last check, we are done
    if (dirStatus.getModificationTime() <= this.lastModifiedTime) return;

    // 1. update the modified time
    this.lastModifiedTime = dirStatus.getModificationTime();

    // 2.clear the cache
    this.cache.clear();
{code}

Suppose the RefreshCacheTask runs past the if check and sets this.lastModifiedTime.
The cleaner then executes refreshCache and returns immediately, since this.lastModifiedTime now matches the modification time of the directory.
Now RefreshCacheTask clears the cache.
By the time the cleaner performs its cache lookup, the cache is empty.
Therefore the cleaner puts the file into unReferencedFiles, and the file is deleted from the archive, leading to data loss.
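
The interleaving described above can be replayed deterministically in a small toy model. This is only a sketch, not the HBase code: the class SnapshotCacheRaceSketch, its fields, and its methods are hypothetical stand-ins, and refreshCache is split into separate steps so the order of operations can be forced from a single thread. The synchronized variant at the end is one possible remedy under these assumptions, not necessarily the fix adopted for this issue.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Toy model of the race (hypothetical names; not the real SnapshotFileCache). */
final class SnapshotCacheRaceSketch {
    private long lastModifiedTime = 0L;
    private final Set<String> cache = new HashSet<>();
    private final long dirModificationTime;     // assumed mtime of the snapshot directory
    private final Set<String> filesInSnapshots; // assumed files referenced by snapshots on disk

    SnapshotCacheRaceSketch(long dirModificationTime, Set<String> filesInSnapshots) {
        this.dirModificationTime = dirModificationTime;
        this.filesInSnapshots = filesInSnapshots;
    }

    /** The "if" check plus step 1; returns false when the caller would return early. */
    boolean checkAndMarkFresh() {
        if (dirModificationTime <= lastModifiedTime) return false;
        lastModifiedTime = dirModificationTime;
        return true;
    }

    /** Step 2: clear the cache. */
    void clearCache() { cache.clear(); }

    /** Later step: repopulate the cache from the snapshot directory. */
    void reload() { cache.addAll(filesInSnapshots); }

    /** Possible remedy: check, clear, and reload under one lock. */
    synchronized void refreshCacheAtomically() {
        if (!checkAndMarkFresh()) return;
        clearCache();
        reload();
    }

    /** Lookup that holds the same lock as the refresh. */
    synchronized boolean containsAfterAtomicRefresh(String file) {
        refreshCacheAtomically();
        return cache.contains(file);
    }

    /** Replays the bad interleaving; true means the cleaner wrongly missed the file. */
    public static boolean replayRace() {
        String hfile = "44f6c3c646e84de6a63fe30da4fcb3aa";
        SnapshotCacheRaceSketch c =
            new SnapshotCacheRaceSketch(100L, Collections.singleton(hfile));
        boolean taskRefreshes = c.checkAndMarkFresh();    // RefreshCacheTask passes the check
        boolean cleanerRefreshes = c.checkAndMarkFresh(); // cleaner returns immediately
        c.clearCache();                                   // RefreshCacheTask clears the cache
        boolean cleanerFound = c.cache.contains(hfile);   // cleaner looks up an empty cache
        c.reload();                                       // repopulation arrives too late
        return taskRefreshes && !cleanerRefreshes && !cleanerFound;
    }

    /** Same lookup through the locked path; true means the file was found. */
    public static boolean lookupWithAtomicRefresh() {
        String hfile = "44f6c3c646e84de6a63fe30da4fcb3aa";
        SnapshotCacheRaceSketch c =
            new SnapshotCacheRaceSketch(100L, Collections.singleton(hfile));
        return c.containsAfterAtomicRefresh(hfile);
    }

    public static void main(String[] args) {
        System.out.println("race loses file: " + replayRace());
        System.out.println("atomic refresh finds file: " + lookupWithAtomicRefresh());
    }
}
```

In the model, the problem is that lastModifiedTime is published before the cache contents are consistent with it, so a second caller can trust a cache that is about to be (or has just been) emptied. Holding one lock across the check, the clear, the reload, and the lookup closes that window in this sketch.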