[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ted Yu updated HBASE-21387:
---------------------------
    Attachment: 21387.v1.txt

> Race condition in snapshot cache refreshing leads to loss of snapshot files
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-21387
>                 URL: https://issues.apache.org/jira/browse/HBASE-21387
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Major
>         Attachments: 21387.v1.txt
>
>
> In a recent customer report, ExportSnapshot failed with:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table.
> {code}
> We found the following in the log, about nine seconds earlier:
> {code}
> 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive
> {code}
> The root cause is a race condition surrounding SnapshotFileCache#refreshCache().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>     // if the snapshot directory wasn't modified since we last check, we are done
>     if (dirStatus.getModificationTime() <= this.lastModifiedTime) return;
>     // 1. update the modified time
>     this.lastModifiedTime = dirStatus.getModificationTime();
>     // 2.clear the cache
>     this.cache.clear();
> {code}
> Suppose the RefreshCacheTask runs past the if check and sets this.lastModifiedTime.
> The cleaner then executes refreshCache and returns immediately, since this.lastModifiedTime already matches the modification time of the directory.
> Now RefreshCacheTask clears the cache.
> By the time the cleaner performs its cache lookup, the cache is empty.
> Therefore the cleaner puts the file into unReferencedFiles, leading to data loss.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
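
The interleaving described in the issue can be replayed deterministically in a small model. The sketch below is not the HBase code and not the attached patch: the class `SnapshotCacheSketch`, its methods, and the file name `hfile-a` are hypothetical, and the reload step after `clear()` is an assumption about what a refresh does next. It splits the refresh into the steps quoted above, replays them in the problematic order on a single thread, and then shows one possible remedy, which is to make the check, the timestamp update, the clear, and the reload a single critical section shared with the lookup.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the cache; NOT the real SnapshotFileCache.
final class SnapshotCacheSketch {
    private long lastModifiedTime = 0L;
    private final Set<String> cache = new HashSet<>();

    SnapshotCacheSketch(Set<String> initiallyCached) {
        cache.addAll(initiallyCached);
    }

    // Racy step A: the "if" check plus "1. update the modified time".
    // Returns false when the caller would return early and trust the cache.
    boolean checkAndUpdateTime(long dirModTime) {
        if (dirModTime <= lastModifiedTime) return false;
        lastModifiedTime = dirModTime;
        return true;
    }

    // Racy step B: "2. clear the cache".
    void clearCache() { cache.clear(); }

    // Racy step C (assumed): repopulate the cache from the snapshot directory.
    void reload(Set<String> snapshotFiles) { cache.addAll(snapshotFiles); }

    boolean contains(String file) { return cache.contains(file); }

    // One possible remedy: refresh and lookup share a lock, so no caller can
    // observe the window between updating the timestamp and reloading.
    synchronized boolean refreshAndContains(long dirModTime, Set<String> snapshotFiles, String file) {
        if (dirModTime > lastModifiedTime) {
            lastModifiedTime = dirModTime;
            cache.clear();
            cache.addAll(snapshotFiles);
        }
        return cache.contains(file);
    }

    // Replays the interleaving from the issue; returns what the cleaner sees.
    static boolean replayRacyInterleaving() {
        Set<String> files = Collections.singleton("hfile-a");
        SnapshotCacheSketch c = new SnapshotCacheSketch(files);
        long dirModTime = 100L;
        c.checkAndUpdateTime(dirModTime);                      // task: passes check, sets time
        boolean cleanerRefreshes = c.checkAndUpdateTime(dirModTime); // cleaner: returns early
        c.clearCache();                                        // task: clears the cache
        boolean referenced = !cleanerRefreshes && c.contains("hfile-a"); // cleaner: lookup
        c.reload(files);                                       // task: reloads, too late
        return referenced;
    }

    // Same two callers, but each one runs the refresh as one atomic unit.
    static boolean replaySerializedRefresh() {
        Set<String> files = Collections.singleton("hfile-a");
        SnapshotCacheSketch c = new SnapshotCacheSketch(files);
        c.refreshAndContains(100L, files, "hfile-a");          // task
        return c.refreshAndContains(100L, files, "hfile-a");   // cleaner
    }

    public static void main(String[] args) {
        System.out.println("racy replay, file seen as referenced: " + replayRacyInterleaving());
        System.out.println("serialized replay, file seen as referenced: " + replaySerializedRefresh());
    }
}
```

In the racy replay the cleaner finds the cache empty and would treat `hfile-a` as unreferenced; in the serialized replay the lookup cannot run until the refresh has completed. Whether locking the whole refresh is acceptable for the real cleaner depends on how long a directory reload takes, which this model does not capture.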