[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Hu updated HBASE-21387: ----------------------------- Fix Version/s: 1.2.10 2.1.2 1.4.9 2.0.3 2.2.0 1.3.3 1.5.0 3.0.0 > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > ---------------------------------------------------------------------------------------------------------- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu > Assignee: Ted Yu > Priority: Major > Labels: snapshot > Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10 > > Attachments: 0001-UT.patch, 21387.dbg.txt, 21387.v10.txt, > 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, HBASE-21387.v13.patch, > HBASE-21387.v14.patch, HBASE-21387.v15.patch, HBASE-21387.v16.patch, > HBASE-21387.v17.patch, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, > two-pass-cleaner.v9.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. > Here is timeline given by Josh illustrating the scenario: > At time T0, we are checking if F1 is referenced. At time T1, there is a > snapshot S1 in progress that is referencing a file F1. refreshCache() is > called, but no completed snapshot references F1. At T2, the snapshot S1, > which references F1, completes. At T3, we check in-progress snapshots and S1 > is not included. Thus, F1 is marked as unreferenced even though S1 references > it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)