[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696374#comment-16696374 ]
Hudson commented on HBASE-21387: -------------------------------- SUCCESS: Integrated in Jenkins build HBase-1.3-IT #503 (See [https://builds.apache.org/job/HBase-1.3-IT/503/]) HBASE-21387 Race condition surrounding in progress snapshot handling in (openinx: rev 53b5abc6513ac7f716b8011a02685ae0f2dc3eef) * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/FileCleanerDelegate.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/EnabledTableSnapshotHandler.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotManager.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/TakeSnapshotHandler.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/DisabledTableSnapshotHandler.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotHFileCleaner.java * (add) hbase-server/src/test/java/org/apache/hadoop/hbase/snapshot/TestSnapshotWhenChoreCleaning.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/BaseFileCleanerDelegate.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/CleanerChore.java * (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/snapshot/TestSnapshotFileCache.java * (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/snapshot/TestSnapshotHFileCleaner.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/HFileCleaner.java > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > ---------------------------------------------------------------------------------------------------------- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu > Assignee: Ted Yu > Priority: Major > Labels: snapshot > Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10 > > Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.dbg.txt, > 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, > 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, > HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, > HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, > HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, > two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. > Here is timeline given by Josh illustrating the scenario: > At time T0, we are checking if F1 is referenced. At time T1, there is a > snapshot S1 in progress that is referencing a file F1. refreshCache() is > called, but no completed snapshot references F1. At T2, the snapshot S1, > which references F1, completes. At T3, we check in-progress snapshots and S1 > is not included. Thus, F1 is marked as unreferenced even though S1 references > it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)