[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673171#comment-16673171 ]
Ted Yu commented on HBASE-21387:
--------------------------------

From https://builds.apache.org/job/PreCommit-HBASE-Build/14932/console :
{code}
+1 overall

| Vote |   Subsystem |  Runtime   | Comment
============================================================================
|   0  |      reexec |   0m 11s   | Docker mode activated.
|   0  |       patch |   0m  2s   | The patch file was not named according
|      |             |            | to hbase's naming conventions. Please
|      |             |            | see
|      |             |            | https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for
|      |             |            | instructions.
|      |             |            | Prechecks
|  +1  |   hbaseanti |   0m  0s   | Patch does not have any anti-patterns.
|  +1  |     @author |   0m  0s   | The patch does not contain any @author
|      |             |            | tags.
|  -0  |  test4tests |   0m  0s   | The patch doesn't appear to include any
|      |             |            | new or modified tests. Please justify
|      |             |            | why no new tests are needed for this
|      |             |            | patch. Also please list what manual
|      |             |            | steps were performed to verify this
|      |             |            | patch.
|      |             |            | master Compile Tests
|  +1  |  mvninstall |   4m 49s   | master passed
|  +1  |     compile |   1m 46s   | master passed
|  +1  |  checkstyle |   1m  7s   | master passed
|  +1  |  shadedjars |   4m  2s   | branch has no errors when building our
|      |             |            | shaded downstream artifacts.
|  +1  |    findbugs |   2m  1s   | master passed
|  +1  |     javadoc |   0m 30s   | master passed
|      |             |            | Patch Compile Tests
|  +1  |  mvninstall |   4m 45s   | the patch passed
|  +1  |     compile |   1m 50s   | the patch passed
|  +1  |       javac |   1m 50s   | the patch passed
|  +1  |  checkstyle |   1m  4s   | the patch passed
|  +1  |  whitespace |   0m  0s   | The patch has no whitespace issues.
|  +1  |  shadedjars |   4m  6s   | patch has no errors when building our
|      |             |            | shaded downstream artifacts.
|  +1  | hadoopcheck |   9m 53s   | Patch does not cause any errors with
|      |             |            | Hadoop 2.7.4 or 3.0.0.
|  +1  |    findbugs |   2m 11s   | the patch passed
|  +1  |     javadoc |   0m 29s   | the patch passed
|      |             |            | Other Tests
|  +1  |        unit | 128m 21s   | hbase-server in the patch passed.
|  +1  |  asflicense |   0m 25s   | The patch does not generate ASF License
|      |             |            | warnings.
|      |             | 168m  0s   |


|| Subsystem || Report/Notes ||
============================================================================
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946617/21387.v3.txt |
{code}

> Race condition surrounding in progress snapshot handling in snapshot cache
> leads to loss of snapshot files
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21387
>                 URL: https://issues.apache.org/jira/browse/HBASE-21387
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Major
>              Labels: snapshot
>         Attachments: 21387.v1.txt, 21387.v2.txt, 21387.v3.txt
>
>
> During a recent customer report in which ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2]
> snapshot.SnapshotReferenceUtil: Can't find hfile:
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
> or archive
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
> directory for the primary table.
> {code}
> We found the following in the log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427]
> cleaner.HFileCleaner: Removing:
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s)
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
> if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> This only excludes the temp dir, not in-progress snapshot(s).
> Suppose that when RefreshCacheTask runs refreshCache, the SnapshotDirectoryInfo
> for the in-progress snapshot does not yet include all store files (leaving a
> hole in the cache).
> When SnapshotHFileCleaner later calls getUnreferencedFiles(), it sees that
> lastModifiedTime is up to date, so it does not refresh the cache and proceeds
> to check in-progress snapshot(s). However, the snapshot has completed by that
> time, so the store file(s) missing from the cache are deemed unreferenced and
> removed from the archive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
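The interleaving described in the issue can be replayed with a small, self-contained Java model. This is a hypothetical sketch, not HBase's SnapshotFileCache and not the attached patch: the class SnapshotCacheRaceSketch, the method replayRace and the dirModTime field are illustrative names, and the temp-dir exclusion is omitted. It assumes, as the description states, that the cleaner finds lastModifiedTime unchanged after the snapshot completes, so refreshCache() returns early.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical, simplified model of the HBASE-21387 interleaving.
 * Not HBase's SnapshotFileCache: names and structure are illustrative only.
 */
public class SnapshotCacheRaceSketch {
    // Simulated snapshot directories: snapshot name -> store files written so far.
    private final Map<String, Set<String>> snapshotDirs = new HashMap<>();
    // Snapshots currently considered in progress.
    private final Set<String> inProgress = new HashSet<>();
    // Stand-in for the snapshot directory's lastModifiedTime (assumption).
    private long dirModTime = 0;

    // Cache state built by refreshCache().
    private final Set<String> cachedFiles = new HashSet<>();
    private long cachedModTime = -1;

    /** Rebuilds the cache only if the directory looks modified; in-progress snapshots are not skipped. */
    void refreshCache() {
        if (dirModTime == cachedModTime) {
            return; // cache considered up to date
        }
        cachedFiles.clear();
        for (Set<String> files : snapshotDirs.values()) {
            cachedFiles.addAll(files);
        }
        cachedModTime = dirModTime;
    }

    /** Returns candidates referenced neither by the cache nor by an in-progress snapshot. */
    List<String> getUnreferencedFiles(List<String> candidates) {
        refreshCache();
        List<String> unreferenced = new ArrayList<>();
        for (String file : candidates) {
            if (cachedFiles.contains(file)) {
                continue;
            }
            boolean heldByInProgress = false;
            for (String snapshot : inProgress) {
                if (snapshotDirs.get(snapshot).contains(file)) {
                    heldByInProgress = true;
                }
            }
            if (!heldByInProgress) {
                unreferenced.add(file);
            }
        }
        return unreferenced;
    }

    /** Replays the race and returns the files the cleaner would treat as deletable. */
    static List<String> replayRace() {
        SnapshotCacheRaceSketch s = new SnapshotCacheRaceSketch();
        // 1. Snapshot "snap1" starts and has written only f1 so far.
        s.inProgress.add("snap1");
        s.snapshotDirs.put("snap1", new HashSet<>(List.of("f1")));
        s.dirModTime = 1;
        // 2. RefreshCacheTask caches the partial view {f1}, leaving a hole for f2.
        s.refreshCache();
        // 3. snap1 records f2 and completes; dirModTime is assumed unchanged.
        s.snapshotDirs.get("snap1").add("f2");
        s.inProgress.remove("snap1");
        // 4. Cleaner: cache looks current and no in-progress snapshot holds f2.
        return s.getUnreferencedFiles(List.of("f1", "f2"));
    }

    public static void main(String[] args) {
        System.out.println("Deemed unreferenced: " + replayRace());
    }
}
```

Running main prints "Deemed unreferenced: [f2]": f2 was written after RefreshCacheTask cached the partial view of snap1, and by the time the cleaner runs snap1 is no longer in progress, so nothing protects f2 even though the completed snapshot references it.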