[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

Josh Elser (JIRA) Wed, 21 Nov 2018 10:12:32 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695049#comment-16695049
 ]


Josh Elser commented on HBASE-21387:
------------------------------------

{code:java}
+  public boolean isAnySnapshotTaking() {
+    return this.takingSnapshotLock.getReadHoldCount() > 0 || 
this.snapshotHandlers.size() > 0;
+  }{code}
Looking at this, I almost wonder if, instead of using a ReadWriteLock, some 
Atomic Counter would be more simple? I think it's OK as-is, just thinking 
out-loud.

However, one problem: does this method need to be synchronized? 
{{snapshotHandlers}} is only a {{HashMap}}. Is calling {{.size()}} without 
holding the lock (SnapshotManager's Object monitor) ok to do? Maybe this is ok 
since you check the RWLock before calling {{.size()}}... If you have thought 
through this, I'm ok :)

{code}
+          cleaner.getFileCacheForTesting().triggerCacheRefreshForTesting();
+          Iterable<FileStatus> toDeleteFiles = 
cleaner.getDeletableFiles(files);
+          int size = Lists.newArrayList(toDeleteFiles).size();
+          LOG.info("Size of deletableFiles is: " + size);
+          if (size > 0) {
+            success.set(false);
+          }
{code}

Maybe print the files that were marked as deletable to help with debugging 
should it ever fail?

Running the test locally a few times, but I think this is OK. I think this is 
the right way to go about fixing this problem. Good work!! :)

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21387
>                 URL: https://issues.apache.org/jira/browse/HBASE-21387
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Major
>              Labels: snapshot
>         Attachments: 0001-UT.patch, 21387.dbg.txt, 21387.v10.txt, 
> 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, HBASE-21387.v13.patch, 
> HBASE-21387.v14.patch, HBASE-21387.v15.patch, HBASE-21387.v16.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>       if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

Reply via email to