[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699230#comment-16699230 ]

Josh Elser commented on HBASE-21387:

Never mind, I see HBASE-21511 was filed for this.

> Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
> ----------------------------------------------------------------------
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
> Issue Type: Bug
> Components: snapshots
> Reporter: Ted Yu
> Assignee: Ted Yu
> Priority: Major
> Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 21511.v2.txt, HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During a recent report from a customer, ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table.
> {code}
> We found the following in the log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive
> {code}
> The root cause is a race condition surrounding in progress snapshot(s) handling between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
> if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is some in progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that lastModifiedTime is up to date. So the cleaner proceeds to check in progress snapshot(s). However, the snapshot has completed by that time, resulting in some file(s) being deemed unreferenced.
> Here is the timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a snapshot S1 in progress that is referencing a file F1. refreshCache() is called, but no completed snapshot references F1. At T2, the snapshot S1, which references F1, completes. At T3, we check in-progress snapshots and S1 is not included. Thus, F1 is marked as unreferenced even though S1 references it.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
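The T1 to T3 timeline in the issue description can be replayed deterministically in a toy model. This is a minimal sketch, not HBase's snapshot cache: the class `SnapshotRaceSketch`, its fields, and its methods are hypothetical names invented for illustration, and plain in-memory sets stand in for the completed and in-progress snapshot directories.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the race; every name here is hypothetical, not an HBase API. */
class SnapshotRaceSketch {
    // Files referenced by completed snapshots, as of the last cache refresh.
    private final Set<String> cachedCompletedFiles = new HashSet<>();
    // Stand-ins for the on-disk state of completed and in-progress snapshots.
    private final Set<String> completedOnDisk = new HashSet<>();
    private final Set<String> inProgressOnDisk = new HashSet<>();

    void startSnapshot(String file) {
        inProgressOnDisk.add(file);
    }

    void completeSnapshot(String file) {
        inProgressOnDisk.remove(file);
        completedOnDisk.add(file);
    }

    // Like the refresh described in the issue, this loads completed snapshots only.
    void refreshCache() {
        cachedCompletedFiles.clear();
        cachedCompletedFiles.addAll(completedOnDisk);
    }

    // Racy check: a cache captured earlier, then the live in-progress set.
    boolean isReferencedRacy(String file) {
        return cachedCompletedFiles.contains(file) || inProgressOnDisk.contains(file);
    }

    /** Replays T1..T3 and returns whether F1 is still seen as referenced. */
    static boolean replayTimeline() {
        SnapshotRaceSketch s = new SnapshotRaceSketch();
        s.startSnapshot("F1");      // T1: S1 is in progress and references F1
        s.refreshCache();           // T1: no completed snapshot references F1 yet
        s.completeSnapshot("F1");   // T2: S1 completes, the cache is now stale
        return s.isReferencedRacy("F1"); // T3: S1 is in neither view
    }

    public static void main(String[] args) {
        System.out.println("F1 referenced? " + replayTimeline());
    }
}
```

In this model the two views are read at different times, so a snapshot that completes in between is visible in neither of them. Any fix has to make the two reads consistent with each other, for example by taking both under one lock or by refreshing the cache again after the in-progress check.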
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699220#comment-16699220 ]

Josh Elser commented on HBASE-21387:

bq. We will stop checking unreferenced files if there are snapshot operations in progress, but in the code below we will get snapshots in progress and try to filter out files...

Hrm, maybe I was too quick in my review. You're saying we're doing extra work when there's a snapshot in progress when we could just short-circuit the entire operation?
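The short-circuit idea raised in Josh's comment can be sketched as follows. This is an illustration under stated assumptions, not the committed patch: `ShortCircuitSketch` and `unreferencedFiles` are hypothetical names, and the `snapshotInProgress` flag is assumed to be supplied by the caller in a way that is consistent with the set of completed-snapshot references.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a cleaner check that bails out early. */
class ShortCircuitSketch {
    /**
     * Returns the candidates that no completed snapshot references,
     * or an empty list whenever any snapshot operation is in progress.
     */
    static List<String> unreferencedFiles(List<String> candidates,
                                          Set<String> referencedByCompleted,
                                          boolean snapshotInProgress) {
        if (snapshotInProgress) {
            // Short-circuit: report nothing as deletable, skip all filtering.
            return Collections.emptyList();
        }
        List<String> result = new ArrayList<>();
        for (String file : candidates) {
            if (!referencedByCompleted.contains(file)) {
                result.add(file);
            }
        }
        return result;
    }
}
```

The early return trades cleanup latency for safety: files stay in the archive a little longer, but no per-file filtering against in-progress snapshots is needed. The sketch leaves open how the in-progress state is read without racing against a snapshot that is just completing.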
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698178#comment-16698178 ]

Hudson commented on HBASE-21387:

Results for branch master
[build #627 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/627/]: (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/627//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/627//JDK8_Nightly_Build_Report_(Hadoop2)/]

(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/627//JDK8_Nightly_Build_Report_(Hadoop3)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.

(/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698123#comment-16698123 ]

Hudson commented on HBASE-21387:

Results for branch branch-1
[build #564 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/564/]: (x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/564//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk7 checks{color}
-- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/564//JDK7_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/564//JDK8_Nightly_Build_Report_(Hadoop2)/]

(x) {color:red}-1 source release artifact{color}
-- See build output for details.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698022#comment-16698022 ]

Hudson commented on HBASE-21387:

Results for branch branch-2
[build #1521 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1521/]: (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1521//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1521//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1521//JDK8_Nightly_Build_Report_(Hadoop3)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.

(/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698021#comment-16698021 ]

Hudson commented on HBASE-21387:

Results for branch branch-1.4
[build #560 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/560/]: (x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/560//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk7 checks{color}
-- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/560//JDK7_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/560//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698001#comment-16698001 ]

Hudson commented on HBASE-21387:

Results for branch branch-1.3
[build #552 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/552/]: (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/552//General_Nightly_Build_Report/]

(/) {color:green}+1 jdk7 checks{color}
-- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/552//JDK7_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/552//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697975#comment-16697975 ]

Hudson commented on HBASE-21387:

Results for branch branch-1.2
[build #563 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/563/]: (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/563//General_Nightly_Build_Report/]

(/) {color:green}+1 jdk7 checks{color}
-- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/563//JDK7_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/563//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697872#comment-16697872 ]

Hudson commented on HBASE-21387:

Results for branch branch-2.1 [build #632 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/632/]: (/) *+1 overall*
details (if available):
(/) +1 general checks -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/632//General_Nightly_Build_Report/]
(/) +1 jdk8 hadoop2 checks -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/632//JDK8_Nightly_Build_Report_(Hadoop2)/]
(/) +1 jdk8 hadoop3 checks -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/632//JDK8_Nightly_Build_Report_(Hadoop3)/]
(/) +1 source release artifact -- See build output for details.
(/) +1 client integration test
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697864#comment-16697864 ]

Hudson commented on HBASE-21387:

Results for branch branch-2.0 [build #1110 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1110/]: (/) *+1 overall*
details (if available):
(/) +1 general checks -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1110//General_Nightly_Build_Report/]
(/) +1 jdk8 hadoop2 checks -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1110//JDK8_Nightly_Build_Report_(Hadoop2)/]
(/) +1 jdk8 hadoop3 checks -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1110//JDK8_Nightly_Build_Report_(Hadoop3)/]
(/) +1 source release artifact -- See build output for details.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697728#comment-16697728 ]

Hudson commented on HBASE-21387:

SUCCESS: Integrated in Jenkins build HBase-1.2-IT #1185 (See [https://builds.apache.org/job/HBase-1.2-IT/1185/])
HBASE-21387 Addendum fix TestSnapshotFileCache (zhangduo: rev 96240732bfb4dfa28dc5fe6d445b9551d5ed9814)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697724#comment-16697724 ]

Hudson commented on HBASE-21387:

SUCCESS: Integrated in Jenkins build HBase-1.3-IT #504 (See [https://builds.apache.org/job/HBase-1.3-IT/504/])
HBASE-21387 Addendum fix TestSnapshotFileCache (zhangduo: rev ec7461d2020b0a375eeb9b725eb3202aeed4fb13)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697700#comment-16697700 ]

Duo Zhang commented on HBASE-21387:

+1 on the addendum for now, but I'd say the code is a bit confusing. We will stop checking unreferenced files if there are snapshot operations in progress, but in the code below we will still get the snapshots in progress and try to filter out files...
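
The behaviour discussed for the addendum on this issue (stop checking unreferenced files while a snapshot operation is in progress) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class CleanerGuardSketch, its counter, and its method signatures are hypothetical and are not taken from the committed patch, which may coordinate the cleaner and the snapshot code differently.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: skip the cleaning pass while any snapshot is running.
public class CleanerGuardSketch {
    private int snapshotsInProgress = 0;
    private final Set<String> referencedByCompleted = new HashSet<>();

    // All three methods synchronize on the same monitor, so a snapshot
    // cannot start or finish in the middle of a cleaner check.
    public synchronized void beginSnapshot() {
        snapshotsInProgress++;
    }

    public synchronized void endSnapshot(Collection<String> files) {
        referencedByCompleted.addAll(files);
        snapshotsInProgress--;
    }

    public synchronized List<String> getUnreferencedFiles(List<String> candidates) {
        if (snapshotsInProgress > 0) {
            // Conservative choice: report nothing as deletable this round
            // and let a later cleaner run pick the files up.
            return Collections.emptyList();
        }
        List<String> unreferenced = new ArrayList<>();
        for (String file : candidates) {
            if (!referencedByCompleted.contains(file)) {
                unreferenced.add(file);
            }
        }
        return unreferenced;
    }
}
```

With this shape, a cleaner pass that overlaps a snapshot returns an empty list instead of guessing which files the snapshot will reference, at the cost of delaying cleanup until the next run.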
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697623#comment-16697623 ]

Ted Yu commented on HBASE-21387:

Ran the failed test locally with the addendum applied; it passed.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697558#comment-16697558 ]

Hadoop QA commented on HBASE-21387:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 13s | Docker mode activated. |
| 0 | patch | 0m 3s | The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -0 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 3s | master passed |
| +1 | compile | 1m 48s | master passed |
| +1 | checkstyle | 1m 5s | master passed |
| +1 | shadedjars | 3m 44s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 1m 56s | master passed |
| +1 | javadoc | 0m 30s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 5s | the patch passed |
| +1 | compile | 1m 47s | the patch passed |
| +1 | javac | 1m 47s | the patch passed |
| +1 | checkstyle | 1m 5s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 49s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 8m 19s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 6s | the patch passed |
| +1 | javadoc | 0m 30s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 134m 7s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 28s | The patch does not generate ASF License warnings. |
| | | 170m 8s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.client.TestMobRestoreSnapshotFromClientAfterSplittingRegions |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12949323/21387.addendum.txt |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux cb172c1a4402 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 6d0dc960e6 |
| maven | version: Apache Maven 3.5.4
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697520#comment-16697520 ]

Ted Yu commented on HBASE-21387:

TestSnapshotFileCache fails across branches.
https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1987/testReport/junit/org.apache.hadoop.hbase.master.snapshot/TestSnapshotFileCache/

One condition in SnapshotFileCache was incorrect, resulting in the following being logged repeatedly.
{code}
LOG.warn("Not checking unreferenced files since snapshot is running, it will "
    + "skip to clean the HFiles this time");
{code}
With the addendum, the test passes.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697080#comment-16697080 ]

Hudson commented on HBASE-21387:

Results for branch branch-2.1 [build #629 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/629/]: (x) *-1 overall*
details (if available):
(/) +1 general checks -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/629//General_Nightly_Build_Report/]
(x) -1 jdk8 hadoop2 checks -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/629//JDK8_Nightly_Build_Report_(Hadoop2)/]
(x) -1 jdk8 hadoop3 checks -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/629//JDK8_Nightly_Build_Report_(Hadoop3)/]
(/) +1 source release artifact -- See build output for details.
(/) +1 client integration test
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696884#comment-16696884 ] Hudson commented on HBASE-21387: Results for branch master [build #624 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/624/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/624//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/624//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/624//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. 
(/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696705#comment-16696705 ] Hudson commented on HBASE-21387: Results for branch branch-2 [build #1519 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1519/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1519//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1519//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1519//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. 
(/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696644#comment-16696644 ] Hudson commented on HBASE-21387: Results for branch branch-1 [build #561 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/561/]: (x) *{color:red}-1 overall{color}* details (if available): (x) {color:red}-1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/561//General_Nightly_Build_Report/] (x) {color:red}-1 jdk7 checks{color} -- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/561//JDK7_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/561//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 source release artifact{color} -- See build output for details.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696640#comment-16696640 ] Hudson commented on HBASE-21387: Results for branch branch-2.0 [build #1107 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1107/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1107//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1107//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1107//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. 
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696603#comment-16696603 ] Hudson commented on HBASE-21387: Results for branch branch-1.4 [build #558 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/558/]: (x) *{color:red}-1 overall{color}* details (if available): (x) {color:red}-1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/558//General_Nightly_Build_Report/] (x) {color:red}-1 jdk7 checks{color} -- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/558//JDK7_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/558//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. 
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696569#comment-16696569 ] Hudson commented on HBASE-21387: Results for branch branch-1.3 [build #550 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/550/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/550//General_Nightly_Build_Report/] (x) {color:red}-1 jdk7 checks{color} -- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/550//JDK7_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/550//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. 
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696573#comment-16696573 ] Hudson commented on HBASE-21387: Results for branch branch-1.2 [build #561 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/561/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/561//General_Nightly_Build_Report/] (x) {color:red}-1 jdk7 checks{color} -- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/561//JDK7_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/561//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. 
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696375#comment-16696375 ] Hudson commented on HBASE-21387: SUCCESS: Integrated in Jenkins build HBase-1.2-IT #1184 (See [https://builds.apache.org/job/HBase-1.2-IT/1184/]) HBASE-21387 Race condition surrounding in progress snapshot handling in (openinx: rev d9693394f6ff7beb99c64f2dd188aa1bdff00f18) * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/HFileCleaner.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/EnabledTableSnapshotHandler.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/DisabledTableSnapshotHandler.java * (add) hbase-server/src/test/java/org/apache/hadoop/hbase/snapshot/TestSnapshotWhenChoreCleaning.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/BaseFileCleanerDelegate.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotLogCleaner.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/CleanerChore.java * (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/cleaner/TestSnapshotFromMaster.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotHFileCleaner.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java * (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/snapshot/TestSnapshotFileCache.java * (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/snapshot/TestSnapshotHFileCleaner.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotManager.java * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/TakeSnapshotHandler.java * (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/FileCleanerDelegate.java
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696372#comment-16696372 ]

Zheng Hu commented on HBASE-21387:

Pushed to branch-2, branch-2.0, branch-2.1, master, branch-1, branch-1.2, branch-1.3, and branch-1.4. Thanks, all.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696374#comment-16696374 ]

Hudson commented on HBASE-21387:

SUCCESS: Integrated in Jenkins build HBase-1.3-IT #503 (See [https://builds.apache.org/job/HBase-1.3-IT/503/])
HBASE-21387 Race condition surrounding in progress snapshot handling in (openinx: rev 53b5abc6513ac7f716b8011a02685ae0f2dc3eef)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/FileCleanerDelegate.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/EnabledTableSnapshotHandler.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/TakeSnapshotHandler.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/DisabledTableSnapshotHandler.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotHFileCleaner.java
* (add) hbase-server/src/test/java/org/apache/hadoop/hbase/snapshot/TestSnapshotWhenChoreCleaning.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/BaseFileCleanerDelegate.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/CleanerChore.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/snapshot/TestSnapshotFileCache.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/master/snapshot/TestSnapshotHFileCleaner.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/HFileCleaner.java
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696320#comment-16696320 ]

Zheng Hu commented on HBASE-21387:

Thanks [~elserj] and [~yuzhih...@gmail.com] for the careful review. I will commit (and fix the minor issue) if there are no other concerns.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696227#comment-16696227 ]

Ted Yu commented on HBASE-21387:

Looks like catching FileNotFoundException is not enough to pass the new test. Let's go with v17.

{code}
+LOG.debug("toDeleteFiles[{}] is: " + deletableFiles.get(i));
{code}

Minor: it looks like you intended to log both the index and the FileStatus, but only one argument was passed above.
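To illustrate the logging remark in the comment above: with `{}`-style placeholders, a message built by string concatenation leaves the placeholder unfilled. The sketch below is self-contained and does not use the real logging library; `PlaceholderDemo` and its `format` helper are hypothetical stand-ins that only mimic in-order `{}` substitution, and the file name `archive/f1` is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class PlaceholderDemo {
    // Hypothetical stand-in for a "{}"-style formatter: each "{}" is replaced,
    // in order, by the next argument. Placeholders without a matching argument
    // are left in the output unchanged.
    public static String format(String pattern, Object... args) {
        StringBuilder out = new StringBuilder();
        int from = 0;
        int argIndex = 0;
        while (true) {
            int at = pattern.indexOf("{}", from);
            if (at < 0 || argIndex >= args.length) {
                out.append(pattern.substring(from));
                return out.toString();
            }
            out.append(pattern, from, at).append(args[argIndex++]);
            from = at + 2;
        }
    }

    public static void main(String[] args) {
        List<String> deletableFiles = new ArrayList<>();
        deletableFiles.add("archive/f1"); // made-up entry for illustration
        int i = 0;

        // Shape of the line quoted in the review: the value is concatenated,
        // so no argument is available to fill the "{}" placeholder.
        System.out.println(format("toDeleteFiles[{}] is: " + deletableFiles.get(i)));

        // Likely intent: pass both the index and the element as arguments.
        System.out.println(format("toDeleteFiles[{}] is: {}", i, deletableFiles.get(i)));
    }
}
```

Passing the index and the element as separate arguments also avoids building the string when debug logging is disabled, which is the usual reason to prefer placeholders over concatenation.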
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696208#comment-16696208 ]

Hadoop QA commented on HBASE-21387:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 14s | Docker mode activated. |
| 0 | patch | 0m 2s | The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 47s | master passed |
| +1 | compile | 1m 57s | master passed |
| +1 | checkstyle | 1m 16s | master passed |
| +1 | shadedjars | 4m 15s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 17s | master passed |
| +1 | javadoc | 0m 35s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 55s | the patch passed |
| +1 | compile | 2m 7s | the patch passed |
| +1 | javac | 2m 7s | the patch passed |
| +1 | checkstyle | 1m 17s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 24s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 9m 19s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 8s | the patch passed |
| +1 | javadoc | 0m 30s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 124m 30s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 21s | The patch does not generate ASF License warnings. |
| | | 165m 21s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.snapshot.TestSnapshotWhenChoreCleaning |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12949212/21387-suggest.txt |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 6ff82b279620 4.4.0-134-generic #160~14.04.1-Ubuntu SMP Fri Aug 17 11:07:07 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 5cc845b713 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| unit |
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695873#comment-16695873 ]

Hadoop QA commented on HBASE-21387:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 17s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | findbugs | 0m 0s | Findbugs executables are not available. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 4 new or modified test files. |
|| || || || branch-1.2 Compile Tests ||
| +1 | mvninstall | 9m 4s | branch-1.2 passed |
| +1 | compile | 0m 32s | branch-1.2 passed with JDK v1.8.0_192 |
| +1 | compile | 0m 34s | branch-1.2 passed with JDK v1.7.0_201 |
| +1 | checkstyle | 1m 18s | branch-1.2 passed |
| +1 | shadedjars | 2m 16s | branch has no errors when building our shaded downstream artifacts. |
| +1 | javadoc | 0m 30s | branch-1.2 passed with JDK v1.8.0_192 |
| +1 | javadoc | 0m 31s | branch-1.2 passed with JDK v1.7.0_201 |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 23s | the patch passed |
| +1 | compile | 0m 31s | the patch passed with JDK v1.8.0_192 |
| +1 | javac | 0m 31s | the patch passed |
| +1 | compile | 0m 34s | the patch passed with JDK v1.7.0_201 |
| +1 | javac | 0m 34s | the patch passed |
| -1 | checkstyle | 1m 11s | hbase-server: The patch generated 6 new + 242 unchanged - 2 fixed = 248 total (was 244) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 2m 13s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 8m 49s | Patch does not cause any errors with Hadoop 2.4.1 2.5.2 2.6.5 2.7.4. |
| +1 | javadoc | 0m 23s | the patch passed with JDK v1.8.0_192 |
| +1 | javadoc | 0m 31s | the patch passed with JDK v1.7.0_201 |
|| || || || Other Tests ||
| +1 | unit | 75m 59s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 15s | The patch does not generate ASF License warnings. |
| | | 107m 20s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:34a9b27 |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12949178/HBASE-21387.branch-1.2.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 9ffddac44f9b 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695767#comment-16695767 ]

Zheng Hu commented on HBASE-21387:

HBASE-16490 was not applied to branch-1.3 and branch-1.2, and branch-1.2 differs from branch-1.3, so we need three patches for branch-1/branch-1.4, branch-1.3, and branch-1.2:
* HBASE-21387.branch-1.patch can be applied to branch-1 and branch-1.4
* HBASE-21387.branch-1.3.patch can be applied to branch-1.3
* HBASE-21387.branch-1.2.patch can be applied to branch-1.2

Let's wait for the Hadoop QA run.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695598#comment-16695598 ]

Hadoop QA commented on HBASE-21387:

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 13s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 20s | master passed |
| +1 | compile | 1m 47s | master passed |
| +1 | checkstyle | 1m 13s | master passed |
| +1 | shadedjars | 3m 59s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 1m 57s | master passed |
| +1 | javadoc | 0m 30s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 15s | the patch passed |
| +1 | compile | 1m 46s | the patch passed |
| +1 | javac | 1m 46s | the patch passed |
| +1 | checkstyle | 1m 12s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 56s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 8m 54s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 4s | the patch passed |
| +1 | javadoc | 0m 30s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 129m 5s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 22s | The patch does not generate ASF License warnings. |
| | | 166m 33s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12949140/HBASE-21387.v17.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 2145e47967f3 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 5cc845b713 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/15089/testReport/ |
| Max. process+thread count | 4751 (vs. ulimit of 1) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/15089/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695529#comment-16695529 ]

Zheng Hu commented on HBASE-21387:
--

bq. does this method need to be synchronized?

I think you are right, it should be synchronized. Uploaded patch.v17 to address this and [~yuzhih...@gmail.com]'s comment.

> Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
> Issue Type: Bug
> Reporter: Ted Yu
> Assignee: Ted Yu
> Priority: Major
> Labels: snapshot
> Attachments: 0001-UT.patch, 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, HBASE-21387.v13.patch, HBASE-21387.v14.patch, HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
> During a recent report from a customer, ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table.
> {code}
> We found the following in the log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive
> {code}
> The root cause is a race condition surrounding in-progress snapshot handling between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
> if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is an in-progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that lastModifiedTime is up to date, so the cleaner proceeds to check in-progress snapshot(s). However, the snapshot has completed by that time, resulting in some file(s) being deemed unreferenced.
> Here is a timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a snapshot S1 in progress that is referencing a file F1. refreshCache() is called, but no completed snapshot references F1. At T2, the snapshot S1, which references F1, completes. At T3, we check in-progress snapshots and S1 is not included. Thus, F1 is marked as unreferenced even though S1 references it.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
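The T1..T3 timeline in the issue description can be replayed deterministically with a toy model. The sketch below is only an illustration under simplifying assumptions: `ToySnapshotFileCache` and `RaceTimeline` are hypothetical names, not HBase classes, and each snapshot is reduced to the single file it references. It shows how a cache of completed snapshots taken at T1, combined with a fresh look at in-progress snapshots at T3, can miss a snapshot that completed in between.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy stand-in for the snapshot file cache; hypothetical, not the HBase class. */
class ToySnapshotFileCache {
  // What "disk" holds right now: files referenced by completed and in-progress snapshots.
  private final Set<String> completedOnDisk = new HashSet<>();
  private final Set<String> inProgressOnDisk = new HashSet<>();
  // What the cache remembered at the last refresh (completed snapshots only).
  private final Set<String> cachedCompleted = new HashSet<>();

  void startSnapshot(String file) {
    inProgressOnDisk.add(file);
  }

  void completeSnapshot(String file) {
    inProgressOnDisk.remove(file);
    completedOnDisk.add(file);
  }

  /** Mirrors the described behaviour: in-progress snapshots are excluded from the cache. */
  void refreshCache() {
    cachedCompleted.clear();
    cachedCompleted.addAll(completedOnDisk);
  }

  /** Combines the possibly stale cache with a fresh look at in-progress snapshots. */
  boolean looksUnreferenced(String file) {
    return !cachedCompleted.contains(file) && !inProgressOnDisk.contains(file);
  }
}

class RaceTimeline {
  /** Replays T1..T3 and reports whether F1 is wrongly seen as unreferenced. */
  static boolean replay() {
    ToySnapshotFileCache cache = new ToySnapshotFileCache();
    cache.startSnapshot("F1");     // S1 is in progress and references F1
    cache.refreshCache();          // T1: no completed snapshot references F1
    cache.completeSnapshot("F1");  // T2: S1 completes
    return cache.looksUnreferenced("F1"); // T3: S1 is no longer "in progress"
  }
}
```

In this model a second `refreshCache()` after T2 makes F1 referenced again, which is why the outcome depends on when the cleaner runs relative to the refresh.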
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695128#comment-16695128 ]

Josh Elser commented on HBASE-21387:

bq. Running the test locally a few times, but I think this is OK. I think this is the right way to go about fixing this problem. Good work!!

Passed in a loop on my machine. LGTM, [~openinx]!
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695049#comment-16695049 ]

Josh Elser commented on HBASE-21387:

{code:java}
+  public boolean isAnySnapshotTaking() {
+    return this.takingSnapshotLock.getReadHoldCount() > 0 || this.snapshotHandlers.size() > 0;
+  }
{code}

Looking at this, I almost wonder if, instead of using a ReadWriteLock, some Atomic Counter would be more simple? I think it's OK as-is, just thinking out-loud.

However, one problem: does this method need to be synchronized? {{snapshotHandlers}} is only a {{HashMap}}. Is calling {{.size()}} without holding the lock (SnapshotManager's Object monitor) ok to do? Maybe this is ok since you check the RWLock before calling {{.size()}}... If you have thought through this, I'm ok :)

{code}
+      cleaner.getFileCacheForTesting().triggerCacheRefreshForTesting();
+      Iterable toDeleteFiles = cleaner.getDeletableFiles(files);
+      int size = Lists.newArrayList(toDeleteFiles).size();
+      LOG.info("Size of deletableFiles is: " + size);
+      if (size > 0) {
+        success.set(false);
+      }
{code}

Maybe print the files that were marked as deletable to help with debugging should it ever fail?

Running the test locally a few times, but I think this is OK. I think this is the right way to go about fixing this problem. Good work!! :)
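The atomic-counter idea raised in the review above could look roughly like the sketch below. It is a hedged illustration, not the patch: `SnapshotInFlightCounter` and `ReadHoldCountDemo` are hypothetical names that do not exist in HBase. The second class demonstrates a documented JDK detail that may matter when reading `isAnySnapshotTaking()`: `ReentrantReadWriteLock.getReadHoldCount()` counts read holds of the calling thread only, whereas `getReadLockCount()` counts read holds across all threads. Which thread calls the method in the actual patch is not shown in this thread, so treat this as something to verify rather than a finding.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical counter-based tracker for snapshots that are currently being taken. */
class SnapshotInFlightCounter {
  private final AtomicInteger inFlight = new AtomicInteger();

  void begin() {
    inFlight.incrementAndGet();
  }

  /** Call from a finally block so a failed snapshot still decrements the count. */
  void end() {
    inFlight.decrementAndGet();
  }

  boolean isAnySnapshotTaking() {
    return inFlight.get() > 0;
  }
}

class ReadHoldCountDemo {
  /**
   * Returns {getReadHoldCount(), getReadLockCount()} as observed by the calling
   * thread while a different thread holds the read lock.
   */
  static int[] countsSeenByOtherThread() {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    CountDownLatch held = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);
    Thread snapshotTaker = new Thread(() -> {
      lock.readLock().lock();
      try {
        held.countDown();   // signal that the read lock is now held
        release.await();    // keep holding it until the observer has looked
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        lock.readLock().unlock();
      }
    });
    snapshotTaker.start();
    try {
      held.await();
      // Per-thread hold count versus lock-wide count.
      int[] counts = {lock.getReadHoldCount(), lock.getReadLockCount()};
      release.countDown();
      snapshotTaker.join();
      return counts;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }
}
```

A counter avoids relying on lock introspection, but it only stays correct if every `begin()` is paired with an `end()`, including on failure paths.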
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694979#comment-16694979 ]

Josh Elser commented on HBASE-21387:

Thanks for the ping, Zheng. Taking a look at this now :)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694766#comment-16694766 ]

Zheng Hu commented on HBASE-21387:
--

[~yuzhih...@gmail.com], any other concerns? As for the name refactoring and the typos, I'll fix those when committing.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694577#comment-16694577 ]

Hadoop QA commented on HBASE-21387:
---

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 3m 55s | master passed |
| +1 | compile | 1m 42s | master passed |
| +1 | checkstyle | 1m 1s | master passed |
| +1 | shadedjars | 3m 39s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 1m 54s | master passed |
| +1 | javadoc | 0m 27s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 3m 48s | the patch passed |
| +1 | compile | 1m 45s | the patch passed |
| +1 | javac | 1m 45s | the patch passed |
| +1 | checkstyle | 1m 7s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 39s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 7m 59s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 5s | the patch passed |
| +1 | javadoc | 0m 26s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 128m 36s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 19s | The patch does not generate ASF License warnings. |
| | | 163m 1s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12949008/HBASE-21387.v16.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 9a043d0a42da 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 5ded294419 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/15084/testReport/ |
| Max. process+thread count | 5181 (vs. ulimit of 1) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/15084/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694397#comment-16694397 ]

Zheng Hu commented on HBASE-21387:
--

Patch.v16 fixes the checkstyle issues. [~yuzhih...@gmail.com] & [~elserj], the Hadoop QA looks good now; please take a look at the patch.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694372#comment-16694372 ]

Hadoop QA commented on HBASE-21387:
---

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 3m 59s | master passed |
| +1 | compile | 1m 47s | master passed |
| +1 | checkstyle | 1m 6s | master passed |
| +1 | shadedjars | 3m 49s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 1m 54s | master passed |
| +1 | javadoc | 0m 31s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 0s | the patch passed |
| +1 | compile | 1m 48s | the patch passed |
| +1 | javac | 1m 48s | the patch passed |
| -1 | checkstyle | 1m 7s | hbase-server: The patch generated 4 new + 48 unchanged - 0 fixed = 52 total (was 48) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 48s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 8m 19s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 2s | the patch passed |
| +1 | javadoc | 0m 30s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 123m 10s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 28s | The patch does not generate ASF License warnings. |
| | | 159m 1s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12948989/HBASE-21387.v15.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 5b1722d8ef67 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 5ded294419 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/15083/artifact/patchprocess/diff-checkstyle-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/15083/testReport/ |
| Max. process+thread count | 4883 (vs. ulimit of 1) |
| modules | C: hbase-server U: hbase-server |
| Console output |
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694136#comment-16694136 ]

Zheng Hu commented on HBASE-21387:

[~elserj], I think those test failures are unrelated to patch.v14. They are long-running tests that time out easily. You can see the Hadoop QA results in HBASE-21498, which show the problem too.

> Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
> Issue Type: Bug
> Reporter: Ted Yu
> Assignee: Ted Yu
> Priority: Major
> Labels: snapshot
> Attachments: 0001-UT.patch, 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, HBASE-21387.v13.patch, HBASE-21387.v14.patch, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table.
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
> if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that lastModifiedTime is up to date. So cleaner proceeds to check in progress snapshot(s). However, the snapshot has completed by that time, resulting in some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a snapshot S1 in progress that is referencing a file F1. refreshCache() is called, but no completed snapshot references F1. At T2, the snapshot S1, which references F1, completes. At T3, we check in-progress snapshots and S1 is not included. Thus, F1 is marked as unreferenced even though S1 references it.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
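
The timeline in the quoted description can be reproduced with a small toy model. The sketch below is not HBase code and is not the patch attached to this issue: the class `SnapshotRaceSketch`, its fields, and its methods are hypothetical names invented for illustration. It assumes the cleaner first refreshes its view of completed snapshots and only afterwards looks at in-progress snapshots, and it shows one possible way to close the window (holding a lock that snapshot completion also needs).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of the race; every name here is hypothetical, not an HBase API. */
final class SnapshotRaceSketch {
    // Files referenced by snapshots that are still being taken.
    private final Set<String> inProgressRefs = new HashSet<>();
    // Files referenced by completed snapshots.
    private final Set<String> completedRefs = new HashSet<>();
    // The cleaner's cached copy of completedRefs.
    private final Set<String> cache = new HashSet<>();
    // Assumed guard shared by snapshot transitions and the guarded check.
    private final ReentrantLock snapshotLock = new ReentrantLock();

    void startSnapshot(String file) {
        snapshotLock.lock();
        try {
            inProgressRefs.add(file);
        } finally {
            snapshotLock.unlock();
        }
    }

    void completeSnapshot(String file) {
        snapshotLock.lock();
        try {
            inProgressRefs.remove(file);
            completedRefs.add(file);
        } finally {
            snapshotLock.unlock();
        }
    }

    private void refreshCache() {
        cache.clear();
        cache.addAll(completedRefs);
    }

    /**
     * Unguarded check. betweenSteps stands in for whatever happens between
     * the cache refresh (T1) and the in-progress check (T3).
     */
    boolean isUnreferencedRacy(String file, Runnable betweenSteps) {
        refreshCache();                                    // T1: S1 not completed yet
        boolean inCache = cache.contains(file);
        betweenSteps.run();                                // T2: S1 may complete here
        boolean inProgress = inProgressRefs.contains(file); // T3: S1 no longer listed
        return !inCache && !inProgress;
    }

    /** Guarded check: a snapshot cannot change state between the two lookups. */
    boolean isUnreferencedGuarded(String file) {
        snapshotLock.lock();
        try {
            refreshCache();
            return !cache.contains(file) && !inProgressRefs.contains(file);
        } finally {
            snapshotLock.unlock();
        }
    }

    public static void main(String[] args) {
        SnapshotRaceSketch racy = new SnapshotRaceSketch();
        racy.startSnapshot("F1");
        // S1 completes between the two steps, so F1 is wrongly reported as unreferenced.
        System.out.println(racy.isUnreferencedRacy("F1", () -> racy.completeSnapshot("F1")));

        SnapshotRaceSketch guarded = new SnapshotRaceSketch();
        guarded.startSnapshot("F1");
        System.out.println(guarded.isUnreferencedGuarded("F1"));
    }
}
```

In the unguarded variant the file is missed by both lookups even though a snapshot references it the whole time. The guarded variant trades that for contention between the cleaner and snapshot operations; whether that trade-off suits the real cleaner is a design question for the patch discussion.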
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693823#comment-16693823 ]

Josh Elser commented on HBASE-21387:

In v14, it looks like those tests are failing because we're saturating the call queues. Maybe something is happening that is preventing the count from being decremented and this backs up a bunch of Snapshot retry RPCs? Just a guess.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693405#comment-16693405 ]

Hadoop QA commented on HBASE-21387:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 17s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 3m 56s | master passed |
| +1 | compile | 1m 47s | master passed |
| +1 | checkstyle | 1m 5s | master passed |
| +1 | shadedjars | 3m 47s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 1m 57s | master passed |
| +1 | javadoc | 0m 30s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 8s | the patch passed |
| +1 | compile | 1m 49s | the patch passed |
| +1 | javac | 1m 49s | the patch passed |
| +1 | checkstyle | 1m 3s | the patch passed |
| +1 | whitespace | 0m 1s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 45s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 8m 22s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 7s | the patch passed |
| +1 | javadoc | 0m 29s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 244m 3s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 26s | The patch does not generate ASF License warnings. |
| | | 280m 4s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.coprocessor.TestMetaTableMetrics |
| | hadoop.hbase.replication.TestReplicationKillSlaveRSWithSeparateOldWALs |
| | hadoop.hbase.client.TestMobRestoreSnapshotFromClientAfterSplittingRegions |
| | hadoop.hbase.client.TestCloneSnapshotFromClientAfterSplittingRegion |
| | hadoop.hbase.client.TestRestoreSnapshotFromClientWithRegionReplicas |
| | hadoop.hbase.client.TestMobCloneSnapshotFromClientAfterSplittingRegion |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12948847/HBASE-21387.v14.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 1ac9bec54c5b 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 405bf5e638 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693220#comment-16693220 ]

Hadoop QA commented on HBASE-21387:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 3m 57s | master passed |
| +1 | compile | 1m 48s | master passed |
| +1 | checkstyle | 1m 5s | master passed |
| +1 | shadedjars | 3m 47s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 1m 55s | master passed |
| +1 | javadoc | 0m 30s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 3m 58s | the patch passed |
| +1 | compile | 1m 49s | the patch passed |
| +1 | javac | 1m 49s | the patch passed |
| +1 | checkstyle | 1m 4s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 46s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 8m 22s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 1s | the patch passed |
| +1 | javadoc | 0m 30s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 124m 9s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 23s | The patch does not generate ASF License warnings. |
| | | 159m 48s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.snapshot.TestSnapshotWhenChoreCleaning |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12948839/HBASE-21387.v13.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 03f1405448eb 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 405bf5e638 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/15076/artifact/patchprocess/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/15076/testReport/ |
| Max. process+thread count | 5084 (vs. ulimit of 1) |
| modules | C: hbase-server U: hbase-server |
| Console
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693142#comment-16693142 ]

Ted Yu commented on HBASE-21387:

I was aware of the above JIRA. Thanks for the unit test, Zheng.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693127#comment-16693127 ]

Zheng Hu commented on HBASE-21387:

HBASE-16490 tried to fix the race condition, but the race still happens, as this issue shows. So I linked the two issues.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693068#comment-16693068 ]

Zheng Hu commented on HBASE-21387:

Patch v14 makes this UT pass now, but I think it is not simple enough; I will try to simplify the patch.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693027#comment-16693027 ]

Zheng Hu commented on HBASE-21387:

Attached patch.v13 with a UT to reproduce this bug. I think there is still a bug in [~yuzhih...@gmail.com]'s patch.v12, because when I run the UT, I see the following log:
{code}
2018-11-20 18:24:13,206 WARN [Thread-2846] snapshot.SnapshotFileCache(303): delete the hdfs://localhost:34113/user/openinx/test-data/4affaf35-9f36-d66a-606e-0bdc5c6a69a8/.hbase-snapshot/.tmp/snapshotName_0 due to exception:
java.io.FileNotFoundException: File does not exist: /user/openinx/test-data/4affaf35-9f36-d66a-606e-0bdc5c6a69a8/.hbase-snapshot/.tmp/snapshotName_0/.snapshotinfo
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1847)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1819)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1733)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:366)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2211)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1237)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1222)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1210)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:264)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1535)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:303)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:299)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
	at org.apache.hadoop.hbase.snapshot.SnapshotDescriptionUtils.readSnapshotInfo(SnapshotDescriptionUtils.java:381)
	at org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.getHFileNames(SnapshotReferenceUtil.java:330)
	at org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner$1.filesUnderSnapshot(SnapshotHFileCleaner.java:102)
	at org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.getSnapshotsInProgress(SnapshotFileCache.java:297)
	at org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.getUnreferencedFiles(SnapshotFileCache.java:203)
	at org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner.getDeletableFiles(SnapshotHFileCleaner.java:68)
	at org.apache.hadoop.hbase.snapshot.TestSnapshotWhenChoreCleaning.lambda$testSnapshotWhenSnapshotHFileCleanerRunning$2(TestSnapshotWhenChoreCleaning.java:185)
	at java.lang.Thread.run(Thread.java:745)
{code}
which means the cleaner is still running even while a snapshot is being taken, I
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692153#comment-16692153 ]

Hadoop QA commented on HBASE-21387:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 14s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 32s | master passed |
| +1 | compile | 2m 3s | master passed |
| +1 | checkstyle | 1m 21s | master passed |
| +1 | shadedjars | 4m 29s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 16s | master passed |
| +1 | javadoc | 0m 34s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 21s | the patch passed |
| +1 | compile | 1m 51s | the patch passed |
| +1 | javac | 1m 51s | the patch passed |
| -1 | checkstyle | 1m 12s | hbase-server: The patch generated 3 new + 0 unchanged - 0 fixed = 3 total (was 0) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 10s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 9m 38s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 21s | the patch passed |
| +1 | javadoc | 0m 31s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 151m 6s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 27s | The patch does not generate ASF License warnings. |
| | | 191m 38s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.snapshot.TestSnapshotWhenChoreCleaning |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12948746/0001-UT.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 429f320e1ba8 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 83dc38a1df |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/15072/artifact/patchprocess/diff-checkstyle-hbase-server.txt |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/15072/artifact/patchprocess/patch-unit-hbase-server.txt |
| Test Results |
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691920#comment-16691920 ] Zheng Hu commented on HBASE-21387: -- I uploaded draft of the UT, need some time to run correctly with this patch. > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, > 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, > 21387.v8.txt, 21387.v9.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, > two-pass-cleaner.v9.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). 
> Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. > Here is timeline given by Josh illustrating the scenario: > At time T0, we are checking if F1 is referenced. At time T1, there is a > snapshot S1 in progress that is referencing a file F1. refreshCache() is > called, but no completed snapshot references F1. At T2, the snapshot S1, > which references F1, completes. At T3, we check in-progress snapshots and S1 > is not included. Thus, F1 is marked as unreferenced even though S1 references > it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
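The timeline quoted above can be replayed deterministically in a toy model. The following is only a sketch under stated assumptions: the class SnapshotRaceSketch and all of its fields and methods are hypothetical stand-ins, not HBase classes, and the sets simply model "files referenced by completed snapshots", "files referenced by in-progress snapshots", and the cached view of the former. It shows why checking a stale cache and then a live in-progress view can miss a file whose snapshot completed in between.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the described interleaving; none of these names are HBase APIs. */
final class SnapshotRaceSketch {
    // Files referenced by completed snapshots (stand-in for the on-disk state).
    private final Set<String> completedOnDisk = new HashSet<>();
    // Files referenced by in-progress snapshots (stand-in for the tmp directory).
    private final Set<String> inProgressOnDisk = new HashSet<>();
    // Cached copy of completedOnDisk, only updated by refreshCache().
    private final Set<String> cache = new HashSet<>();

    void startSnapshot(String file) {
        inProgressOnDisk.add(file);
    }

    void completeSnapshot(String file) {
        inProgressOnDisk.remove(file);
        completedOnDisk.add(file);
    }

    /** Loads completed snapshots only, mirroring the exclusion of in-progress ones. */
    void refreshCache() {
        cache.clear();
        cache.addAll(completedOnDisk);
    }

    /** Trusts the cache as up to date, then consults the live in-progress view. */
    boolean isReferenced(String file) {
        return cache.contains(file) || inProgressOnDisk.contains(file);
    }

    /**
     * Replays the timeline. Returns true if F1 ends up looking unreferenced.
     * completeBeforeCheck == true models T2 (completion) happening before T3 (check).
     */
    static boolean replay(boolean completeBeforeCheck) {
        SnapshotRaceSketch s = new SnapshotRaceSketch();
        s.startSnapshot("F1");          // T1: S1 is in progress and references F1
        s.refreshCache();               // T1: no completed snapshot references F1 yet
        if (completeBeforeCheck) {
            s.completeSnapshot("F1");   // T2: S1 completes after the refresh
        }
        return !s.isReferenced("F1");   // T3: cache check plus in-progress check
    }

    public static void main(String[] args) {
        System.out.println("race interleaving, F1 unreferenced: " + replay(true));
        System.out.println("no completion in between, F1 unreferenced: " + replay(false));
    }
}
```

In this model the file is only misjudged when the snapshot completes between the cache refresh and the in-progress check, which is the window the issue description points at. The model says nothing about how the attached patches close that window.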
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691174#comment-16691174 ] Zheng Hu commented on HBASE-21387: -- bq. If you have bandwidth, you can give it a try. Fine, Will upload the patch today. Thanks.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689656#comment-16689656 ] Ted Yu commented on HBASE-21387: Haven't got around to adding new unit test (without introducing extra synchronization primitive in snapshot classes). Zheng: If you have bandwidth, you can give it a try. Thanks
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688996#comment-16688996 ] Zheng Hu commented on HBASE-21387: -- So we can control the region count smaller? 10^4 is just an example, no need to be exactly 10^4
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688926#comment-16688926 ] Ted Yu commented on HBASE-21387: Looking at Zheng's suggestion for new unit test, bq. another thread to invoke deleteFiles =SnapshotHFileCleaner#getDeletableFiles; Since the in progress snapshot is really long, getUnreferencedFiles(Iterable, SnapshotManager) may detect the in progress snapshot and miss the race condition described in the description. Also, I have never seen unit test creating 10K hfiles.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684628#comment-16684628 ] Zheng Hu commented on HBASE-21387: -- Yeah, a UT and two minor fix.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684625#comment-16684625 ] Ted Yu commented on HBASE-21387: The only remaining comment was about a new test, right? Not sure when I can get to it this week.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684622#comment-16684622 ] Zheng Hu commented on HBASE-21387: -- Hi [~yuzhih...@gmail.com], I left some comments in RB, If you don't have time to address those. Maybe I can help to finish
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684373#comment-16684373 ] Ted Yu commented on HBASE-21387: TestBlockEvictionFromClient failure was unrelated to patch.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684367#comment-16684367 ] Hadoop QA commented on HBASE-21387: --- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s{color} | {color:blue} Docker mode activated. {color} | | {color:blue}0{color} | {color:blue} patch {color} | {color:blue} 0m 2s{color} | {color:blue} The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:orange}-0{color} | {color:orange} test4tests {color} | {color:orange} 0m 0s{color} | {color:orange} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 21s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 47s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 11s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 3m 56s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 54s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 3m 53s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 8m 53s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}135m 11s{color} | {color:red} hbase-server in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}172m 12s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hbase.client.TestBlockEvictionFromClient | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b | | JIRA Issue | HBASE-21387 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12947868/21387.v12.txt | | Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile | | uname | Linux 4903c8e38c78 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh | | git revision | master / 362b5dd259 | | maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe;
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684197#comment-16684197 ]
Ted Yu commented on HBASE-21387:
TestSnapshotFileCache failed due to the NPE that findbugs pointed out. The TestSaslFanOutOneBlockAsyncDFSOutput failure was caused by a port already in use and is not related to the patch.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684117#comment-16684117 ]
Hadoop QA commented on HBASE-21387:
---
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s{color} | {color:blue} Docker mode activated. {color} |
| {color:blue}0{color} | {color:blue} patch {color} | {color:blue} 0m 3s{color} | {color:blue} The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:orange}-0{color} | {color:orange} test4tests {color} | {color:orange} 0m 0s{color} | {color:orange} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 38s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 7s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 26s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 40s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 21s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 31s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 10m 25s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 2m 42s{color} | {color:red} hbase-server generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}142m 54s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}187m 43s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hbase-server |
| | Possible null pointer dereference of snapshotManager in org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.getUnreferencedFiles(Iterable, SnapshotManager) Dereferenced at SnapshotFileCache.java:snapshotManager in org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.getUnreferencedFiles(Iterable, SnapshotManager) Dereferenced at SnapshotFileCache.java:[line 189] |
| Failed junit tests | hadoop.hbase.io.asyncfs.TestSaslFanOutOneBlockAsyncDFSOutput |
| | hadoop.hbase.master.snapshot.TestSnapshotFileCache |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL |
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683487#comment-16683487 ]
Hadoop QA commented on HBASE-21387:
---
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 4m 3s{color} | {color:blue} Docker mode activated. {color} |
| {color:blue}0{color} | {color:blue} patch {color} | {color:blue} 0m 3s{color} | {color:blue} The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:orange}-0{color} | {color:orange} test4tests {color} | {color:orange} 0m 0s{color} | {color:orange} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 20s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 22s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 14s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 51s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 43s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 50s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 10m 25s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}273m 33s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 59s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}323m 26s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.client.TestSnapshotTemporaryDirectory |
| | hadoop.hbase.client.TestSnapshotTemporaryDirectoryWithRegionReplicas |
| | hadoop.hbase.replication.TestReplicationSyncUpTool |
| | hadoop.hbase.client.TestFromClientSide3 |
| | hadoop.hbase.client.TestFromClientSide |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12947772/21387.v10.txt |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux a788a67d6512 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality |
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683206#comment-16683206 ]
Ted Yu commented on HBASE-21387:
https://reviews.apache.org/r/69316/
Adding a test may take some time. More than one countdown latch would be needed to control when the snapshot is moved into place. Introducing countdown latches solely for test purposes does not seem ideal.
BTW, I also have HBASE-21246 and HBASE-21466 going in parallel.
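The latch-based timing control mentioned in the comment above could look roughly like the following. This is only a sketch under stated assumptions: the class name `LatchOrderingSketch`, the latch names, and the event strings are hypothetical, and the two threads merely simulate the cleaner and the snapshot; no HBase code is involved. It forces the T1, T2, T3 interleaving from the issue description with two latches.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: forces the T1 -> T2 -> T3 interleaving with two latches. Not HBase code. */
class LatchOrderingSketch {

  /** Waits for the latch and fails loudly instead of hanging forever. */
  static void awaitOrFail(CountDownLatch latch) {
    try {
      if (!latch.await(5, TimeUnit.SECONDS)) {
        throw new IllegalStateException("latch was not released in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  /** Runs a simulated cleaner and a simulated snapshot; returns events in the order they happened. */
  static List<String> run() {
    List<String> events = new CopyOnWriteArrayList<>();
    CountDownLatch cacheRefreshed = new CountDownLatch(1);    // released once the cache refresh is done
    CountDownLatch snapshotCompleted = new CountDownLatch(1); // released once the snapshot is in place

    Thread cleaner = new Thread(() -> {
      events.add("T1: refreshCache() ran while S1 was in progress");
      cacheRefreshed.countDown();
      awaitOrFail(snapshotCompleted); // hold the cleaner until the snapshot has completed
      events.add("T3: in-progress check ran after S1 completed");
    });
    Thread snapshot = new Thread(() -> {
      awaitOrFail(cacheRefreshed);    // hold the snapshot until the cache refresh is done
      events.add("T2: snapshot S1 moved into place");
      snapshotCompleted.countDown();
    });

    cleaner.start();
    snapshot.start();
    try {
      cleaner.join();
      snapshot.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return events;
  }

  public static void main(String[] args) {
    run().forEach(System.out::println);
  }
}
```

In a real test the latches would have to be reachable from production code paths, which is the intrusion the comment calls not ideal.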
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683197#comment-16683197 ]
Zheng Hu commented on HBASE-21387:
--
BTW, I would appreciate it if you put this patch on RB.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683194#comment-16683194 ]
Zheng Hu commented on HBASE-21387:
--
Sorry for the delay. I skimmed the patch, and it seems it can fix this bug. Some comments:
{code}
   private ExecutorService executorService;
+  private volatile boolean snapshotAllowed = true;
{code}
Would an AtomicBoolean be better here? Then disableSnapshot could simply return snapshotAllowed.getAndSet(false) as one atomic operation.
Besides, I think we need a UT that reproduces this bug: SnapshotHFileCleaner running while some snapshots are in progress.
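The AtomicBoolean suggestion in the comment above might be sketched as follows. The class `SnapshotGate` and its method names are hypothetical stand-ins, not the classes touched by the patch; only `AtomicBoolean.getAndSet`, `set` and `get` are real JDK calls. `getAndSet(false)` swaps the value atomically and returns the previous one, so the caller learns in one step whether snapshots were allowed before the call.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the snapshot on/off switch discussed above; not the HBase class. */
class SnapshotGate {
  // true while new snapshots may start
  private final AtomicBoolean snapshotAllowed = new AtomicBoolean(true);

  /** Atomically turns snapshots off; returns true only if they were allowed before this call. */
  boolean disableSnapshots() {
    return snapshotAllowed.getAndSet(false);
  }

  /** Turns snapshots back on, for example when the cleaner has finished. */
  void enableSnapshots() {
    snapshotAllowed.set(true);
  }

  boolean isSnapshotAllowed() {
    return snapshotAllowed.get();
  }
}

class SnapshotGateDemo {
  public static void main(String[] args) {
    SnapshotGate gate = new SnapshotGate();
    boolean wasAllowed = gate.disableSnapshots(); // cleaner starts its pass
    System.out.println("allowed before disabling: " + wasAllowed);
    System.out.println("allowed during cleaning: " + gate.isSnapshotAllowed());
    gate.enableSnapshots();                       // cleaner finished its pass
    System.out.println("allowed after cleaning: " + gate.isSnapshotAllowed());
  }
}
```

With a plain volatile boolean, reading the old value and writing the new one are two separate steps, so two callers could both believe they were the one that disabled snapshots.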
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682644#comment-16682644 ]
Ted Yu commented on HBASE-21387:
One more note on why I chose 21387.v9.txt as the version for review: taking a snapshot is given priority over cleaning snapshot files, which may be delayed instead. This is because a failed snapshot has higher visibility than delayed snapshot cleaning.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680727#comment-16680727 ]
Zheng Hu commented on HBASE-21387:
--
Will take a look today. Thanks [~yuzhih...@gmail.com] for pinging me.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680721#comment-16680721 ]
Ted Yu commented on HBASE-21387:
[~openinx][~Apache9][~elserj] : Gentle ping.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678531#comment-16678531 ]

Ted Yu commented on HBASE-21387:

Here is a brief summary of the approaches I tried, most recent first. The first one is the patch expected to be reviewed.

21387.v9.txt : At the beginning of getUnreferencedFiles, snapshots are temporarily disabled. We then check whether there is an in-flight snapshot. If there is, no file is listed as unreferenced. Otherwise, the unreferenced files are filled out. During this time, any snapshot attempt is declined. At the end of getUnreferencedFiles, snapshots are enabled again.

two-pass-cleaner.v9.txt : The cleaner chore stores the candidates from its previous invocation and calculates the intersection of the previous and current candidates. The downside of this approach is that the extra candidates from the previous iteration consume (potentially a lot of) memory.

21387.v8.txt : SnapshotFileCache tries to obtain the in-progress snapshots under the lock. However, since the timing of when an in-progress snapshot completes is not under the control of SnapshotFileCache, it is hard to avoid the race condition.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678516#comment-16678516 ]

Josh Elser commented on HBASE-21387:

bq. Please take a look at 21387.v9.txt which solves the race condition between in-progress snapshot and hfile cleaner chore.

Ted, you've already provided multiple solutions in the form of patches. Please briefly summarize the different approaches you see so that others can give their input without having to read every patch closely.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678473#comment-16678473 ]

Ted Yu commented on HBASE-21387:

[~openinx][~Apache9][~elserj] : Please take a look at 21387.v9.txt, which solves the race condition between in-progress snapshots and the hfile cleaner chore. Your feedback is welcome.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677718#comment-16677718 ]

Hadoop QA commented on HBASE-21387:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | patch | 0m 3s | The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -0 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 30s | master passed |
| +1 | compile | 1m 40s | master passed |
| +1 | checkstyle | 1m 12s | master passed |
| +1 | shadedjars | 3m 49s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 3s | master passed |
| +1 | javadoc | 0m 28s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 49s | the patch passed |
| +1 | compile | 1m 45s | the patch passed |
| +1 | javac | 1m 45s | the patch passed |
| +1 | checkstyle | 1m 14s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 51s | patch has no errors when building our shaded downstream artifacts. |
| -1 | hadoopcheck | 6m 49s | The patch causes 15 errors with Hadoop v3.0.0. |
| +1 | findbugs | 2m 17s | the patch passed |
| +1 | javadoc | 0m 31s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 153m 14s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 19s | The patch does not generate ASF License warnings. |
| | | 189m 10s | |

|| Subsystem || Report/Notes ||
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12947139/21387.v9.txt |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 86cbbde |
| maven | version: Apache Maven 3.0.5 (r01de14724cdef164cd33c7c8c2fe155faf9602da; 2013-02-19 13:51:28+) |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC3 |
| hadoopcheck | https://builds.apache.org/job/PreCommit-HBASE-Build/14971/artifact/patchprocess/patch-javac-3.0.0.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/14971/testReport/ |
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677347#comment-16677347 ]

Ted Yu commented on HBASE-21387:

In 21387.v9.txt, I propose another approach. At the beginning of getUnreferencedFiles, snapshots are temporarily disabled. We then check whether there is an in-flight snapshot. If there is, no file is listed as unreferenced. Otherwise, the unreferenced files are filled out. During this time, any snapshot attempt is declined. At the end of getUnreferencedFiles, snapshots are enabled again.
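The approach proposed for 21387.v9.txt can be pictured with a minimal sketch. This is not the patch itself: the class SnapshotGateSketch, the methods tryStartSnapshot and finishSnapshot, and the Predicate-based signature of getUnreferencedFiles are hypothetical names chosen for illustration, file names are plain strings, and a single cleaner thread is assumed to call getUnreferencedFiles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: a hypothetical gate, not code from 21387.v9.txt.
class SnapshotGateSketch {
  private boolean snapshotsEnabled = true;
  private int inFlightSnapshots = 0;

  /** A snapshot request; returns false when the request is declined. */
  synchronized boolean tryStartSnapshot() {
    if (!snapshotsEnabled) {
      return false;
    }
    inFlightSnapshots++;
    return true;
  }

  /** Marks one in-flight snapshot as completed. */
  synchronized void finishSnapshot() {
    inFlightSnapshots--;
  }

  /** Disables new snapshots and reports whether no snapshot is in flight. */
  private synchronized boolean disableSnapshotsAndCheckIdle() {
    snapshotsEnabled = false;
    return inFlightSnapshots == 0;
  }

  private synchronized void enableSnapshots() {
    snapshotsEnabled = true;
  }

  /**
   * Returns the candidates that are not referenced, or an empty list when a
   * snapshot is in flight. Assumes a single cleaner thread calls this method.
   */
  List<String> getUnreferencedFiles(List<String> candidates, Predicate<String> isReferenced) {
    try {
      if (!disableSnapshotsAndCheckIdle()) {
        // An in-flight snapshot may reference any candidate: report nothing.
        return Collections.emptyList();
      }
      List<String> unreferenced = new ArrayList<>();
      for (String file : candidates) {
        if (!isReferenced.test(file)) {
          unreferenced.add(file);
        }
      }
      return unreferenced;
    } finally {
      // Snapshots are accepted again once the scan is over.
      enableSnapshots();
    }
  }
}
```

In this sketch a snapshot requested while the scan is running is simply declined, so the caller would have to retry. That is the price paid for never evaluating references against a snapshot set that changes mid-scan.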
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676977#comment-16676977 ]

Ted Yu commented on HBASE-21387:

In patch v9 (two-pass-cleaner.v9.txt), I added shouldTrackPreviousRound to FileCleanerDelegate, defaulting to true. For BaseLogCleanerDelegate, the method returns false, since the race condition described in this JIRA doesn't apply to WAL files.

There are still 3 subtests in TestCleanerChore that are failing. I want to get people's opinion on this approach.

Thanks
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675900#comment-16675900 ]

Ted Yu commented on HBASE-21387:

In two-pass-cleaner.v5.txt, the reference to the previous round is changed to a Set.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675871#comment-16675871 ]

Josh Elser commented on HBASE-21387:

{quote}
bq. where a snapshot was "orphaned" and prevent file cleaning from happening

I think by "orphaned" you are talking about not just two iterations for cleaner chore but many iterations. In that case, the situation in the current code base would prevent cleaning hfiles referenced, as well.
{quote}

Yes, that's the situation I mean. Perhaps we shouldn't be overly concerned about this one. I certainly think it is the cleaner approach (and something we can more easily reason about).
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675867#comment-16675867 ]

Ted Yu commented on HBASE-21387:

bq. holding onto file names in memory

We don't need to continue referencing FileStatus from the previous pass. A Path (or String) for each file would be sufficient.

bq. where a snapshot was "orphaned" and prevent file cleaning from happening

I think by "orphaned" you mean a snapshot that stays in progress not just for two iterations of the cleaner chore but for many. In that case, the current code base would also prevent the referenced hfiles from being cleaned.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675857#comment-16675857 ]

Josh Elser commented on HBASE-21387:

{quote}
Please take a look at two-pass-cleaner.v4.txt where cleaner chore keeps track of the files deemed cleanable from previous iteration. Only files deemed cleanable from previous and current iterations would be deleted.
{quote}

I think that would help, but I'm not sure it's the best way to go about it. With your two-pass-cleaner.v4 patch, we're holding onto file names in memory (which could get big). If we instead made a change to not clean files while there is an in-progress snapshot, we would be in trouble if we ever got into a situation where a snapshot was "orphaned" and prevented file cleaning from happening.

I'm not sure which I think is better (or whether there's something better still out there).
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675805#comment-16675805 ]

Ted Yu commented on HBASE-21387:

Solving the in-progress snapshot race condition is tricky. Please take a look at two-pass-cleaner.v4.txt, where the cleaner chore keeps track of the files deemed cleanable in the previous iteration. Only files deemed cleanable in both the previous and the current iteration would be deleted. This is a bigger change.
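The two-pass idea from Ted Yu's comment (delete a file only if it was deemed cleanable in both the previous and the current chore iteration) could be sketched roughly as follows. This is not the code from two-pass-cleaner.v4.txt; the class `TwoPassCleaner` and the method `selectDeletable` are hypothetical names made up for illustration, and files are represented by plain strings.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical two-pass filter; an illustration only, not the attached patch. */
class TwoPassCleaner {
  // Files judged cleanable by the previous chore iteration.
  private Set<String> previousCandidates = new HashSet<>();

  /**
   * Takes the files judged cleanable in the current iteration and returns
   * only those that were also judged cleanable in the previous iteration.
   */
  Set<String> selectDeletable(Set<String> currentCandidates) {
    Set<String> deletable = new HashSet<>(currentCandidates);
    deletable.retainAll(previousCandidates);

    // Remember this round's candidates for the next iteration. Files that are
    // about to be deleted no longer need to be tracked.
    previousCandidates = new HashSet<>(currentCandidates);
    previousCandidates.removeAll(deletable);
    return deletable;
  }
}
```

With this shape, a file that looks unreferenced for one iteration only (for example because a snapshot completed between the cache refresh and the in-progress check) survives that iteration, at the cost of every deletion being delayed by one chore period.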
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675799#comment-16675799 ]

Josh Elser commented on HBASE-21387:

Ok, cool. Thanks for confirming, Ted. I can appreciate where your fix is coming from, but I'm still skeptical that this is a complete fix.

In the current implementation of SnapshotFileCache#getSnapshotsInProgress(..), we acquire the lock on the in-progress snapshot before listing the files for it. That means the call to getSnapshotsInProgress(..) will block until the operation is complete (both for online and offline snapshot generation). So, we should never have a case where we read a snapshot's files while it's in the process of being written.

However, it does seem like there could be a case where a snapshot we knew to be in-progress finishes before the SnapshotFileCleaner "wakes up" (e.g. after TakeSnapshotHandler.completeSnapshot(..) is invoked). We may miss this newly created snapshot (which was in-progress when we started SnapshotFileCache#getUnreferencedFiles but is complete when we finish it).

The bigger problem is that there is still the potential for the SnapshotFileCleaner to go to sleep during {{getSnapshotsInProgress}} (in the current code or your patch) and miss a newly started snapshot. I do not think we can safely identify the files to retain for snapshots without precluding the submission of new snapshots.

I think your v8 patch improves this situation but does not completely solve it. WDYT?
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675738#comment-16675738 ]

Ted Yu commented on HBASE-21387:

Thanks for giving the timeline, Josh. The scenario you described is the race condition I am solving with patch v8.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675711#comment-16675711 ]

Josh Elser commented on HBASE-21387:

Looking through v8... I started off still struggling to understand the race condition, but I think I see it now.

At time T0, we are checking if F1 is referenced. At time T1, there is a snapshot S1 in progress that is referencing a file F1. refreshCache() is called, but no completed snapshot references F1. At T2, the snapshot S1, which references F1, completes. At T3, we check in-progress snapshots and S1 is not included. Thus, F1 is marked as unreferenced even though S1 references it.

Is this what you are saying the issue is, Ted?
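The T1 to T3 timeline in Josh's comment can be replayed deterministically in a few lines. The sketch below is a toy model, not HBase code: `SnapshotRaceTimeline` and its members are hypothetical names, snapshots are plain maps from a snapshot name to the files it references, and `refreshCache()` here only mirrors the behaviour described in the issue (it looks at completed snapshots only).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the race; all names are hypothetical. */
class SnapshotRaceTimeline {
  // Snapshot name -> files referenced by that snapshot.
  final Map<String, Set<String>> inProgress = new HashMap<>();
  final Map<String, Set<String>> completed = new HashMap<>();
  // Files referenced by completed snapshots as of the last refresh.
  final Set<String> cache = new HashSet<>();

  /** Rebuilds the cache from completed snapshots only, skipping in-progress ones. */
  void refreshCache() {
    cache.clear();
    for (Set<String> files : completed.values()) {
      cache.addAll(files);
    }
  }

  /** Moves a snapshot from in-progress to completed. */
  void completeSnapshot(String name) {
    Set<String> files = inProgress.remove(name);
    if (files != null) {
      completed.put(name, files);
    }
  }

  boolean referencedByInProgress(String file) {
    for (Set<String> files : inProgress.values()) {
      if (files.contains(file)) {
        return true;
      }
    }
    return false;
  }

  /** Replays T1..T3 and reports whether F1 is wrongly seen as unreferenced. */
  static boolean replay() {
    SnapshotRaceTimeline t = new SnapshotRaceTimeline();
    t.inProgress.put("S1", new HashSet<>(Arrays.asList("F1")));

    t.refreshCache();                                   // T1: S1 is still in progress
    boolean inCache = t.cache.contains("F1");           // false: no completed snapshot has F1
    t.completeSnapshot("S1");                           // T2: S1 completes
    boolean inProg = t.referencedByInProgress("F1");    // T3: S1 is no longer in progress
    return !inCache && !inProg;
  }
}
```

In this model a second `refreshCache()` after T2 would pick up F1, which is why the fixes discussed in this thread either refresh again or keep the file for the current round when the two views disagree.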
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674474#comment-16674474 ]

Hadoop QA commented on HBASE-21387:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
| 0 | patch | 0m 2s | The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 46s | master passed |
| +1 | compile | 1m 47s | master passed |
| +1 | checkstyle | 1m 6s | master passed |
| +1 | shadedjars | 4m 7s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 6s | master passed |
| +1 | javadoc | 0m 30s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 48s | the patch passed |
| +1 | compile | 1m 44s | the patch passed |
| +1 | javac | 1m 44s | the patch passed |
| -1 | checkstyle | 1m 5s | hbase-server: The patch generated 1 new + 1 unchanged - 1 fixed = 2 total (was 2) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 2s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 9m 52s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 7s | the patch passed |
| +1 | javadoc | 0m 31s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 124m 12s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 25s | The patch does not generate ASF License warnings. |
| | | 163m 48s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946827/21387.v8.txt |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux d49631c257f6 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / ea0f1391f1 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| checkstyle |
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674432#comment-16674432 ]

Ted Yu commented on HBASE-21387:

Patch v8 adds a boolean, needToCheckInProgressSnapshots, to {{getUnreferencedFiles}} so that the comparison between namesInProgress and snapshotNamesInProgressFromCacheRefresh is done only once. Without the additional boolean, the comparison may be performed many times: once for each file whose references need to be checked.
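A rough sketch of the "compare once" idea described for patch v8 follows. Only the names needToCheckInProgressSnapshots, namesInProgress and snapshotNamesInProgressFromCacheRefresh come from the comment; the class `InProgressCheckOnce`, the method `unreferenced`, the `comparisons` counter and the overall control flow are assumptions for illustration and may differ from the actual patch. The sketch also leaves out the lookup of files referenced by in-progress snapshots.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not the code from 21387.v8.txt. */
class InProgressCheckOnce {
  // Counts how often the two name sets are compared (for illustration only).
  int comparisons = 0;

  List<String> unreferenced(List<String> files, Set<String> cache,
      Set<String> snapshotNamesInProgressFromCacheRefresh, Set<String> namesInProgress) {
    List<String> result = new ArrayList<>();
    boolean needToCheckInProgressSnapshots = true;
    boolean viewsDiffer = false;

    for (String file : files) {
      if (cache.contains(file)) {
        continue; // referenced by a completed snapshot
      }
      if (needToCheckInProgressSnapshots) {
        // Compare the two views of in-progress snapshots at most once per call.
        comparisons++;
        viewsDiffer = !namesInProgress.equals(snapshotNamesInProgressFromCacheRefresh);
        needToCheckInProgressSnapshots = false;
      }
      if (viewsDiffer) {
        continue; // views disagree: keep the file for the current round
      }
      result.add(file);
    }
    return result;
  }
}
```

The flag turns a per-file set comparison into a single comparison per call, which matters when the cleaner is handed a large batch of files.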
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673926#comment-16673926 ]

Hadoop QA commented on HBASE-21387:

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
| 0 | patch | 0m 2s | The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 47s | master passed |
| +1 | compile | 1m 46s | master passed |
| +1 | checkstyle | 1m 4s | master passed |
| +1 | shadedjars | 4m 4s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 3s | master passed |
| +1 | javadoc | 0m 30s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 46s | the patch passed |
| +1 | compile | 1m 48s | the patch passed |
| +1 | javac | 1m 48s | the patch passed |
| +1 | checkstyle | 1m 4s | hbase-server: The patch generated 0 new + 1 unchanged - 1 fixed = 1 total (was 2) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 6s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 9m 50s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 9s | the patch passed |
| +1 | javadoc | 0m 28s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 121m 46s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 25s | The patch does not generate ASF License warnings. |
| | | 161m 17s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946772/21387.v7.txt |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux fac8a2a4af88 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 62fe365934 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/14941/testReport/ |
|
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673884#comment-16673884 ]

Hadoop QA commented on HBASE-21387:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
| 0 | patch | 0m 3s | The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 41s | master passed |
| +1 | compile | 1m 46s | master passed |
| +1 | checkstyle | 1m 4s | master passed |
| +1 | shadedjars | 4m 5s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 1s | master passed |
| +1 | javadoc | 0m 30s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 42s | the patch passed |
| +1 | compile | 1m 48s | the patch passed |
| +1 | javac | 1m 48s | the patch passed |
| -1 | checkstyle | 1m 5s | hbase-server: The patch generated 10 new + 1 unchanged - 1 fixed = 11 total (was 2) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 4s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 9m 47s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 1m 57s | the patch passed |
| +1 | javadoc | 0m 27s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 122m 16s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 27s | The patch does not generate ASF License warnings. |
| | | 161m 24s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946752/21387.v6.txt |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux fa3e06a28883 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 25c964e9a3 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| checkstyle |
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673793#comment-16673793 ]

Ted Yu commented on HBASE-21387:

In patch v6, I try to detect a discrepancy in the number of in-progress snapshots as seen by {{refreshCache}} versus as seen by {{getUnreferencedFiles}}. If there is a discrepancy, the file(s) are kept for the current round. See if this is easier to understand. Thanks
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673480#comment-16673480 ]

Ted Yu commented on HBASE-21387:

Currently refreshCache has a void return type:
{code}
private synchronized void refreshCache() throws IOException {
{code}
One potential fix is for {{refreshCache}} to return the name of the in-progress snapshot. {{getUnreferencedFiles}} stores the returned in-progress snapshot name and checks whether the name can still be found when calling {{getSnapshotsInProgress}}. If the name no longer appears as an in-progress snapshot, {{getUnreferencedFiles}} can invoke {{refreshCache}} again.
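The proposal in Ted Yu's comment (have refreshCache return the in-progress snapshot names, then refresh again if one of them has since disappeared) might look roughly like this. It is a simplified model under stated assumptions, not the HBase implementation: `RefreshReturningNames` is a hypothetical class, snapshots are in-memory maps rather than directories on HDFS, the real `getUnreferencedFiles` signature differs, and `refreshCount` exists only so the behaviour can be observed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of a cache whose refresh reports in-progress snapshots. */
class RefreshReturningNames {
  // Snapshot name -> files referenced by that snapshot.
  final Map<String, Set<String>> inProgress = new HashMap<>();
  final Map<String, Set<String>> completed = new HashMap<>();
  private final Set<String> cache = new HashSet<>();
  int refreshCount = 0;

  /** Rebuilds the cache and returns the snapshots that were in progress at that moment. */
  synchronized Set<String> refreshCache() {
    refreshCount++;
    Set<String> names = new HashSet<>(inProgress.keySet());
    cache.clear();
    for (Set<String> files : completed.values()) {
      cache.addAll(files);
    }
    return names;
  }

  synchronized void completeSnapshot(String name) {
    Set<String> files = inProgress.remove(name);
    if (files != null) {
      completed.put(name, files);
    }
  }

  synchronized List<String> getUnreferencedFiles(List<String> files, Set<String> namesFromRefresh) {
    // A snapshot seen as in progress at refresh time has gone away since then,
    // so the cache may be missing its files: refresh again.
    if (!inProgress.keySet().containsAll(namesFromRefresh)) {
      refreshCache();
    }
    Set<String> referenced = new HashSet<>(cache);
    for (Set<String> snapshotFiles : inProgress.values()) {
      referenced.addAll(snapshotFiles);
    }
    List<String> result = new ArrayList<>();
    for (String file : files) {
      if (!referenced.contains(file)) {
        result.add(file);
      }
    }
    return result;
  }
}
```

In this model both methods run under the same lock, so no snapshot can complete between the second refresh and the in-progress lookup. Whether that holds in the real code depends on how snapshot completion is synchronized with the cache, which is the concern raised elsewhere in this thread.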
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673467#comment-16673467 ]

Ted Yu commented on HBASE-21387:

For the unit test, the first idea is to use a CountDownLatch to reproduce the race condition. I am looking for a way to pass the CountDownLatch between TakeSnapshotHandler and SnapshotFileCache.
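To illustrate the CountDownLatch idea from this comment, here is a small sketch that forces the problematic interleaving deterministically. It models the race with plain sets and two threads; `LatchRaceSketch`, `reproduce`, and the set names are hypothetical, and nothing here touches TakeSnapshotHandler or SnapshotFileCache.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

/** Hypothetical model of the race; the names are illustrative, not HBase classes. */
class LatchRaceSketch {
  /** Returns true if file F1 is wrongly judged unreferenced under the forced interleaving. */
  static boolean reproduce() {
    Set<String> completedRefs = ConcurrentHashMap.newKeySet();  // refs held by completed snapshots
    Set<String> inProgressRefs = ConcurrentHashMap.newKeySet(); // refs held by in-progress snapshots
    inProgressRefs.add("F1"); // snapshot S1 is in progress and references F1

    CountDownLatch cacheRefreshed = new CountDownLatch(1);
    CountDownLatch snapshotCompleted = new CountDownLatch(1);

    Thread snapshotHandler = new Thread(() -> {
      try {
        cacheRefreshed.await();      // wait until the cleaner has refreshed its cache
        completedRefs.add("F1");     // S1 completes: its reference moves to the completed set
        inProgressRefs.remove("F1");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        snapshotCompleted.countDown();
      }
    });
    snapshotHandler.start();

    try {
      // Cleaner side, step 1: refresh the cache from completed snapshots only.
      Set<String> cache = new HashSet<>(completedRefs);
      cacheRefreshed.countDown();
      snapshotCompleted.await();     // force S1 to complete between the two steps
      // Cleaner side, step 2: consult the stale cache plus the current in-progress set.
      boolean referenced = cache.contains("F1") || inProgressRefs.contains("F1");
      snapshotHandler.join();
      return !referenced;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while forcing the interleaving", e);
    }
  }
}
```

The two latches pin the order to refresh, then snapshot completion, then in-progress check. A real test would need a way to hand equivalent latches to the production classes, which is the open question in the comment.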
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673419#comment-16673419 ]

Ted Yu commented on HBASE-21387:

Josh, the race condition surrounding in-progress snapshots is described in the description of this JIRA. Let me try to:
* collect the relevant SnapshotFileCache log
* see if a unit test can be written to reproduce the race condition
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673406#comment-16673406 ] Josh Elser commented on HBASE-21387: [~yuzhih...@gmail.com], I don't understand what you're doing with [^21387.v2.txt]. You filed this issue to fix this apparent race condition, but now you're submitting patches for something else entirely here? > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.v1.txt, 21387.v2.txt, 21387.v3.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > which only excludes the temp dir, but not in progress snapshot(s). 
> Suppose when the RefreshCacheTask runs refreshCache, SnapshotDirectoryInfo > for the in progress snapshot doesn't include all store file (leaving some > hole in cache). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673171#comment-16673171 ]

Ted Yu commented on HBASE-21387:

From https://builds.apache.org/job/PreCommit-HBASE-Build/14932/console :
{code}
00:38:23 +1 overall
00:38:23
00:38:23 | Vote | Subsystem | Runtime | Comment
00:38:23
00:38:23 | 0 | reexec | 0m 11s | Docker mode activated.
00:38:23 | 0 | patch | 0m 2s | The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions.
00:38:23 | | | | Prechecks
00:38:23 | +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns.
00:38:23 | +1 | @author | 0m 0s | The patch does not contain any @author tags.
00:38:23 | -0 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
00:38:23 | | | | master Compile Tests
00:38:23 | +1 | mvninstall | 4m 49s | master passed
00:38:23 | +1 | compile | 1m 46s | master passed
00:38:23 | +1 | checkstyle | 1m 7s | master passed
00:38:23 | +1 | shadedjars | 4m 2s | branch has no errors when building our shaded downstream artifacts.
00:38:23 | +1 | findbugs | 2m 1s | master passed
00:38:23 | +1 | javadoc | 0m 30s | master passed
00:38:23 | | | | Patch Compile Tests
00:38:23 | +1 | mvninstall | 4m 45s | the patch passed
00:38:23 | +1 | compile | 1m 50s | the patch passed
00:38:23 | +1 | javac | 1m 50s | the patch passed
00:38:23 | +1 | checkstyle | 1m 4s | the patch passed
00:38:23 | +1 | whitespace | 0m 0s | The patch has no whitespace issues.
00:38:23 | +1 | shadedjars | 4m 6s | patch has no errors when building our shaded downstream artifacts.
00:38:24 | +1 | hadoopcheck | 9m 53s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0.
00:38:24 | +1 | findbugs | 2m 11s | the patch passed
00:38:24 | +1 | javadoc | 0m 29s | the patch passed
00:38:24 | | | | Other Tests
00:38:24 | +1 | unit | 128m 21s | hbase-server in the patch passed.
00:38:24 | +1 | asflicense | 0m 25s | The patch does not generate ASF License warnings.
00:38:24 | | | 168m 0s |
00:38:24
00:38:24
00:38:24 || Subsystem || Report/Notes ||
00:38:24
00:38:24 | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
00:38:24 | JIRA Issue | HBASE-21387 |
00:38:24 | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946617/21387.v3.txt |
{code}
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672436#comment-16672436 ]

Ted Yu commented on HBASE-21387:

{code}
[ERROR] TestReplicationKillSlaveRSWithSeparateOldWALs.killOneSlaveRS » RetriesExhausted
{code}
I ran TestReplicationKillSlaveRSWithSeparateOldWALs with the patch applied, and it passed.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672424#comment-16672424 ]

Hadoop QA commented on HBASE-21387:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 17s | Docker mode activated. |
| 0 | patch | 0m 2s | The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -0 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 5m 9s | master passed |
| +1 | compile | 1m 57s | master passed |
| +1 | checkstyle | 1m 3s | master passed |
| +1 | shadedjars | 4m 10s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 10s | master passed |
| +1 | javadoc | 0m 30s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 5m 1s | the patch passed |
| +1 | compile | 1m 50s | the patch passed |
| +1 | javac | 1m 50s | the patch passed |
| +1 | checkstyle | 1m 4s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 11s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 10m 16s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 19s | the patch passed |
| +1 | javadoc | 0m 31s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 255m 26s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 18s | The patch does not generate ASF License warnings. |
| | | 296m 40s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946607/21387.v2.txt |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux d5635b7b7638 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / ee55b558c0 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| unit |
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672406#comment-16672406 ]

Hadoop QA commented on HBASE-21387:

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
| 0 | patch | 0m 2s | The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -0 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 49s | master passed |
| +1 | compile | 1m 46s | master passed |
| +1 | checkstyle | 1m 7s | master passed |
| +1 | shadedjars | 4m 2s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 1s | master passed |
| +1 | javadoc | 0m 30s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 45s | the patch passed |
| +1 | compile | 1m 50s | the patch passed |
| +1 | javac | 1m 50s | the patch passed |
| +1 | checkstyle | 1m 4s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 6s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 9m 53s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 11s | the patch passed |
| +1 | javadoc | 0m 29s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 128m 21s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 25s | The patch does not generate ASF License warnings. |
| | | 168m 0s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21387 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946617/21387.v3.txt |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux fe1b3a2329f7 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / ee55b558c0 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| Test Results |
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672271#comment-16672271 ]

Ted Yu commented on HBASE-21387:

You're right - with the Filter in place, the check is not needed.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672263#comment-16672263 ]

Josh Elser commented on HBASE-21387:

bq. With proposed change, no in progress snapshot would be included.

I must be missing it because I don't understand (looking at branch-1) any functional difference between what is in that code now and what you have changed. I am assuming that you are suggesting the following line also be removed:
{code}
if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
{code}
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672246#comment-16672246 ]

Ted Yu commented on HBASE-21387:

The current code would include in-progress snapshot(s):
{code}
FileStatus[] snapshots = FSUtils.listStatus(fs, snapshotDir);
{code}
With the proposed change, no in-progress snapshot would be included.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672236#comment-16672236 ]

Josh Elser commented on HBASE-21387:

{{CompletedSnaphotDirectoriesFilter}} seems to just blacklist directories that match {{SNAPSHOT_TMP_DIR_NAME}}. The change here doesn't seem to be doing anything of consequence (it just pushes the filter down into the DFS API).
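Josh's point in this comment, that moving the temp-dir check into a listing filter does not change which directories are seen, can be illustrated with a small sketch. It works on plain lists of directory names rather than the HDFS API. `FilterPushdownSketch` and its methods are hypothetical, and the value `".tmp"` for `SNAPSHOT_TMP_DIR_NAME` is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for the directory listing; not the HBase or HDFS API. */
class FilterPushdownSketch {
  // Assumed value of SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME.
  static final String SNAPSHOT_TMP_DIR_NAME = ".tmp";

  // The same condition expressed as a filter for the listing call.
  static final Predicate<String> COMPLETED_ONLY =
      name -> !name.equals(SNAPSHOT_TMP_DIR_NAME);

  /** Lists everything, then skips the temp dir inside the loop (the existing check). */
  static List<String> listThenCheck(List<String> dirNames) {
    List<String> result = new ArrayList<>();
    for (String name : dirNames) {
      if (!name.equals(SNAPSHOT_TMP_DIR_NAME)) {
        result.add(name);
      }
    }
    return result;
  }

  /** Applies the condition while listing (the change under discussion). */
  static List<String> listWithFilter(List<String> dirNames, Predicate<String> filter) {
    List<String> result = new ArrayList<>();
    for (String name : dirNames) {
      if (filter.test(name)) {
        result.add(name);
      }
    }
    return result;
  }
}
```

Both methods return the same directories for any input, which is why the pushdown alone cannot close the race described in this issue.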