[jira] [Commented] (HBASE-20952) Re-visit the WAL API
[ https://issues.apache.org/jira/browse/HBASE-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677639#comment-16677639 ]

Ted Yu commented on HBASE-20952:

From https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/41// , we can see that TestIncrementalBackupWithBulkLoad failed for the hadoop3 build. This is a known issue; see HADOOP-15850.
Other than that, the build on the HBASE-20952 branch is quite normal.

> Re-visit the WAL API
>
> Key: HBASE-20952
> URL: https://issues.apache.org/jira/browse/HBASE-20952
> Project: HBase
> Issue Type: Improvement
> Components: wal
> Reporter: Josh Elser
> Priority: Major
> Attachments: 20952.v1.txt
>
> Take a step back from the current WAL implementations and think about what an HBase WAL API should look like. What are the primitive calls that we require to guarantee durability of writes with a high degree of performance?
> The API needs to take the current implementations into consideration. We should also have a mind for what is happening in the Ratis LogService (but the LogService should not dictate what HBase's WAL API looks like RATIS-272).
> Other "systems" inside of HBase that use WALs are replication and backup&restore. Replication has the use case of "tail"'ing the WAL, which we should provide via our new API. B&R doesn't do anything fancy (IIRC). We should make sure all consumers are generally going to be OK with the API we create.
> The API may be "OK" (or OK in a part). We need to also consider other methods which were "bolted" on such as {{AbstractFSWAL}} and {{WALFileLengthProvider}}. Other corners of "WAL use" (like the {{WALSplitter}}) should also be looked at to use WAL-APIs only.
> We also need to make sure that adequate interface audience and stability annotations are chosen.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21387:
---
Status: Patch Available (was: Open)

> Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
> Issue Type: Bug
> Reporter: Ted Yu
> Assignee: Ted Yu
> Priority: Major
> Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
> A recent customer report described an ExportSnapshot failure:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table.
> {code}
> We found the following in the log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: RefreshCacheTask#run and SnapshotHFileCleaner.
> Consider this check in refreshCache:
> {code}
> if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> Its intention is to exclude in-progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is an in-progress snapshot that is about to finish.
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that lastModifiedTime is up to date, so the cleaner proceeds to check the in-progress snapshot(s). However, the snapshot has completed by that time, so some file(s) are wrongly deemed unreferenced.
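The interleaving described above can be replayed in a small single-threaded toy model. This is only a sketch of one reading of the description: every class, field, and method name below is hypothetical, and the assumption that completing a snapshot bumps the directory modification time is part of the model, not something taken from the HBase code.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the race; all names here are hypothetical, not HBase classes. */
final class SnapshotRaceSketch {
  private final Set<String> inProgressFiles = new HashSet<>(); // referenced from the temp snapshot dir
  private final Set<String> completedFiles = new HashSet<>();  // referenced by finished snapshots
  private final Set<String> cache = new HashSet<>();           // what the cleaner consults
  private long dirModTime = 0;     // modeled modification time of the snapshot directory
  private long cachedModTime = -1; // modification time seen by the last refresh

  void startSnapshot(String hfile) {
    inProgressFiles.add(hfile);
  }

  /** Assumption of this model: finishing a snapshot also changes the directory's modification time. */
  void completeSnapshot() {
    completedFiles.addAll(inProgressFiles);
    inProgressFiles.clear();
    dirModTime++;
  }

  /** Like the quoted name check, a refresh skips in-progress snapshots. */
  void refreshCache() {
    cache.clear();
    cache.addAll(completedFiles);
    cachedModTime = dirModTime;
  }

  boolean cacheLooksFresh() {
    return cachedModTime == dirModTime;
  }

  boolean referencedInCache(String hfile) {
    return cache.contains(hfile);
  }

  boolean referencedInProgress(String hfile) {
    return inProgressFiles.contains(hfile);
  }

  /** Replays the interleaving; returns true if the hfile is wrongly deemed unreferenced. */
  static boolean replayRace() {
    SnapshotRaceSketch s = new SnapshotRaceSketch();
    String hfile = "hfile-1";
    s.startSnapshot(hfile);
    s.refreshCache();                    // RefreshCacheTask: the in-progress snapshot is skipped
    boolean fresh = s.cacheLooksFresh(); // cleaner, step 1: the cache looks up to date
    s.completeSnapshot();                // the snapshot finishes between the two steps
    if (!fresh) {
      s.refreshCache();                  // not taken in this interleaving
    }
    // Cleaner, step 2: neither the stale cache nor the now-empty in-progress set knows the file.
    boolean referenced = s.referencedInCache(hfile) || s.referencedInProgress(hfile);
    return !referenced;
  }

  public static void main(String[] args) {
    System.out.println("wrongly unreferenced: " + replayRace());
  }
}
```

The point of the model is that the freshness check and the in-progress check are two separate steps, so a snapshot that completes between them is visible to neither.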
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21387:
---
Status: Open (was: Patch Available)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21387:
---
Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21387:
---
Attachment: (was: 21387.v9.txt)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21387:
---
Attachment: 21387.v9.txt
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21387:
---
Status: Open (was: Patch Available)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21387:
---
Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677347#comment-16677347 ]

Ted Yu commented on HBASE-21387:

In 21387.v9.txt, I propose another approach.
At the beginning of getUnreferencedFiles, snapshots are temporarily disabled.
We then check whether there is an in-flight snapshot. If there is, no file is listed as unreferenced. Otherwise, the unreferenced files are filled out.
During this time, any snapshot attempt is declined.
At the end of getUnreferencedFiles, snapshots are re-enabled.
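The approach proposed in the comment above could look roughly like the following. This is a minimal sketch, not the contents of 21387.v9.txt: the class, its fields, and every method name except getUnreferencedFiles are hypothetical, and it assumes a single cleaner thread calls getUnreferencedFiles at a time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the proposal; these names are not taken from the patch. */
final class SnapshotGateSketch {
  private final Object lock = new Object();
  private final Set<String> referencedFiles = new HashSet<>();
  private boolean snapshotsEnabled = true;
  private int inFlightSnapshots = 0;

  /** Returns false, declining the attempt, while the cleaner has snapshots disabled. */
  boolean tryBeginSnapshot() {
    synchronized (lock) {
      if (!snapshotsEnabled) {
        return false;
      }
      inFlightSnapshots++;
      return true;
    }
  }

  /** Marks one in-flight snapshot as complete and records the files it references. */
  void finishSnapshot(List<String> files) {
    synchronized (lock) {
      referencedFiles.addAll(files);
      inFlightSnapshots--;
    }
  }

  void disableSnapshots() {
    synchronized (lock) {
      snapshotsEnabled = false;
    }
  }

  void enableSnapshots() {
    synchronized (lock) {
      snapshotsEnabled = true;
    }
  }

  /** Assumes one cleaner thread at a time; concurrent callers would re-enable too early. */
  List<String> getUnreferencedFiles(List<String> candidates) {
    disableSnapshots();
    try {
      synchronized (lock) {
        if (inFlightSnapshots > 0) {
          // A snapshot is still running: list nothing as unreferenced this round.
          return Collections.emptyList();
        }
      }
      // No snapshot is in flight and none can start, so referencedFiles is stable here.
      List<String> unreferenced = new ArrayList<>();
      for (String file : candidates) {
        if (!referencedFiles.contains(file)) {
          unreferenced.add(file);
        }
      }
      return unreferenced;
    } finally {
      enableSnapshots();
    }
  }
}
```

The trade-off in this sketch is that a cleaner pass can delay snapshots and an in-flight snapshot can postpone cleaning, in exchange for never evaluating files while a snapshot is changing state.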
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21387:
---
Attachment: 21387.v9.txt
[jira] [Updated] (HBASE-21247) Custom Meta WAL Provider doesn't default to custom WAL Provider whose configuration value is outside the enums in Providers
[ https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21247:
---
Resolution: Fixed
Fix Version/s: 2.1.2, 2.2.0
Status: Resolved (was: Patch Available)

> Custom Meta WAL Provider doesn't default to custom WAL Provider whose configuration value is outside the enums in Providers
>
> Key: HBASE-21247
> URL: https://issues.apache.org/jira/browse/HBASE-21247
> Project: HBase
> Issue Type: Bug
> Components: wal
> Affects Versions: 3.0.0, 2.2.0, 2.1.1, 2.0.2
> Reporter: Ted Yu
> Assignee: Ted Yu
> Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.2
> Attachments: 21247.branch-2.patch, 21247.v1.txt, 21247.v10.txt, 21247.v11.txt, 21247.v2.txt, 21247.v3.txt, 21247.v4.tst, 21247.v4.txt, 21247.v5.txt, 21247.v6.txt, 21247.v7.txt, 21247.v8.txt, 21247.v9.txt, HBASE-21247.branch-2.001.patch
>
> Currently all the WAL Providers acceptable to hbase are specified in Providers enum of WALFactory.
> This restricts the ability for custom Meta WAL Provider to default to the custom WAL Provider which is supplied by class name.
> This issue fixes the bug by allowing the specification of new WAL Provider class name using the config "hbase.wal.provider".
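The kind of lookup this issue describes can be sketched as follows: a configured value is first tried as an enum short name and, failing that, as a fully qualified class name, with the meta provider falling back to whatever the WAL provider resolves to. This is an illustration only. The interface, the provider classes, the one-entry enum, the "filesystem" default, and the key "example.meta.wal.provider" are all made up for the sketch; only the key "hbase.wal.provider" comes from the issue text.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; the types below are stand-ins, not the HBase WALFactory API. */
final class WalProviderLookupSketch {
  interface WalProvider {}

  static final class FileSystemProvider implements WalProvider {}

  static final class CustomProvider implements WalProvider {}

  /** Stand-in for a Providers-style enum of built-in short names. */
  enum Providers {
    filesystem(FileSystemProvider.class);

    final Class<? extends WalProvider> clazz;

    Providers(Class<? extends WalProvider> clazz) {
      this.clazz = clazz;
    }
  }

  static final String WAL_PROVIDER_KEY = "hbase.wal.provider";
  static final String META_PROVIDER_KEY = "example.meta.wal.provider"; // made-up key for this sketch

  /** Tries the value as an enum short name first, then as a class name. */
  static Class<? extends WalProvider> resolve(String value) {
    try {
      return Providers.valueOf(value).clazz;
    } catch (IllegalArgumentException notAnEnumName) {
      try {
        return Class.forName(value).asSubclass(WalProvider.class);
      } catch (ClassNotFoundException e) {
        throw new IllegalArgumentException("Unknown WAL provider: " + value, e);
      }
    }
  }

  /** The meta provider defaults to the configured WAL provider, enum name or class name alike. */
  static Class<? extends WalProvider> resolveMeta(Map<String, String> conf) {
    String walValue = conf.getOrDefault(WAL_PROVIDER_KEY, Providers.filesystem.name());
    return resolve(conf.getOrDefault(META_PROVIDER_KEY, walValue));
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(WAL_PROVIDER_KEY, CustomProvider.class.getName());
    System.out.println(resolveMeta(conf).getSimpleName());
  }
}
```

Falling back to the class name only after the enum lookup fails keeps existing short names working while letting a value outside the enum resolve as well.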
[jira] [Commented] (HBASE-21347) Backport HBASE-21200 "Memstore flush doesn't finish because of seekToPreviousRow() in memstore scanner." to branch-1
[ https://issues.apache.org/jira/browse/HBASE-21347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677156#comment-16677156 ]

Ted Yu commented on HBASE-21347:

Josh:
Please go ahead.

> Backport HBASE-21200 "Memstore flush doesn't finish because of seekToPreviousRow() in memstore scanner." to branch-1
>
> Key: HBASE-21347
> URL: https://issues.apache.org/jira/browse/HBASE-21347
> Project: HBase
> Issue Type: Sub-task
> Components: backport, Scanners
> Reporter: Toshihiro Suzuki
> Assignee: Toshihiro Suzuki
> Priority: Critical
> Attachments: HBASE-21347.branch-1.001.patch
>
> Backport parent issue to branch-1.
[jira] [Commented] (SPARK-25954) Upgrade to Kafka 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-25954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677152#comment-16677152 ]

Ted Yu commented on SPARK-25954:

Looking at the Kafka thread, a message from Satish indicated there may be another RC coming.

> Upgrade to Kafka 2.1.0
>
> Key: SPARK-25954
> URL: https://issues.apache.org/jira/browse/SPARK-25954
> Project: Spark
> Issue Type: Sub-task
> Components: Structured Streaming
> Affects Versions: 3.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> Kafka 2.1.0 RC0 has started. Since this includes official KAFKA-7264 JDK 11 support, we had better use that.
> - https://lists.apache.org/thread.html/8288f0afdfed4d329f1a8338320b6e24e7684a0593b4bbd6f1b79101@%3Cdev.kafka.apache.org%3E
[jira] [Updated] (HBASE-21247) Custom Meta WAL Provider doesn't default to custom WAL Provider whose configuration value is outside the enums in Providers
[ https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21247:
---
Attachment: HBASE-21247.branch-2.001.patch
[jira] [Reopened] (HBASE-21247) Custom Meta WAL Provider doesn't default to custom WAL Provider whose configuration value is outside the enums in Providers
[ https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu reopened HBASE-21247:
[jira] [Updated] (HBASE-21247) Custom Meta WAL Provider doesn't default to custom WAL Provider whose configuration value is outside the enums in Providers
[ https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21247:
---
Attachment: 21247.branch-2.patch
[jira] [Updated] (HBASE-21247) Custom Meta WAL Provider doesn't default to custom WAL Provider whose configuration value is outside the enums in Providers
[ https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21247:
---
Status: Patch Available (was: Reopened)
[jira] [Reopened] (HBASE-21247) Custom Meta WAL Provider doesn't default to custom WAL Provider whose configuration value is outside the enums in Providers
[ https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu reopened HBASE-21247: > Custom Meta WAL Provider doesn't default to custom WAL Provider whose > configuration value is outside the enums in Providers > --- > > Key: HBASE-21247 > URL: https://issues.apache.org/jira/browse/HBASE-21247 > Project: HBase > Issue Type: Bug > Components: wal >Affects Versions: 3.0.0, 2.2.0, 2.1.1, 2.0.2 >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: 3.0.0 > > Attachments: 21247.branch-2.patch, 21247.v1.txt, 21247.v10.txt, > 21247.v11.txt, 21247.v2.txt, 21247.v3.txt, 21247.v4.tst, 21247.v4.txt, > 21247.v5.txt, 21247.v6.txt, 21247.v7.txt, 21247.v8.txt, 21247.v9.txt > > > Currently all the WAL Providers acceptable to hbase are specified in > Providers enum of WALFactory. > This restricts the ability for custom Meta WAL Provider to default to the > custom WAL Provider which is supplied by class name. > This issue fixes the bug by allowing the specification of new WAL Provider > class name using the config "hbase.wal.provider". -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21247) Custom Meta WAL Provider doesn't default to custom WAL Provider whose configuration value is outside the enums in Providers
[ https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21247: --- Description: Currently all the WAL Providers acceptable to hbase are specified in Providers enum of WALFactory. This restricts the ability for custom Meta WAL Provider to default to the custom WAL Provider which is supplied by class name. This issue fixes the bug by allowing the specification of new WAL Provider class name using the config "hbase.wal.provider". was: Currently all the WAL Providers acceptable to hbase are specified in Providers enum of WALFactory. This restricts the ability for additional WAL Providers to be supplied - by class name. This issue fixes the bug by allowing the specification of new WAL Provider class name using the config "hbase.wal.provider". > Custom Meta WAL Provider doesn't default to custom WAL Provider whose > configuration value is outside the enums in Providers > --- > > Key: HBASE-21247 > URL: https://issues.apache.org/jira/browse/HBASE-21247 > Project: HBase > Issue Type: Bug > Components: wal >Affects Versions: 3.0.0, 2.2.0, 2.1.1, 2.0.2 > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: 3.0.0 > > Attachments: 21247.v1.txt, 21247.v10.txt, 21247.v11.txt, > 21247.v2.txt, 21247.v3.txt, 21247.v4.tst, 21247.v4.txt, 21247.v5.txt, > 21247.v6.txt, 21247.v7.txt, 21247.v8.txt, 21247.v9.txt > > > Currently all the WAL Providers acceptable to hbase are specified in > Providers enum of WALFactory. > This restricts the ability for custom Meta WAL Provider to default to the > custom WAL Provider which is supplied by class name. > This issue fixes the bug by allowing the specification of new WAL Provider > class name using the config "hbase.wal.provider". -- This message was sent by Atlassian JIRA (v7.6.3#76005)
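The lookup described in this issue can be sketched as follows. This is a minimal illustration, not the code from the attached patches: the `BuiltIn` enum, the `Provider` interface, `CustomProvider`, and the returned description strings are all hypothetical stand-ins, and the meta key "hbase.wal.meta_provider" is an assumption (only "hbase.wal.provider" is named in the description). The point is that a value outside the enum is treated as a class name, and that the meta provider falls back to whatever the regular provider setting is, including a custom class.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ProviderLookupSketch {
    // Hypothetical stand-in for the built-in names; not the real Providers enum.
    public enum BuiltIn { FILESYSTEM, ASYNCFS }

    // Hypothetical marker interface standing in for a WAL provider.
    public interface Provider {}

    // Example custom provider that is only reachable by class name.
    public static class CustomProvider implements Provider {}

    // Resolve a config value: try the enum first, then treat it as a class name.
    public static String resolve(String value) {
        try {
            return "builtin:" + BuiltIn.valueOf(value.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException notAnEnumName) {
            try {
                Class<?> clazz = Class.forName(value);
                if (!Provider.class.isAssignableFrom(clazz)) {
                    throw new IllegalArgumentException(value + " is not a Provider");
                }
                return "class:" + clazz.getName();
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("Unknown provider: " + value, e);
            }
        }
    }

    // The meta provider defaults to the regular provider setting (assumed key names).
    public static String resolveMeta(Map<String, String> conf) {
        String wal = conf.getOrDefault("hbase.wal.provider", "filesystem");
        String meta = conf.getOrDefault("hbase.wal.meta_provider", wal);
        return resolve(meta);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("hbase.wal.provider", CustomProvider.class.getName());
        // No meta setting is present, so the custom class is used for meta as well.
        System.out.println(resolveMeta(conf));
    }
}
```

Trying the enum before the class name keeps existing short values working, while any other value is loaded reflectively and checked against the provider interface.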
[jira] [Updated] (HBASE-21247) Custom Meta WAL Provider doesn't default to custom WAL Provider whose configuration value is outside the enums in Providers
[ https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21247: --- Summary: Custom Meta WAL Provider doesn't default to custom WAL Provider whose configuration value is outside the enums in Providers (was: Custom WAL Provider cannot be specified by configuration whose value is outside the enums in Providers) > Custom Meta WAL Provider doesn't default to custom WAL Provider whose > configuration value is outside the enums in Providers > --- > > Key: HBASE-21247 > URL: https://issues.apache.org/jira/browse/HBASE-21247 > Project: HBase > Issue Type: Bug > Components: wal >Affects Versions: 3.0.0, 2.2.0, 2.1.1, 2.0.2 >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: 3.0.0 > > Attachments: 21247.v1.txt, 21247.v10.txt, 21247.v11.txt, > 21247.v2.txt, 21247.v3.txt, 21247.v4.tst, 21247.v4.txt, 21247.v5.txt, > 21247.v6.txt, 21247.v7.txt, 21247.v8.txt, 21247.v9.txt > > > Currently all the WAL Providers acceptable to hbase are specified in > Providers enum of WALFactory. > This restricts the ability for additional WAL Providers to be supplied - by > class name. > This issue fixes the bug by allowing the specification of new WAL Provider > class name using the config "hbase.wal.provider". -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676977#comment-16676977 ] Ted Yu commented on HBASE-21387: In patch v9, I added shouldTrackPreviousRound to FileCleanerDelegate, defaulting to true. For BaseLogCleanerDelegate, the method returns false, since the race condition described in this JIRA doesn't apply to WAL files. There are still 3 subtests in TestCleanerChore that are failing. I would like to get people's opinions on this approach. Thanks > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, > two-pass-cleaner.v9.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. 
> Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
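The two-pass approach discussed in the comments on this issue can be sketched as below. This is an illustration under stated assumptions, not the content of the two-pass-cleaner patches: the `Delegate` interface and `runRound` are hypothetical, and only the name shouldTrackPreviousRound comes from the comment. Paths are kept as plain strings. A file is deleted only when it was deemed cleanable in the previous round and again in the current one, so a file that is picked up by a just-completed snapshot between rounds is not removed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TwoPassCleanerSketch {
    // Hypothetical delegate; only shouldTrackPreviousRound is named in the comment.
    public interface Delegate {
        List<String> getDeletableFiles(List<String> candidates);

        // Delegates that are not affected by the race can opt out by returning false.
        default boolean shouldTrackPreviousRound() {
            return true;
        }
    }

    private final Delegate delegate;
    // Paths deemed cleanable in the previous round but not yet deleted.
    private Set<String> previousRound = new HashSet<>();

    public TwoPassCleanerSketch(Delegate delegate) {
        this.delegate = delegate;
    }

    // Runs one round and returns the paths that would be deleted in it.
    public List<String> runRound(List<String> candidates) {
        List<String> deletable = delegate.getDeletableFiles(candidates);
        if (!delegate.shouldTrackPreviousRound()) {
            return deletable;
        }
        List<String> toDelete = new ArrayList<>();
        for (String path : deletable) {
            if (previousRound.contains(path)) {
                toDelete.add(path);
            }
        }
        // Remember what was cleanable this round, minus what is being deleted now.
        previousRound = new HashSet<>(deletable);
        previousRound.removeAll(toDelete);
        return toDelete;
    }

    public static void main(String[] args) {
        // This delegate treats every candidate as cleanable.
        TwoPassCleanerSketch cleaner = new TwoPassCleanerSketch(c -> c);
        System.out.println(cleaner.runRound(Arrays.asList("a", "b")));
        System.out.println(cleaner.runRound(Arrays.asList("a", "b", "c")));
    }
}
```

The cost of this design is that every deletion is delayed by one round of the chore, which is why a delegate such as the one for WAL files can opt out.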
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: two-pass-cleaner.v9.txt > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, > two-pass-cleaner.v9.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21381: --- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.0.0 Status: Resolved (was: Patch Available) Thanks for the patch, liubang. Thanks for the review, Wei-Chiu > Document the hadoop versions using which backup and restore feature works > - > > Key: HBASE-21381 > URL: https://issues.apache.org/jira/browse/HBASE-21381 > Project: HBase > Issue Type: Task > Reporter: Ted Yu >Assignee: liubangchen >Priority: Major > Fix For: 3.0.0 > > Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch, > HBASE-21381-3.patch > > > HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally > tried to concatenate the files being DistCp'ed to target cluster (though the > files are independent). > Following is the log snippet of the failed concatenation attempt: > {code} > 2018-10-13 14:09:25,351 WARN [Thread-936] mapred.LocalJobRunner$Job(590): > job_local1795473782_0004 > java.io.IOException: Inconsistent sequence file: current chunk file > org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/ > > 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ > length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry > org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e- > > 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ > length = 5142 aclEntries = null, xAttrs = null} > at > org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276) > at > org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567) > 
{code} > Backup and Restore uses DistCp to transfer files between clusters. > Without the fix from HADOOP-15850, the transfer would fail. > This issue is to document the hadoop versions which contain HADOOP-15850 so > that user of Backup and Restore feature knows which hadoop versions they can > use. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21347) Backport HBASE-21200 "Memstore flush doesn't finish because of seekToPreviousRow() in memstore scanner." to branch-1
[ https://issues.apache.org/jira/browse/HBASE-21347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676052#comment-16676052 ] Ted Yu commented on HBASE-21347: lgtm > Backport HBASE-21200 "Memstore flush doesn't finish because of > seekToPreviousRow() in memstore scanner." to branch-1 > > > Key: HBASE-21347 > URL: https://issues.apache.org/jira/browse/HBASE-21347 > Project: HBase > Issue Type: Sub-task > Components: backport, Scanners >Reporter: Toshihiro Suzuki >Assignee: Toshihiro Suzuki >Priority: Critical > Attachments: HBASE-21347.branch-1.001.patch > > > Backport parent issue to branch-1. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: two-pass-cleaner.v6.txt > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: (was: two-pass-cleaner.v6.txt) > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675900#comment-16675900 ] Ted Yu edited comment on HBASE-21387 at 11/6/18 12:47 AM: -- In two-pass-cleaner.v6.txt , the reference to previous round is changed to Set. The length of file is needed by HFileCleaner was (Author: yuzhih...@gmail.com): In two-pass-cleaner.v5.txt , the reference to previous round is changed to Set. > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. 
> Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: two-pass-cleaner.v6.txt > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: (was: two-pass-cleaner.v5.txt) > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675900#comment-16675900 ] Ted Yu commented on HBASE-21387: In two-pass-cleaner.v5.txt , the reference to previous round is changed to Set. > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v5.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). 
> Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: two-pass-cleaner.v5.txt > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v5.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675867#comment-16675867 ] Ted Yu commented on HBASE-21387: bq. holding onto file names in memory We don't need to continue referencing FileStatus from the previous pass. Path (or String) for each file would be sufficient. bq. where a snapshot was "orphaned" and prevent file cleaning from happening I think by "orphaned" you are talking about not just two iterations for cleaner chore but many iterations. In that case, the situation in the current code base would prevent cleaning hfiles referenced, as well. > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). 
> There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675805#comment-16675805 ] Ted Yu commented on HBASE-21387: Solving the in progress snapshot race condition is tricky. Please take a look at two-pass-cleaner.v4.txt where cleaner chore keeps track of the files deemed cleanable from previous iteration. Only files deemed cleanable from previous and current iterations would be deleted. This is a bigger change. > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. 
> Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
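A minimal sketch of the two-pass idea described in the comment above, assuming each cleaner iteration yields a collection of candidate paths. The names TwoPassCleanerSketch and filesToDelete are hypothetical and are not taken from two-pass-cleaner.v4.txt. Paths are kept as Strings, in line with the remark elsewhere in this thread that a Path (or String) per file would be sufficient.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration only; not the code in two-pass-cleaner.v4.txt. */
final class TwoPassCleanerSketch {
  // Paths (as Strings) that were deemed cleanable in the previous iteration.
  private Set<String> previousCandidates = new HashSet<>();

  /**
   * @param currentCandidates paths deemed cleanable in the current iteration
   * @return paths deemed cleanable in both the previous and the current iteration
   */
  Set<String> filesToDelete(Collection<String> currentCandidates) {
    Set<String> current = new HashSet<>(currentCandidates);

    // Only files seen as cleanable twice in a row are confirmed for deletion.
    Set<String> confirmed = new HashSet<>(current);
    confirmed.retainAll(previousCandidates);

    // Remember the remaining candidates for the next pass; confirmed files
    // are about to be deleted, so they do not need to be tracked further.
    current.removeAll(confirmed);
    previousCandidates = current;
    return confirmed;
  }
}
```

Under this sketch, a file that stops being a candidate between two passes (for example because a snapshot referencing it completed in the meantime) drops out of the tracked set and is not deleted.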
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: two-pass-cleaner.v4.txt > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, two-pass-cleaner.v4.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Status: Open (was: Patch Available) > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. 
So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675762#comment-16675762 ] Ted Yu commented on HBASE-21381: liubang: 2.7.x and 2.8.y should be added as supported hadoop releases since they were not affected by HADOOP-11794 > Document the hadoop versions using which backup and restore feature works > - > > Key: HBASE-21381 > URL: https://issues.apache.org/jira/browse/HBASE-21381 > Project: HBase > Issue Type: Task > Reporter: Ted Yu >Assignee: liubangchen >Priority: Major > Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch > > > HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally > tried to concatenate the files being DistCp'ed to target cluster (though the > files are independent). > Following is the log snippet of the failed concatenation attempt: > {code} > 2018-10-13 14:09:25,351 WARN [Thread-936] mapred.LocalJobRunner$Job(590): > job_local1795473782_0004 > java.io.IOException: Inconsistent sequence file: current chunk file > org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/ > > 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ > length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry > org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e- > > 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ > length = 5142 aclEntries = null, xAttrs = null} > at > org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276) > at > org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567) > {code} > Backup and Restore uses DistCp to transfer files 
between clusters. > Without the fix from HADOOP-15850, the transfer would fail. > This issue is to document the hadoop versions which contain HADOOP-15850 so > that user of Backup and Restore feature knows which hadoop versions they can > use. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675738#comment-16675738 ] Ted Yu commented on HBASE-21387: Thanks for giving the timeline, Josh. The scenario you described is the race condition I am solving with patch v8. > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). 
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
[ https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21438: --- Attachment: 21438.v1.txt > TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible > -- > > Key: HBASE-21438 > URL: https://issues.apache.org/jira/browse/HBASE-21438 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Attachments: 21438.v1.txt > > > From > https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1863/testReport/org.apache.hadoop.hbase.client/TestAdmin2/testGetProcedures/ > : > {code} > Mon Nov 05 04:52:13 UTC 2018, > RpcRetryingCaller{globalStartTime=1541393533029, pause=250, maxAttempts=7}, > org.apache.hadoop.hbase.procedure2.BadProcedureException: > org.apache.hadoop.hbase.procedure2.BadProcedureException: The procedure class > org.apache.hadoop.hbase.procedure2.FailedProcedure must be accessible and > have an empty constructor > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.validateClass(ProcedureUtil.java:82) > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProtoProcedure(ProcedureUtil.java:162) > at > org.apache.hadoop.hbase.master.MasterRpcServices.getProcedures(MasterRpcServices.java:1249) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
[ https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675573#comment-16675573 ] Ted Yu commented on HBASE-21438: Ran TestAdmin2 with patch which passed. > TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible > -- > > Key: HBASE-21438 > URL: https://issues.apache.org/jira/browse/HBASE-21438 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Attachments: 21438.v1.txt > > > From > https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1863/testReport/org.apache.hadoop.hbase.client/TestAdmin2/testGetProcedures/ > : > {code} > Mon Nov 05 04:52:13 UTC 2018, > RpcRetryingCaller{globalStartTime=1541393533029, pause=250, maxAttempts=7}, > org.apache.hadoop.hbase.procedure2.BadProcedureException: > org.apache.hadoop.hbase.procedure2.BadProcedureException: The procedure class > org.apache.hadoop.hbase.procedure2.FailedProcedure must be accessible and > have an empty constructor > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.validateClass(ProcedureUtil.java:82) > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProtoProcedure(ProcedureUtil.java:162) > at > org.apache.hadoop.hbase.master.MasterRpcServices.getProcedures(MasterRpcServices.java:1249) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
[ https://issues.apache.org/jira/browse/HBASE-21438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21438: --- Status: Patch Available (was: Open) > TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible > -- > > Key: HBASE-21438 > URL: https://issues.apache.org/jira/browse/HBASE-21438 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Attachments: 21438.v1.txt > > > From > https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1863/testReport/org.apache.hadoop.hbase.client/TestAdmin2/testGetProcedures/ > : > {code} > Mon Nov 05 04:52:13 UTC 2018, > RpcRetryingCaller{globalStartTime=1541393533029, pause=250, maxAttempts=7}, > org.apache.hadoop.hbase.procedure2.BadProcedureException: > org.apache.hadoop.hbase.procedure2.BadProcedureException: The procedure class > org.apache.hadoop.hbase.procedure2.FailedProcedure must be accessible and > have an empty constructor > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.validateClass(ProcedureUtil.java:82) > at > org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProtoProcedure(ProcedureUtil.java:162) > at > org.apache.hadoop.hbase.master.MasterRpcServices.getProcedures(MasterRpcServices.java:1249) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-21438) TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
Ted Yu created HBASE-21438:
--

Summary: TestAdmin2#testGetProcedures fails due to FailedProcedure inaccessible
Key: HBASE-21438
URL: https://issues.apache.org/jira/browse/HBASE-21438
Project: HBase
Issue Type: Bug
Reporter: Ted Yu
Assignee: Ted Yu

From https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1863/testReport/org.apache.hadoop.hbase.client/TestAdmin2/testGetProcedures/ :
{code}
Mon Nov 05 04:52:13 UTC 2018, RpcRetryingCaller{globalStartTime=1541393533029, pause=250, maxAttempts=7}, org.apache.hadoop.hbase.procedure2.BadProcedureException: org.apache.hadoop.hbase.procedure2.BadProcedureException: The procedure class org.apache.hadoop.hbase.procedure2.FailedProcedure must be accessible and have an empty constructor
  at org.apache.hadoop.hbase.procedure2.ProcedureUtil.validateClass(ProcedureUtil.java:82)
  at org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProtoProcedure(ProcedureUtil.java:162)
  at org.apache.hadoop.hbase.master.MasterRpcServices.getProcedures(MasterRpcServices.java:1249)
  at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
{code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
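The exception text says the procedure class "must be accessible and have an empty constructor". A minimal sketch of one way such a condition can be checked with plain Java reflection follows. It is an assumption about what the check amounts to, not the actual body of ProcedureUtil.validateClass; the class ProcedureClassCheckSketch and its nested example classes are hypothetical.

```java
import java.lang.reflect.Modifier;

/** Hypothetical illustration; not the real ProcedureUtil.validateClass. */
final class ProcedureClassCheckSketch {

  /** Returns true if clazz is public and has a public no-argument constructor. */
  static boolean isInstantiable(Class<?> clazz) {
    if (!Modifier.isPublic(clazz.getModifiers())) {
      return false; // class itself is not accessible from other packages
    }
    try {
      // getConstructor() only finds public constructors.
      clazz.getConstructor();
      return true;
    } catch (NoSuchMethodException e) {
      return false; // no public empty constructor
    }
  }

  // Example classes used only to exercise the check.
  public static class PublicWithEmptyCtor {
    public PublicWithEmptyCtor() {}
  }

  static class PackagePrivate {
    PackagePrivate() {}
  }

  public static class NoEmptyCtor {
    public NoEmptyCtor(int ignored) {}
  }
}
```

With a check of this kind, a class that is not public, or that lacks a public empty constructor, would be rejected, which is consistent with the failure reported for FailedProcedure.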
[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675530#comment-16675530 ] Ted Yu commented on HBASE-21381: lgtm > Document the hadoop versions using which backup and restore feature works > - > > Key: HBASE-21381 > URL: https://issues.apache.org/jira/browse/HBASE-21381 > Project: HBase > Issue Type: Task > Reporter: Ted Yu >Assignee: liubangchen >Priority: Major > Attachments: HBASE-21381-1.patch, HBASE-21381-2.patch > > > HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally > tried to concatenate the files being DistCp'ed to target cluster (though the > files are independent). > Following is the log snippet of the failed concatenation attempt: > {code} > 2018-10-13 14:09:25,351 WARN [Thread-936] mapred.LocalJobRunner$Job(590): > job_local1795473782_0004 > java.io.IOException: Inconsistent sequence file: current chunk file > org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/ > > 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ > length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry > org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e- > > 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ > length = 5142 aclEntries = null, xAttrs = null} > at > org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276) > at > org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567) > {code} > Backup and Restore uses DistCp to transfer files between clusters. > Without the fix from HADOOP-15850, the transfer would fail. 
> This issue is to document the hadoop versions which contain HADOOP-15850 so > that user of Backup and Restore feature knows which hadoop versions they can > use. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675528#comment-16675528 ] Ted Yu edited comment on HBASE-21246 at 11/5/18 6:20 PM: - In patch v24, I dropped the static methods from WALFactory - they are used in test code. I also removed the reference to AbstractFSWALProvider in WALFactory since the Reader creation is done by the provider. was (Author: yuzhih...@gmail.com): In patch v24, I dropped the static methods from WALFactory - they are used in test code. I also removed the reference to AbstractFSWALProvider since the Reader creation is done by the provider. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.24.txt, 21246.HBASE-20952.001.patch, > 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, > 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, > 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, > wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, > wal-splitter-writer.jpg > > > We are introducing the WALIdentity interface so that the WAL representation can > be decoupled from the distributed filesystem. > The interface provides a getName method whose return value can represent a > filename in a distributed filesystem environment or the name of the stream > when the WAL is backed by a log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675528#comment-16675528 ] Ted Yu commented on HBASE-21246: In patch v24, I dropped the static methods from WALFactory - they are used in test code. I also removed the reference to AbstractFSWALProvider since the Reader creation is done by the provider. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.24.txt, 21246.HBASE-20952.001.patch, > 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, > 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, > 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, > wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, > wal-splitter-writer.jpg > > > We are introducing the WALIdentity interface so that the WAL representation can > be decoupled from the distributed filesystem. > The interface provides a getName method whose return value can represent a > filename in a distributed filesystem environment or the name of the stream > when the WAL is backed by a log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21246: --- Attachment: 21246.24.txt > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.24.txt, 21246.HBASE-20952.001.patch, > 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, > 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, > 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, > wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, > wal-splitter-writer.jpg > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: [VOTE] Apache Ratis Thirdparty 0.1.0rc1
+1

On Sat, Nov 3, 2018 at 10:51 PM Tsz Wo Sze wrote:

> +1
>
> - Verified the signature and checksums.
> - Checked LICENSE, NOTICE and NOTICE in the gz and the jar files
> - Checked file names -- all files in
>   https://dist.apache.org/repos/dist/dev/incubator/ratis/thirdparty/0.1.0-rc1/
>   have "incubating".
>
> Thanks a lot, Josh!
> Tsz-Wo
>
> On Thu, Nov 1, 2018 at 2:20 AM Josh Elser wrote:
> >
> > Hi,
> >
> > Please vote on the following release candidate to become Apache Ratis
> > Thirdparty 0.1.0.
> >
> > The Apache Ratis Thirdparty project is a collection of all thirdparty
> > dependencies that Apache Ratis uses, repackaged for optimal use by
> > Ratis. As such, there is very little net-new source code in this project.
> >
> > Over rc0, this RC does:
> >
> > * Incubating in the file name
> > * make_rc.sh updates
> > * Includes a DISCLAIMER in generated jar files
> > * renames ratis-thirdparty to ratis-thirdparty-misc (not yet reflected
> >   in ratis.git)
> >
> > The source release is present at
> > https://dist.apache.org/repos/dist/dev/incubator/ratis/thirdparty/0.1.0-rc1/
> >
> > The SHA512 checksum of the source tarball is: 2B11B643 836E367E C47D0F64
> > 7750E1AB DE2C3FDE ECF2C825 5F8292FA D4CF2DB4 04BEB4B6
> > 12173754 4C6ACEEE B8534964 C6A4B690 EA9656E2 CAFDB317 FAAB46BA
> >
> > This source release was created from the Git commit SHA1:
> > 896f7b3453e155df96b8ef62b85aa0b92c37d886. For your convenience, there
> > is also a GPG-signed tag with the name "ratis-thirdparty-0.1.0rc1" that
> > also points at this commit.
> >
> > This source release was signed with my key: 4677D66C. This is present
> > in the KEYS file (dist/dev and dist/release).
> >
> > The corresponding "binaries" for this release are staged at
> > https://repository.apache.org/content/repositories/orgapacheratis-1008/
> > and will be promoted pending successful PPMC and IPMC votes. You can
> > update your local ~/.m2/settings.xml to add this as a repository to
> > test the build of ratis.git if you choose (appears to work fine for me).
> >
> > This vote will be open for at least 72 hours (until 2018/11/03 19:00:00
> > GMT).
> >
> > --
> >
> > Here's my +1 (non-binding)
> >
> > - Josh
[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675458#comment-16675458 ] Ted Yu commented on HBASE-21246: replication-src-creates-wal-reader.jpg shows how WAL Reader is created for replication source: ReplicationSource calls walProvider#getWalStream which returns WALEntryStream. AbstractWALEntryStream#createReader calls WALProvider#createReader wal-splitter-reader.jpg shows how WAL Reader is created for log splitting: WALSplitter#getReader calls WALProvider#createReader. Below WALProvider, AbstractFSWALProvider and DisabledWALProvider are shown which implement WALProvider interface. AsyncFSWALProvider and FSHLogProvider extend AbstractFSWALProvider wal-splitter-writer.jpg shows how WAL Writer is created for log splitting. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, > replication-src-creates-wal-reader.jpg, wal-factory-providers.png, > wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21246: --- Attachment: replication-src-creates-wal-reader.jpg
[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21246: --- Attachment: wal-splitter-writer.jpg wal-splitter-reader.jpg
[jira] [Updated] (HBASE-21247) Custom WAL Provider cannot be specified by configuration whose value is outside the enums in Providers
[ https://issues.apache.org/jira/browse/HBASE-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21247: --- Resolution: Fixed Status: Resolved (was: Patch Available) Thanks for the review, Sean and Josh. > Custom WAL Provider cannot be specified by configuration whose value is > outside the enums in Providers > -- > > Key: HBASE-21247 > URL: https://issues.apache.org/jira/browse/HBASE-21247 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: 3.0.0 > > Attachments: 21247.v1.txt, 21247.v10.txt, 21247.v11.txt, > 21247.v2.txt, 21247.v3.txt, 21247.v4.tst, 21247.v4.txt, 21247.v5.txt, > 21247.v6.txt, 21247.v7.txt, 21247.v8.txt, 21247.v9.txt > > > Currently all the WAL Providers acceptable to hbase are specified in > Providers enum of WALFactory. > This restricts the ability for additional WAL Providers to be supplied - by > class name. > This issue fixes the bug by allowing the specification of new WAL Provider > class name using the config "hbase.wal.provider". -- This message was sent by Atlassian JIRA (v7.6.3#76005)
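A rough Java sketch of the behaviour described in this issue: treat the value of "hbase.wal.provider" first as a built-in short name and, failing that, as a class name. The types DemoWalProvider, DemoFsProvider, DemoCustomProvider and ProviderResolver are hypothetical stand-ins, and the BuiltIn enum is an assumed subset, not the actual Providers enum in WALFactory.

```java
import java.util.Locale;

// Stand-in for the provider type; the real WALProvider interface is much larger.
interface DemoWalProvider {}

final class DemoFsProvider implements DemoWalProvider {}

final class DemoCustomProvider implements DemoWalProvider {}

final class ProviderResolver {
    // Hypothetical subset of built-in short names, not the actual Providers enum.
    enum BuiltIn {
        FILESYSTEM(DemoFsProvider.class);

        final Class<? extends DemoWalProvider> clazz;

        BuiltIn(Class<? extends DemoWalProvider> clazz) {
            this.clazz = clazz;
        }
    }

    // Resolve the value configured under "hbase.wal.provider".
    static Class<? extends DemoWalProvider> resolve(String configured) {
        try {
            // First try the built-in short names.
            return BuiltIn.valueOf(configured.toUpperCase(Locale.ROOT)).clazz;
        } catch (IllegalArgumentException notAnEnumName) {
            try {
                // Otherwise treat the value as a fully qualified class name.
                return Class.forName(configured).asSubclass(DemoWalProvider.class);
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("Unknown WAL provider: " + configured, e);
            }
        }
    }
}
```

Falling back to a class name keeps existing short names working while letting a deployment plug in a provider that the enum does not know about.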
[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674650#comment-16674650 ] Ted Yu commented on HBASE-21381: {code} 61 * 3.2.0+ 62 * 2.9.2+ 63 * 3.0.4+ {code} The hadoop versions are not sorted. Normally it is easier to find the hadoop version the user is deploying if the versions are in sorted order. > Document the hadoop versions using which backup and restore feature works > - > > Key: HBASE-21381 > URL: https://issues.apache.org/jira/browse/HBASE-21381 > Project: HBase > Issue Type: Task > Reporter: Ted Yu >Assignee: liubangchen >Priority: Major > Attachments: HBASE-21381-1.patch > > > HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally > tried to concatenate the files being DistCp'ed to target cluster (though the > files are independent). > Following is the log snippet of the failed concatenation attempt: > {code} > 2018-10-13 14:09:25,351 WARN [Thread-936] mapred.LocalJobRunner$Job(590): > job_local1795473782_0004 > java.io.IOException: Inconsistent sequence file: current chunk file > org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/ > > 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ > length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry > org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e- > > 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ > length = 5142 aclEntries = null, xAttrs = null} > at > org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276) > at > org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567) > 
{code} > Backup and Restore uses DistCp to transfer files between clusters. > Without the fix from HADOOP-15850, the transfer would fail. > This issue is to document the hadoop versions which contain HADOOP-15850 so > that user of Backup and Restore feature knows which hadoop versions they can > use. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
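For the review comment about ordering, a small Java sketch of sorting such version strings numerically rather than lexically. The class name VersionSort and its methods are made up for illustration; the three version strings are the ones quoted in the comment.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class VersionSort {
    // Parse "3.0.4+" into {3, 0, 4}; the trailing '+' is ignored.
    static int[] parse(String version) {
        String[] parts = version.replace("+", "").trim().split("\\.");
        int[] out = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Integer.parseInt(parts[i]);
        }
        return out;
    }

    // Compare component by component; missing components count as 0.
    static final Comparator<String> BY_VERSION = (a, b) -> {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? x[i] : 0;
            int yi = i < y.length ? y[i] : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    };

    // Return a sorted copy, leaving the input list untouched.
    static List<String> sorted(List<String> versions) {
        List<String> copy = new ArrayList<>(versions);
        copy.sort(BY_VERSION);
        return copy;
    }
}
```

Applied to the list quoted in the comment (3.2.0+, 2.9.2+, 3.0.4+), this yields 2.9.2+, 3.0.4+, 3.2.0+.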
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674432#comment-16674432 ] Ted Yu commented on HBASE-21387: Patch v8 adds a boolean, needToCheckInProgressSnapshots, to {{getUnreferencedFiles}} so that the comparison between namesInProgress and snapshotNamesInProgressFromCacheRefresh is only done once. Without the additional boolean, the comparison may be performed many times - once for each file where reference needs to be found out. > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. 
> Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
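The patch v8 comment can be illustrated with a simplified Java sketch. It is not the patch itself: the class UnreferencedFilesSketch, the method signature, and the comparisons counter are hypothetical. The flag name needToCheckInProgressSnapshots and the two views being compared are taken from the comment, and keeping files when the views disagree follows the patch v6 comment quoted further down in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class UnreferencedFilesSketch {
    // Counts how often the two views are compared; for illustration only.
    static int comparisons = 0;

    static List<String> getUnreferencedFiles(List<String> files,
                                             Set<String> referenced,
                                             Set<String> namesInProgress,
                                             Set<String> namesFromCacheRefresh) {
        List<String> unreferenced = new ArrayList<>();
        boolean needToCheckInProgressSnapshots = true;
        boolean viewsAgree = false;
        for (String file : files) {
            if (referenced.contains(file)) {
                continue; // still referenced by a completed snapshot
            }
            if (needToCheckInProgressSnapshots) {
                // Compare the two views once, not once per file.
                comparisons++;
                viewsAgree = namesInProgress.equals(namesFromCacheRefresh);
                needToCheckInProgressSnapshots = false;
            }
            if (viewsAgree) {
                unreferenced.add(file);
            }
            // On a discrepancy the file is kept for the current round.
        }
        return unreferenced;
    }
}
```

With many candidate files the set comparison runs a single time per call, which is the saving the comment describes.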
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: 21387.v8.txt
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: 21387.v7.txt
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673793#comment-16673793 ] Ted Yu commented on HBASE-21387: In patch v6, I try to detect discrepancy w.r.t. the number of in progress snapshots from the view of {{refreshCache}} versus from the view from {{getUnreferencedFiles}}. If there is discrepancy, keep the file(s) for the current round. See if this is easier to understand. Thanks
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: 21387.v6.txt
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673480#comment-16673480 ] Ted Yu commented on HBASE-21387: Currently refreshCache has void return type: {code} private synchronized void refreshCache() throws IOException { {code} One potential fix is for {{refreshCache}} to return the name of in progress snapshot. {{getUnreferencedFiles}} stores the returned in progress snapshot name and checks whether the name can be found when calling {{getSnapshotsInProgress}}. If the name no longer appears as in progress snapshot, {{getUnreferencedFiles}} can invoke {{refreshCache}} again.
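The potential fix in this comment can be sketched in Java as follows. This is a toy model under stated assumptions, not SnapshotFileCache: the class RefreshRetrySketch, the Supplier standing in for the listing of in-progress snapshots, the refreshCount field, and the method getUnreferencedFilesPrelude are all hypothetical. Only the idea (refreshCache returns what it saw, and the caller refreshes again if a name is no longer in progress) comes from the comment.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

final class RefreshRetrySketch {
    // Stand-in for listing the snapshots that are currently in progress.
    private final Supplier<Set<String>> inProgressSource;
    int refreshCount = 0;

    RefreshRetrySketch(Supplier<Set<String>> inProgressSource) {
        this.inProgressSource = inProgressSource;
    }

    // Unlike the void method quoted above, this variant returns the in-progress names it saw.
    synchronized Set<String> refreshCache() {
        refreshCount++;
        return new HashSet<>(inProgressSource.get());
    }

    Set<String> getSnapshotsInProgress() {
        return new HashSet<>(inProgressSource.get());
    }

    // The part of getUnreferencedFiles that decides whether a second refresh is needed.
    synchronized void getUnreferencedFilesPrelude() {
        Set<String> seenByRefresh = refreshCache();
        if (!getSnapshotsInProgress().containsAll(seenByRefresh)) {
            // A snapshot finished after the refresh, so pick up its files now.
            refreshCache();
        }
    }
}
```

A single retry narrows the window but does not close it entirely, since another snapshot could finish after the second refresh; the sketch only shows the shape of the idea.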
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673467#comment-16673467 ] Ted Yu commented on HBASE-21387: For the unit test, first idea is to use CountDownLatch to reproduce the race condition. Looking for a way to pass CountDownLatch between TakeSnapshotHandler and SnapshotFileCache.
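As a sketch of the CountDownLatch idea, the Java below forces the interleaving from the issue description with two latches. It does not touch TakeSnapshotHandler or SnapshotFileCache; the class LatchRaceSketch, the two sets, and the snapshot name "snap1" are hypothetical stand-ins for the cache contents and the in-progress listing.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

final class LatchRaceSketch {
    // Returns true when the interleaving leaves the snapshot's file looking unreferenced.
    static boolean reproduce() {
        Set<String> inProgress = ConcurrentHashMap.newKeySet();
        Set<String> completedInCache = ConcurrentHashMap.newKeySet();
        inProgress.add("snap1");

        CountDownLatch cacheRefreshed = new CountDownLatch(1);
        CountDownLatch snapshotCompleted = new CountDownLatch(1);

        Thread snapshot = new Thread(() -> {
            try {
                cacheRefreshed.await();         // wait until the refresh has run
                inProgress.remove("snap1");     // the snapshot finishes afterwards
                snapshotCompleted.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        snapshot.start();

        // Cleaner side: the refresh skips in-progress snapshots, so the cache stays empty.
        cacheRefreshed.countDown();
        try {
            snapshotCompleted.await();
            snapshot.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // Neither view references the snapshot now, so its files look unreferenced.
        return !completedInCache.contains("snap1") && !inProgress.contains("snap1");
    }
}
```

The latches make the ordering deterministic, which is what a unit test for the race would need; in a real test they would have to be injected into the two classes named in the comment.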
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: 21387.dbg.txt
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: (was: 21387.v1.txt)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673419#comment-16673419 ]
Ted Yu commented on HBASE-21387:
Josh, the race condition surrounding in-progress snapshots is described in the description of this JIRA. Let me try to:
* collect the relevant SnapshotFileCache log
* see if a unit test can be written to reproduce the race condition
> Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
> Issue Type: Bug
> Reporter: Ted Yu
> Assignee: Ted Yu
> Priority: Major
> Labels: snapshot
> Attachments: 21387.v1.txt, 21387.v2.txt, 21387.v3.txt
>
> A customer recently reported that ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table.
> {code}
> We found the following in the log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one is RefreshCacheTask#run and the other is SnapshotHFileCleaner.
> Let's look at this check in refreshCache:
> {code}
> if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> Its intention is to exclude in-progress snapshot(s).
> Suppose that when RefreshCacheTask runs refreshCache, there is an in-progress snapshot that is about to finish.
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that lastModifiedTime is up to date, so the cleaner proceeds to check the in-progress snapshot(s). However, the snapshot has completed by that time, so some file(s) are deemed unreferenced.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
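The interleaving described in the issue can be illustrated with a deliberately simplified, single-threaded model. This is only a sketch: the class `SnapshotCacheRaceModel`, its fields, and its methods are hypothetical stand-ins and do not reproduce the real SnapshotFileCache code. The model assumes that a refresh caches only files of completed snapshots and that the cleaner trusts the cache plus the current in-progress set.

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified, hypothetical model of the race; not the real SnapshotFileCache. */
class SnapshotCacheRaceModel {
  // Files referenced by completed snapshots, as of the last refresh.
  private final Set<String> cache = new HashSet<>();
  // Stand-ins for what is on the filesystem right now.
  final Set<String> completedOnDisk = new HashSet<>();
  final Set<String> inProgressOnDisk = new HashSet<>();

  /** Models the intent of refreshCache(): only completed snapshots are cached. */
  void refreshCache() {
    cache.clear();
    cache.addAll(completedOnDisk);
  }

  /** The snapshot finishes: its files move from in-progress to completed. */
  void completeSnapshot() {
    completedOnDisk.addAll(inProgressOnDisk);
    inProgressOnDisk.clear();
  }

  /** A file looks deletable if neither the (possibly stale) cache nor the in-progress set has it. */
  boolean isUnreferenced(String file) {
    return !cache.contains(file) && !inProgressOnDisk.contains(file);
  }

  public static void main(String[] args) {
    SnapshotCacheRaceModel model = new SnapshotCacheRaceModel();
    model.inProgressOnDisk.add("hfile-1");
    model.refreshCache();     // 1. refresh runs while the snapshot is still in progress
    model.completeSnapshot(); // 2. snapshot completes; the cache is not refreshed again
    // 3. cleaner consults the stale cache and the now-empty in-progress set
    System.out.println("stale cache, unreferenced? " + model.isUnreferenced("hfile-1"));
    model.refreshCache();     // a refresh after completion closes the gap
    System.out.println("fresh cache, unreferenced? " + model.isUnreferenced("hfile-1"));
  }
}
```

In this model the file is reported as unreferenced only in the window between the snapshot completing and the next refresh, which is the window the issue describes.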
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Description: During recent report from customer where ExportSnapshot failed: {code} 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table. {code} We found the following in log: {code} 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive {code} The root cause is race condition surrounding in progress snapshot(s) handling between refreshCache() and getUnreferencedFiles(). There are two callers of refreshCache: one from RefreshCacheTask#run and the other from SnapshotHFileCleaner. Let's look at the code of refreshCache: {code} if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { {code} whose intention is to exclude in progress snapshot(s). Suppose when the RefreshCacheTask runs refreshCache, there is some in progress snapshot (about to finish). When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that lastModifiedTime is up to date. So cleaner proceeds to check in progress snapshot(s). However, the snapshot has completed by that time, resulting in some file(s) deemed unreferenced. 
was: During recent report from customer where ExportSnapshot failed: {code} 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table. {code} We found the following in log: {code} 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive {code} The root cause is race condition surrounding in progress snapshot(s) handling between refreshCache() and getUnreferencedFiles(). There are two callers of refreshCache: one from RefreshCacheTask#run and the other from SnapshotHFileCleaner. Let's look at the code of refreshCache: {code} if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { {code} which only excludes the temp dir, but not in progress snapshot(s). Suppose when the RefreshCacheTask runs refreshCache, SnapshotDirectoryInfo for the in progress snapshot doesn't include all store file (leaving some hole in cache). When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that lastModifiedTime is up to date. So cleaner proceeds to check in progress snapshot(s). However, the snapshot has completed by that time, resulting in some file(s) deemed unreferenced. 
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Status: Open (was: Patch Available)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673171#comment-16673171 ]
Ted Yu commented on HBASE-21387:
From https://builds.apache.org/job/PreCommit-HBASE-Build/14932/console :
{code}
00:38:23 +1 overall
00:38:23
00:38:23 | Vote | Subsystem   | Runtime  | Comment
00:38:23 |  0   | reexec      | 0m 11s   | Docker mode activated.
00:38:23 |  0   | patch       | 0m 2s    | The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions.
00:38:23 |      |             |          | Prechecks
00:38:23 | +1   | hbaseanti   | 0m 0s    | Patch does not have any anti-patterns.
00:38:23 | +1   | @author     | 0m 0s    | The patch does not contain any @author tags.
00:38:23 | -0   | test4tests  | 0m 0s    | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
00:38:23 |      |             |          | master Compile Tests
00:38:23 | +1   | mvninstall  | 4m 49s   | master passed
00:38:23 | +1   | compile     | 1m 46s   | master passed
00:38:23 | +1   | checkstyle  | 1m 7s    | master passed
00:38:23 | +1   | shadedjars  | 4m 2s    | branch has no errors when building our shaded downstream artifacts.
00:38:23 | +1   | findbugs    | 2m 1s    | master passed
00:38:23 | +1   | javadoc     | 0m 30s   | master passed
00:38:23 |      |             |          | Patch Compile Tests
00:38:23 | +1   | mvninstall  | 4m 45s   | the patch passed
00:38:23 | +1   | compile     | 1m 50s   | the patch passed
00:38:23 | +1   | javac       | 1m 50s   | the patch passed
00:38:23 | +1   | checkstyle  | 1m 4s    | the patch passed
00:38:23 | +1   | whitespace  | 0m 0s    | The patch has no whitespace issues.
00:38:23 | +1   | shadedjars  | 4m 6s    | patch has no errors when building our shaded downstream artifacts.
00:38:24 | +1   | hadoopcheck | 9m 53s   | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0.
00:38:24 | +1   | findbugs    | 2m 11s   | the patch passed
00:38:24 | +1   | javadoc     | 0m 29s   | the patch passed
00:38:24 |      |             |          | Other Tests
00:38:24 | +1   | unit        | 128m 21s | hbase-server in the patch passed.
00:38:24 | +1   | asflicense  | 0m 25s   | The patch does not generate ASF License warnings.
00:38:24 |      |             | 168m 0s  |
00:38:24
00:38:24 || Subsystem || Report/Notes ||
00:38:24 | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
00:38:24 | JIRA Issue | HBASE-21387 |
00:38:24 | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12946617/21387.v3.txt |
{code}
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672436#comment-16672436 ] Ted Yu commented on HBASE-21387: {code} [ERROR] TestReplicationKillSlaveRSWithSeparateOldWALs.killOneSlaveRS » RetriesExhausted {code} Ran TestReplicationKillSlaveRSWithSeparateOldWALs with patch which passed. > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.v1.txt, 21387.v2.txt, 21387.v3.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > which only excludes the temp dir, but not in progress snapshot(s). 
> Suppose when the RefreshCacheTask runs refreshCache, SnapshotDirectoryInfo > for the in progress snapshot doesn't include all store file (leaving some > hole in cache). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670472#comment-16670472 ]
Ted Yu edited comment on HBASE-21246 at 11/2/18 12:27 AM:
wal-providers.png is a diagram of the class hierarchy between the WALProvider interface and the classes that implement it.
wal-factory-providers.png is a diagram of the WALProvider-related classes and the WALFactory class. It also shows WALIdentity (and FSWALIdentity), which replaces Path in the existing WAL APIs.
In the center of the upper half of the diagram is WALFactory, whose job is to create WALProvider instances. WALSplitter uses the WALProvider instance created by WALFactory to access the WAL.
WALSplitter previously referred to the file being split using FileStatus. Now it uses WALIdentity to refer to the entity being split.
Below WALIdentity is FSWALIdentity, which implements WALIdentity and represents a distributed-FileSystem-based identity (with a Path field).
To the left of WALFactory is the WALProvider interface. The interface is implemented by the following classes:
* RegionGroupingProvider
* AbstractFSWALProvider
* SyncReplicationWALProvider
* DisabledWALProvider
The AsyncFSWALProvider and FSHLogProvider classes build on top of (extend) AbstractFSWALProvider.
The refactored WAL API, as shown in these diagrams, illustrates how we abstract away from distributed-FileSystem-centric concepts.
was (Author: yuzhih...@gmail.com):
wal-providers.png is a diagram of the class hierarchy between the WALProvider interface and the classes that implement it.
wal-factory-providers.png is a diagram of the WALProvider-related classes and the WALFactory class. It also shows WALIdentity (and FSWALIdentity), which replaces Path in the existing WAL APIs.
In the center of the upper half of the diagram is WALFactory, whose job is to create WALProvider instances. WALSplitter uses the WALProvider instance created by WALFactory to access the WAL.
WALSplitter previously referred to the file being split using FileStatus. Now it uses WALIdentity to refer to the entity being split.
Below WALIdentity is FSWALIdentity, which implements WALIdentity and represents a distributed-FileSystem-based identity (with a Path field).
To the left of WALFactory is the WALProvider interface. The interface is implemented by the following classes:
* RegionGroupingProvider
* AbstractFSWALProvider
* SyncReplicationWALProvider
* DisabledWALProvider
The AsyncFSWALProvider class builds on top of (extends) AbstractFSWALProvider.
The refactored WAL API, as shown in these diagrams, illustrates how we abstract away from distributed-FileSystem-centric concepts.
> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Ted Yu
> Priority: Major
> Fix For: HBASE-20952
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, wal-factory-providers.png, wal-providers.png
>
> We are introducing the WALIdentity interface so that the WAL representation can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent the filename in a distributed filesystem environment or the name of the stream when the WAL is backed by a log stream.
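As a rough illustration of the decoupling described in this message, the relationship between WALIdentity and FSWALIdentity might look like the sketch below. Only the names `WALIdentity`, `FSWALIdentity`, and `getName` come from the issue text. Everything else is an assumption made for the example: `java.nio.file.Path` stands in for Hadoop's `Path` so the code is self-contained, and `StreamWALIdentity`, `WALIdentityDemo`, and `describe` are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical caller that depends only on the abstraction, never on a Path. */
class WALIdentityDemo {
  static String describe(WALIdentity id) {
    return "WAL[" + id.getName() + "]";
  }

  public static void main(String[] args) {
    System.out.println(describe(new FSWALIdentity(Paths.get("/hbase/WALs/rs1/wal.123"))));
    System.out.println(describe(new StreamWALIdentity("rs1-log-stream")));
  }
}

/** Sketch of the interface; only getName is taken from the issue description. */
interface WALIdentity {
  String getName();
}

/** Filesystem-backed identity. java.nio.file.Path is a stand-in for Hadoop's Path. */
final class FSWALIdentity implements WALIdentity {
  private final Path path;

  FSWALIdentity(Path path) {
    this.path = path;
  }

  Path getPath() {
    return path;
  }

  @Override
  public String getName() {
    // In a filesystem environment the name is the file name.
    return path.getFileName().toString();
  }
}

/** Hypothetical identity for a WAL backed by a log stream; not part of the patch. */
final class StreamWALIdentity implements WALIdentity {
  private final String streamName;

  StreamWALIdentity(String streamName) {
    this.streamName = streamName;
  }

  @Override
  public String getName() {
    return streamName;
  }
}
```

A consumer such as WALSplitter would then be written against `WALIdentity` alone, which is what allows a non-filesystem WAL implementation to be plugged in later.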
[jira] [Comment Edited] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670472#comment-16670472 ] Ted Yu edited comment on HBASE-21246 at 11/2/18 12:04 AM: -- wal-providers.png is diagram for class hierarchy between WALProvider interface and implementing WAL Provider classes. wal-factory-providers.png is diagram involving WALProvider related classes and WALFactory class. It also shows WALIdentity (and FSWALIdentity) which replaces Path in the existing WAL APIs. In the center of the upper half of the diagram is WALFactory whose functionality is to create WALProvider instances. WALSplitter uses the WALProvider instance created by WALFactory to access WAL. WALSplitter previously refers to the file being split using FileStatus. Now it uses WALIdentity to refer to the entity being split. Below WALIdentity is FSWALIdentity which implements WALIdentity and represents distributed FileSystem based identity (with Path field). To the left of WALFactory is the WALProvider interface. The interface is implemented by the following classes: * RegionGroupingProvider * AbstractFSWALProvider * SyncReplicationWALProvider * DisabledWALProvider The AsyncFSWALProvider class builds on top of (extends) AbstractFSWALProvider. The refactored WAL API, as shown in these diagrams, illustrate how we abstract from distributed FileSystem-centric concepts. was (Author: yuzhih...@gmail.com): wal-providers.png is diagram for class hierarchy between WALProvider interface and implementing WAL Provider classes. wal-factory-providers.png is diagram involving WALProvider related classes and WALFactory class. It also shows WALIdentity (and FSWALIdentity) which replaces Path in the existing WAL APIs. In the center of the upper half of the diagram is WALFactory whose functionality is to create WALProvider instances. WALSplitter uses the WALProvider instance created by WALFactory to access WAL. To the left of WALFactory is the WALProvider interface. 
The interface is implemented by the following classes: * RegionGroupingProvider * AbstractFSWALProvider * SyncReplicationWALProvider * DisabledWALProvider The AsyncFSWALProvider class builds on top of (extends) AbstractFSWALProvider. The refactored WAL API, as shown in these diagrams, illustrate how we abstract from distributed FileSystem-centric concepts. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, > wal-factory-providers.png, wal-providers.png > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670472#comment-16670472 ] Ted Yu edited comment on HBASE-21246 at 11/1/18 11:58 PM: -- wal-providers.png is diagram for class hierarchy between WALProvider interface and implementing WAL Provider classes. wal-factory-providers.png is diagram involving WALProvider related classes and WALFactory class. It also shows WALIdentity (and FSWALIdentity) which replaces Path in the existing WAL APIs. In the center of the upper half of the diagram is WALFactory whose functionality is to create WALProvider instances. WALSplitter uses the WALProvider instance created by WALFactory to access WAL. To the left of WALFactory is the WALProvider interface. The interface is implemented by the following classes: * RegionGroupingProvider * AbstractFSWALProvider * SyncReplicationWALProvider * DisabledWALProvider The AsyncFSWALProvider class builds on top of (extends) AbstractFSWALProvider. The refactored WAL API, as shown in these diagrams, illustrate how we abstract from distributed FileSystem-centric concepts. was (Author: yuzhih...@gmail.com): wal-providers.png is diagram for class hierarchy between WALProvider interface and implementing WAL Provider classes. wal-factory-providers.png is diagram involving WALProvider related classes and WALFactory class. It also shows WALIdentity (and FSWALIdentity) which replaces Path in the existing WAL APIs. The refactored WAL API, as shown in these diagrams, illustrate how we abstract from distributed FileSystem-centric concepts. 
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672271#comment-16672271 ] Ted Yu commented on HBASE-21387: You're right - with the Filter in place, the check is not needed.
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: 21387.v3.txt
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672246#comment-16672246 ] Ted Yu commented on HBASE-21387: Current code would include in progress snapshot(s): {code} FileStatus[] snapshots = FSUtils.listStatus(fs, snapshotDir); {code} With proposed change, no in progress snapshot would be included.
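The idea of filtering at listing time, rather than checking each name afterwards, can be sketched with the JDK's own directory API. This is not the patch itself: it uses `java.nio.file` instead of Hadoop's FileSystem, the class `CompletedSnapshotLister` is hypothetical, and the constant `TMP_DIR_NAME` with the value ".tmp" is an assumed stand-in for `SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME`.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: list snapshot directories, skipping the in-progress working dir. */
class CompletedSnapshotLister {
  // Assumed name of the working directory that holds in-progress snapshots.
  static final String TMP_DIR_NAME = ".tmp";

  static List<String> listCompleted(Path snapshotDir) throws IOException {
    // The filter is applied while listing, so callers never see the tmp dir.
    DirectoryStream.Filter<Path> notTmp =
        entry -> !entry.getFileName().toString().equals(TMP_DIR_NAME);
    List<String> names = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(snapshotDir, notTmp)) {
      for (Path entry : stream) {
        names.add(entry.getFileName().toString());
      }
    }
    Collections.sort(names); // directory iteration order is unspecified
    return names;
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("snapshots");
    Files.createDirectory(root.resolve("snap1"));
    Files.createDirectory(root.resolve(TMP_DIR_NAME));
    System.out.println(listCompleted(root));
  }
}
```

With the filter applied inside the listing call, a later per-entry name check becomes redundant, which matches the observation in the thread that the check is not needed once the Filter is in place.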
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Summary: Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files (was: Race condition in snapshot cache refreshing leads to loss of snapshot files)
[jira] [Updated] (HBASE-21387) Race condition in snapshot cache refreshing leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Attachment: 21387.v2.txt > Race condition in snapshot cache refreshing leads to loss of snapshot files > --- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.v1.txt, 21387.v2.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > which only excludes the temp dir, but not in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, SnapshotDirectoryInfo > for the in progress snapshot doesn't include all store file (leaving some > hole in cache). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. 
So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition in snapshot cache refreshing leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21387: --- Description: During recent report from customer where ExportSnapshot failed: {code} 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table. {code} We found the following in log: {code} 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive {code} The root cause is race condition surrounding in progress snapshot(s) handling between refreshCache() and getUnreferencedFiles(). There are two callers of refreshCache: one from RefreshCacheTask#run and the other from SnapshotHFileCleaner. Let's look at the code of refreshCache: {code} if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { {code} which only excludes the temp dir, but not in progress snapshot(s). Suppose when the RefreshCacheTask runs refreshCache, SnapshotDirectoryInfo for the in progress snapshot doesn't include all store file (leaving some hole in cache). When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that lastModifiedTime is up to date. So cleaner proceeds to check in progress snapshot(s). However, the snapshot has completed by that time, resulting in some file(s) deemed unreferenced. 
was: During recent report from customer where ExportSnapshot failed: {code} 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table. {code} We found the following in log: {code} 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive {code} The root cause is race condition surrounding SnapshotFileCache#refreshCache(). There are two callers of refreshCache: one from RefreshCacheTask#run and the other from SnapshotHFileCleaner. Let's look at the code of refreshCache: {code} // if the snapshot directory wasn't modified since we last check, we are done if (dirStatus.getModificationTime() <= this.lastModifiedTime) return; // 1. update the modified time this.lastModifiedTime = dirStatus.getModificationTime(); // 2.clear the cache this.cache.clear(); {code} Suppose the RefreshCacheTask runs past the if check and sets this.lastModifiedTime The cleaner executes refreshCache and returns immediately since this.lastModifiedTime matches the modification time of the directory. Now RefreshCacheTask clears the cache. By the time the cleaner performs cache lookup, the cache is empty. Therefore cleaner puts the file into unReferencedFiles - leading to data loss. 
> Race condition in snapshot cache refreshing leads to loss of snapshot files > --- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.v1.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from Refres
[jira] [Commented] (HBASE-21387) Race condition in snapshot cache refreshing leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672059#comment-16672059 ] Ted Yu commented on HBASE-21387: Thanks for the review, Josh. The cache was introduced by HBASE-6865. Let me dig some more in order to better assess the relationship between the callers of refreshCache(). Meanwhile, I was looking at another aspect - in progress snapshot(s). Note the existing check: {code} if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { {code} which only excludes the temp dir, but not in progress snapshot(s). I think something such as the following would be more appropriate : {code} diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/ index 358b4ea..c303667 100644 --- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java @@ -232,7 +232,8 @@ public class SnapshotFileCache implements Stoppable { Map known = new HashMap<>(); // 3. 
check each of the snapshot directories -FileStatus[] snapshots = FSUtils.listStatus(fs, snapshotDir); +FileStatus[] snapshots = fs.listStatus(snapshotDir, +new SnapshotDescriptionUtils.CompletedSnaphotDirectoriesFilter(fs)); if (snapshots == null) { // remove all the remembered snapshots because we don't have any left if (LOG.isDebugEnabled() && this.snapshots.size() > 0) { {code} > Race condition in snapshot cache refreshing leads to loss of snapshot files > --- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.v1.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding SnapshotFileCache#refreshCache(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > // if the snapshot directory wasn't modified since we last check, we are > done > if (dirStatus.getModificationTime() <= this.lastModifiedTime) return; > // 1. 
update the modified time > this.lastModifiedTime = dirStatus.getModificationTime(); > // 2.clear the cache > this.cache.clear(); > {code} > Suppose the RefreshCacheTask runs past the if check and sets > this.lastModifiedTime > The cleaner executes refreshCache and returns immediately since > this.lastModifiedTime matches the modification time of the directory. > Now RefreshCacheTask clears the cache. By the time the cleaner performs cache > lookup, the cache is empty. > Therefore cleaner puts the file into unReferencedFiles - leading to data loss. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
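The interleaving in the quoted description (timestamp updated first, cache cleared later, lookup in between) can be modeled in isolation. Below is a minimal sketch, assuming the refresh and the lookup share one lock and the timestamp is published only together with the rebuilt contents. The class and method names are hypothetical, and this is not the patch attached to the issue.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of a file-name cache guarded by one lock.
final class SnapshotCacheSketch {
  private final Object lock = new Object();
  private Set<String> cache = new HashSet<>();
  private long lastModifiedTime = -1L;

  // Rebuilds the cache only if the directory changed since the last refresh.
  void refresh(long dirModificationTime, Set<String> filesOnDisk) {
    synchronized (lock) {
      if (dirModificationTime <= lastModifiedTime) {
        return; // nothing new since the last refresh
      }
      // Build the replacement first, then publish contents and timestamp
      // together, so no reader sees a new timestamp with an empty cache.
      Set<String> rebuilt = new HashSet<>(filesOnDisk);
      cache = rebuilt;
      lastModifiedTime = dirModificationTime;
    }
  }

  // Lookup takes the same lock as refresh.
  boolean isReferenced(String fileName) {
    synchronized (lock) {
      return cache.contains(fileName);
    }
  }

  public static void main(String[] args) {
    SnapshotCacheSketch c = new SnapshotCacheSketch();
    Set<String> files = new HashSet<>();
    files.add("hfile-a");
    c.refresh(10L, files);
    System.out.println(c.isReferenced("hfile-a"));
  }
}
```

A caller that treats "not in cache" as "safe to delete" still has to account for snapshots that are in progress; the sketch covers only the timestamp and contents ordering.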
[jira] [Commented] (SOLR-7381) Improve Debuggability of SolrCloud using MDC
[ https://issues.apache.org/jira/browse/SOLR-7381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671925#comment-16671925 ] Ted Yu commented on SOLR-7381: -- I was looking at MDCAwareFixedThreadPool and found this JIRA. I wonder if what was stated here is relevant: http://ashtonkemerling.com/blog/2017/09/01/mdc-and-threadpools/ Thanks > Improve Debuggability of SolrCloud using MDC > > > Key: SOLR-7381 > URL: https://issues.apache.org/jira/browse/SOLR-7381 > Project: Solr > Issue Type: Improvement > Components: SolrCloud >Reporter: Shalin Shekhar Mangar >Assignee: Shalin Shekhar Mangar >Priority: Critical > Fix For: 5.2, 6.0 > > Attachments: SOLR-7381-forbid-threadpoolexecutor.patch, > SOLR-7381-submitter-stacktrace.patch, SOLR-7381-thread-names.patch, > SOLR-7381-thread-names.patch, SOLR-7381-thread-names.patch, SOLR-7381.patch, > SOLR-7381.patch > > > SOLR-6673 added MDC based logging in a few places but we have a lot of ground > to cover. > # Threads created via thread pool executors do not inherit MDC values and > those are some of the most interesting places to log MDC context. > # We must expose node names (in tests) so that we can debug faster > # We can expose more information via thread names so that a thread dump has > enough context to help debug problems in production > This is critical to help debug SolrCloud failures. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
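Point 1 of the description (pool threads do not inherit MDC values) can be illustrated without a logging dependency. Below is a minimal sketch in which a plain ThreadLocal map stands in for the MDC. All names are hypothetical, and it is not meant to describe how Solr's executors are implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ContextPropagationSketch {
  // Stand-in for a logging MDC: a per-thread map of context values.
  static final ThreadLocal<Map<String, String>> CONTEXT =
      ThreadLocal.withInitial(HashMap::new);

  // Captures the submitter's context at wrap time and installs it in the
  // worker thread for the duration of the task.
  static Runnable withSubmitterContext(Runnable task) {
    Map<String, String> captured = new HashMap<>(CONTEXT.get());
    return () -> {
      Map<String, String> previous = CONTEXT.get();
      CONTEXT.set(captured);
      try {
        task.run();
      } finally {
        CONTEXT.set(previous); // pooled threads are reused, so restore
      }
    };
  }

  // Returns the "node" value the worker thread observed.
  static String nodeSeenByWorker(boolean propagate) {
    CONTEXT.get().put("node", "node1");
    ExecutorService pool = Executors.newFixedThreadPool(1);
    try {
      String[] seen = new String[1];
      Runnable task = () -> seen[0] = CONTEXT.get().get("node");
      Runnable toSubmit = propagate ? withSubmitterContext(task) : task;
      pool.submit(toSubmit).get(); // get() makes the write to seen[0] visible
      return seen[0];
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("without wrapper: " + nodeSeenByWorker(false));
    System.out.println("with wrapper:    " + nodeSeenByWorker(true));
  }
}
```

Restoring the previous map in the finally block matters because a pooled thread outlives the task and would otherwise carry one task's context into the next.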
[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671880#comment-16671880 ] Ted Yu commented on HBASE-21246: bq. Why as attachments to the issue and not integrated into design doc? Due to the size of wal-factory-providers.png , it is not suitable to be embedded in design doc. Links to these two diagrams, along with image of wal-providers.png, have been added to page 8 of design doc. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, > wal-factory-providers.png, wal-providers.png > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670472#comment-16670472 ] Ted Yu edited comment on HBASE-21246 at 11/1/18 5:03 PM: - wal-providers.png is diagram for class hierarchy between WALProvider interface and implementing WAL Provider classes. wal-factory-providers.png is diagram involving WALProvider related classes and WALFactory class. It also shows WALIdentity (and FSWALIdentity) which replaces Path in the existing WAL APIs. The refactored WAL API, as shown in these diagrams, illustrate how we abstract from distributed FileSystem-centric concepts. was (Author: yuzhih...@gmail.com): wal-providers.png is diagram for WALProvider related classes. wal-factory-providers.png is diagram involving WALProvider related classes and WALFactory class. It also shows WALIdentity (and FSWALIdentity) which replaces Path in the existing WAL APIs. The refactored WAL API, as shown in these diagrams, illustrate how we abstract from distributed FileSystem-centric concepts. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, > wal-factory-providers.png, wal-providers.png > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670472#comment-16670472 ] Ted Yu edited comment on HBASE-21246 at 11/1/18 4:26 PM: - wal-providers.png is diagram for WALProvider related classes. wal-factory-providers.png is diagram involving WALProvider related classes and WALFactory class. It also shows WALIdentity (and FSWALIdentity) which replaces Path in the existing WAL APIs. The refactored WAL API, as shown in these diagrams, illustrate how we abstract from distributed FileSystem-centric concepts. was (Author: yuzhih...@gmail.com): wal-providers.png is diagram for WALProvider related classes. wal-factory-providers.png is diagram involving WALProvider related classes and WALFactory class. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, > wal-factory-providers.png, wal-providers.png > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
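The decoupling described in the issue can be sketched as follows. Only the getName method is taken from the issue text; the interface and class names below are hypothetical, and java.nio.file.Path is used here as a stand-in for the Hadoop Path type, so this is an illustration rather than the API in the patch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical shape: a WAL is identified by a name, not by a filesystem path.
interface WalIdentitySketch {
  String getName();
}

// File-backed identity: the name is the file name of the underlying path.
final class FileWalIdentitySketch implements WalIdentitySketch {
  private final Path path;

  FileWalIdentitySketch(Path path) {
    this.path = path;
  }

  @Override
  public String getName() {
    return path.getFileName().toString();
  }
}

// Stream-backed identity: the name is the name of the log stream.
final class StreamWalIdentitySketch implements WalIdentitySketch {
  private final String streamName;

  StreamWalIdentitySketch(String streamName) {
    this.streamName = streamName;
  }

  @Override
  public String getName() {
    return streamName;
  }
}

final class WalIdentityDemo {
  // A consumer written against the interface never touches Path.
  static String describe(WalIdentitySketch id) {
    return "wal:" + id.getName();
  }

  public static void main(String[] args) {
    System.out.println(describe(new FileWalIdentitySketch(Paths.get("wals", "rs1", "wal.0001"))));
    System.out.println(describe(new StreamWalIdentitySketch("rs1-stream")));
  }
}
```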
[jira] [Commented] (HBASE-21418) Reduce a number of reseek operations in MemstoreScanner when seek point is close to the current row.
[ https://issues.apache.org/jira/browse/HBASE-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671750#comment-16671750 ] Ted Yu commented on HBASE-21418: For the new test, I ran it without the rest of the patch: {code} Running org.apache.hadoop.hbase.client.TestLookAheadBeforeReseek Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 24.647 sec - in org.apache.hadoop.hbase.client.TestLookAheadBeforeReseek {code} What is TestLookAheadBeforeReseek supposed to show without the fix ? > Reduce a number of reseek operations in MemstoreScanner when seek point is > close to the current row. > > > Key: HBASE-21418 > URL: https://issues.apache.org/jira/browse/HBASE-21418 > Project: HBase > Issue Type: Improvement > Components: scan, Scanners >Affects Versions: 1.2.5 >Reporter: Jeongdae Kim >Assignee: Jeongdae Kim >Priority: Minor > Labels: performance > Attachments: HBASE-21418.branch-1.2.001.patch > > > We observed “responseTooSlow” logs for Get requests in our production > clusters. even some get requests were responded after 10 seconds. > Affected get requests were done with the timerange, and target rows have many > columns that have some versions. > We reproduced this issue, and found this behavior happens only when scanning > in the memstore. after flushing the HStore, this slow response issue for Get > disappeared and all same get requests are responded very quickly. > > We investigated this case, and found this performance difference between > memstore scanner and hfile scanner is caused by the number of reseek > operations executed while scanning. When a store scanner needs to reseek the > next column, Hfile scanner wisely decide whether it have to reseek or not by > checking the seek point is in current block, whereas memstore scanner just do > reseek without decision unlike Hfile scanner. 
In our case, almost all columns > in the memstore have older timestamp than scan(get)’s timerange, and so many > reseek operations occur as much as about the number of columns. This results > in increasing the response time of Get requests sporadically. > > To improve the reseek operation of the memstore scanner, i think it’s better > skipping than seeking when reseek requested, if seek point is quite close to > current cell that the scanner is pointing now.(Actually, i changed > MatchCode.SEEK_NEXT_COL to MatchCode.Skip in our case, and the response time > of Get was 6x faster than before) But we can’t decide whether seek point is > close to the current cell or not, because memstore scannner has no > information such as next block index. > Before HBASE-13109, Scan.HINT_LOOKAHEAD was introduced to handle like this > case, and it may be deprecated someday. But, i think that hint is still be > useful for the memstore scanner to try to skip first, before reseeking, and > with this option we can make reseek operations of memstore scanner smarter. > > I tested this patch in our case, and got the same result as i changed > matchcode (mentioned above). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
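The "skip first, then reseek" idea in the description can be sketched on a plain sorted set. Integers stand in for cells, and the lookAhead parameter plays the role of the hint. All names are hypothetical, and this is not the code in the attached patch.

```java
import java.util.Iterator;
import java.util.NavigableSet;
import java.util.TreeSet;

final class LookAheadScannerSketch {
  private final NavigableSet<Integer> cells;
  private Iterator<Integer> it;
  private Integer current;
  int seeks = 0; // number of full seeks performed so far

  LookAheadScannerSketch(NavigableSet<Integer> cells) {
    this.cells = cells;
    this.it = cells.iterator();
    this.current = it.hasNext() ? it.next() : null;
  }

  Integer current() {
    return current;
  }

  // Moves to the first cell >= target. Tries up to lookAhead cheap steps
  // forward first and falls back to a full seek only if that is not enough.
  void reseek(int target, int lookAhead) {
    for (int i = 0; i < lookAhead && current != null && current < target; i++) {
      current = it.hasNext() ? it.next() : null;
    }
    if (current != null && current < target) {
      seeks++;
      it = cells.tailSet(target, true).iterator();
      current = it.hasNext() ? it.next() : null;
    }
  }

  public static void main(String[] args) {
    NavigableSet<Integer> cells = new TreeSet<>();
    for (int i = 1; i <= 100; i++) {
      cells.add(i);
    }
    LookAheadScannerSketch s = new LookAheadScannerSketch(cells);
    s.reseek(3, 5);  // close target: reached by skipping
    s.reseek(90, 5); // far target: needs a real seek
    System.out.println("current=" + s.current() + " seeks=" + s.seeks);
  }
}
```

The trade-off is the usual one for such a hint: a larger lookAhead avoids more seeks for nearby targets but wastes steps when the target is far away.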
[jira] [Commented] (HBASE-21387) Race condition in snapshot cache refreshing leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671033#comment-16671033 ] Ted Yu commented on HBASE-21387: In the old code, to simulate the race condition, we can use CountDownLatch. Here is a sketch: {code} diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/ index 358b4ea..2941400 100644 --- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java @@ -27,6 +27,7 @@ import java.util.Map; import java.util.Set; import java.util.Timer; import java.util.TimerTask; +import java.util.concurrent.CountDownLatch; import java.util.concurrent.locks.ReentrantLock; import org.apache.hadoop.conf.Configuration; @@ -92,6 +93,8 @@ public class SnapshotFileCache implements Stoppable { private final SnapshotFileInspector fileInspector; private final Path snapshotDir; private final Set cache = new HashSet<>(); + private final CountDownLatch latchRefresh = new CountDownLatch(1); + private final CountDownLatch latchContains = new CountDownLatch(1); /** * This is a helper map of information about the snapshot directories so we don't need to rescan * them if they haven't changed since the last time we looked. @@ -180,16 +183,18 @@ public class SnapshotFileCache implements Stoppable { // cache, but that seems overkill at the moment and isn't necessarily a bottleneck. 
public synchronized Iterable getUnreferencedFiles(Iterable files, final SnapshotManager snapshotManager) - throws IOException { + throws IOException, InterruptedException { List unReferencedFiles = Lists.newArrayList(); List snapshotsInProgress = null; boolean refreshed = false; for (FileStatus file : files) { String fileName = file.getPath().getName(); if (!refreshed && !cache.contains(fileName)) { +latchRefresh.await(); refreshCache(); refreshed = true; } + latchContains.await(); if (cache.contains(fileName)) { continue; } @@ -226,9 +231,11 @@ public class SnapshotFileCache implements Stoppable { // 1. update the modified time this.lastModifiedTime = dirStatus.getModificationTime(); +latchRefresh.countDown(); // 2.clear the cache this.cache.clear(); +latchContains.countDown(); Map known = new HashMap<>(); // 3. check each of the snapshot directories {code} With the fix, the race condition is gone. > Race condition in snapshot cache refreshing leads to loss of snapshot files > --- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Attachments: 21387.v1.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. 
> {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding SnapshotFileCache#refreshCache(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > // if the snapshot directory wasn't modified since we last check, we are > done > if (dirStatus.getModificationTime() <= this.lastModifiedTime) return; > // 1. update the modified time > this.lastModifiedTime = dirStatus.getModificationTime(); > // 2.clear the cache > this.cache.clear(); > {code} > Suppose the RefreshCacheTask runs past the if check and sets > this.lastModifiedTime > The cleaner executes refreshCache and returns immediately since > this.lastModifiedTime matches the modification time of the directory. > Now RefreshCacheTask clears the cache. By th
[jira] [Created] (SOLR-12950) Consolidate the comparator in IndexSizeTrigger#run
Ted Yu created SOLR-12950: - Summary: Consolidate the comparator in IndexSizeTrigger#run Key: SOLR-12950 URL: https://issues.apache.org/jira/browse/SOLR-12950 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Reporter: Ted Yu Currently IndexSizeTrigger#run uses two comparators for sorting. Both retrieve DOCS_SIZE_PROP from the replica but sort in different orders. It seems defining one comparator should be enough; the other can be expressed as Collections.reverseOrder of the first one. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
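The suggestion can be sketched as follows. ReplicaSize is a hypothetical stand-in for a replica, and its docsSize field plays the role of the DOCS_SIZE_PROP value. The real code in IndexSizeTrigger#run is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a replica; only the size matters here.
final class ReplicaSize {
  final String name;
  final long docsSize;

  ReplicaSize(String name, long docsSize) {
    this.name = name;
    this.docsSize = docsSize;
  }
}

final class ComparatorSketch {
  // The single hand-written comparator: ascending by size.
  static final Comparator<ReplicaSize> BY_SIZE_ASC =
      Comparator.comparingLong(r -> r.docsSize);

  // The second ordering is derived from the first instead of being rewritten.
  static final Comparator<ReplicaSize> BY_SIZE_DESC =
      Collections.reverseOrder(BY_SIZE_ASC);

  // Returns replica names sorted by the given comparator; input is not modified.
  static List<String> sortedNames(List<ReplicaSize> in, Comparator<ReplicaSize> cmp) {
    List<ReplicaSize> copy = new ArrayList<>(in);
    copy.sort(cmp);
    List<String> names = new ArrayList<>();
    for (ReplicaSize r : copy) {
      names.add(r.name);
    }
    return names;
  }

  public static void main(String[] args) {
    List<ReplicaSize> in = Arrays.asList(
        new ReplicaSize("a", 30), new ReplicaSize("b", 10), new ReplicaSize("c", 20));
    System.out.println(sortedNames(in, BY_SIZE_ASC));
    System.out.println(sortedNames(in, BY_SIZE_DESC));
  }
}
```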
[jira] [Created] (SOLR-12949) metricTags Map in IndexSizeTrigger#run can be created outside the for loop
Ted Yu created SOLR-12949: - Summary: metricTags Map in IndexSizeTrigger#run can be created outside the for loop Key: SOLR-12949 URL: https://issues.apache.org/jira/browse/SOLR-12949 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Reporter: Ted Yu {code} for (String node : clusterState.getLiveNodes()) { Map metricTags = new HashMap<>(); {code} The metricTags Map can be created outside the for loop. At the beginning of each iteration, metricTags Map should be cleared. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
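A minimal sketch of the proposed shape, with a hypothetical list of node names and a made-up tag in place of the real cluster state and metric tags:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MetricTagsSketch {
  // Returns, per node, how many tags the map held during that iteration.
  static List<Integer> tagCounts(List<String> liveNodes) {
    List<Integer> counts = new ArrayList<>();
    Map<String, String> metricTags = new HashMap<>(); // allocated once
    for (String node : liveNodes) {
      metricTags.clear(); // drop entries left over from the previous node
      metricTags.put("tag-for-" + node, node); // hypothetical tag
      counts.add(metricTags.size());
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(tagCounts(Arrays.asList("n1", "n2", "n3")));
  }
}
```

Reusing the map is only safe if nothing called inside the loop keeps a reference to it after the iteration ends; otherwise the clear() would change data another component still holds.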
[jira] [Created] (HBASE-21416) Intermittent TestRegionInfoDisplay failure due to shift in relTime of RegionState#toDescriptiveString
Ted Yu created HBASE-21416: -- Summary: Intermittent TestRegionInfoDisplay failure due to shift in relTime of RegionState#toDescriptiveString Key: HBASE-21416 URL: https://issues.apache.org/jira/browse/HBASE-21416 Project: HBase Issue Type: Test Reporter: Ted Yu From https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-2.1/1799/testReport/junit/org.apache.hadoop.hbase.client/TestRegionInfoDisplay/testRegionDetailsForDisplay/ : {code} org.junit.ComparisonFailure: expected:<...:30 UTC 2018 (PT0.00[6]S ago), server=null> but was:<...:30 UTC 2018 (PT0.00[7]S ago), server=null> at org.apache.hadoop.hbase.client.TestRegionInfoDisplay.testRegionDetailsForDisplay(TestRegionInfoDisplay.java:78) {code} Here is how toDescriptiveString composes relTime: {code} long relTime = System.currentTimeMillis() - stamp; {code} In the test, state.toDescriptiveString() is called twice for the assertion. In the run above, the two calls saw different return values from System.currentTimeMillis() (a relTime of 6 ms versus 7 ms), which caused the assertion to fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
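The failure mode can be reproduced in isolation. Below is a minimal sketch, assuming a hypothetical class whose clock is injectable. RegionState itself is not shown, and this is not presented as the fix chosen for the issue. With a fixed clock two calls agree; with a clock that advances between calls they differ, which is the situation the test ran into.

```java
import java.util.function.LongSupplier;

// Hypothetical stand-in for an object whose description embeds a relative time.
final class DescribedState {
  private final long stamp;
  private final LongSupplier clock; // hypothetical injection point

  DescribedState(long stamp, LongSupplier clock) {
    this.stamp = stamp;
    this.clock = clock;
  }

  String toDescriptiveString() {
    long relTime = clock.getAsLong() - stamp; // same computation as in the issue
    return "ts=" + stamp + " (" + relTime + "ms ago)";
  }

  public static void main(String[] args) {
    DescribedState fixed = new DescribedState(990L, () -> 1000L);
    // Same clock reading for both calls, so the two strings are equal.
    System.out.println(fixed.toDescriptiveString().equals(fixed.toDescriptiveString()));

    long[] now = {1000L};
    DescribedState ticking = new DescribedState(990L, () -> now[0]++);
    // The clock advances between calls, so the two strings differ.
    System.out.println(ticking.toDescriptiveString().equals(ticking.toDescriptiveString()));
  }
}
```

Without changing the class under test, the same effect is available by calling the method once and comparing the stored result against an expected string that does not depend on the current time.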
[jira] [Assigned] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu reassigned HBASE-21381: -- Assignee: (was: Ted Yu) > Document the hadoop versions using which backup and restore feature works > - > > Key: HBASE-21381 > URL: https://issues.apache.org/jira/browse/HBASE-21381 > Project: HBase > Issue Type: Task > Reporter: Ted Yu >Priority: Major > > HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally > tried to concatenate the files being DistCp'ed to target cluster (though the > files are independent). > Following is the log snippet of the failed concatenation attempt: > {code} > 2018-10-13 14:09:25,351 WARN [Thread-936] mapred.LocalJobRunner$Job(590): > job_local1795473782_0004 > java.io.IOException: Inconsistent sequence file: current chunk file > org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/ > > 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ > length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry > org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e- > > 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ > length = 5142 aclEntries = null, xAttrs = null} > at > org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276) > at > org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567) > {code} > Backup and Restore uses DistCp to transfer files between clusters. > Without the fix from HADOOP-15850, the transfer would fail. 
> This issue is to document the hadoop versions which contain HADOOP-15850 so > that user of Backup and Restore feature knows which hadoop versions they can > use. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670811#comment-16670811 ] Ted Yu commented on HBASE-21381: Putting the supported version under http://hbase.apache.org/book.html#backuprestore is fine. > Document the hadoop versions using which backup and restore feature works > - > > Key: HBASE-21381 > URL: https://issues.apache.org/jira/browse/HBASE-21381 > Project: HBase > Issue Type: Task > Reporter: Ted Yu >Priority: Major > > HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally > tried to concatenate the files being DistCp'ed to target cluster (though the > files are independent). > Following is the log snippet of the failed concatenation attempt: > {code} > 2018-10-13 14:09:25,351 WARN [Thread-936] mapred.LocalJobRunner$Job(590): > job_local1795473782_0004 > java.io.IOException: Inconsistent sequence file: current chunk file > org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/ > > 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ > length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry > org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e- > > 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ > length = 5142 aclEntries = null, xAttrs = null} > at > org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276) > at > org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567) > {code} > Backup and Restore uses DistCp to transfer files between clusters. > Without the fix from HADOOP-15850, the transfer would fail. 
> This issue is to document the Hadoop versions that contain HADOOP-15850 so that users of the Backup and Restore feature know which Hadoop versions they can use. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (HBASE-21381) Document the hadoop versions using which backup and restore feature works
[ https://issues.apache.org/jira/browse/HBASE-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu reassigned HBASE-21381: -- Assignee: Ted Yu > Document the hadoop versions using which backup and restore feature works > - > > Key: HBASE-21381 > URL: https://issues.apache.org/jira/browse/HBASE-21381 > Project: HBase > Issue Type: Task > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > > HADOOP-15850 fixes a bug where CopyCommitter#concatFileChunks unconditionally > tried to concatenate the files being DistCp'ed to target cluster (though the > files are independent). > Following is the log snippet of the failed concatenation attempt: > {code} > 2018-10-13 14:09:25,351 WARN [Thread-936] mapred.LocalJobRunner$Job(590): > job_local1795473782_0004 > java.io.IOException: Inconsistent sequence file: current chunk file > org.apache.hadoop.tools.CopyListingFileStatus@bb8826ee{hdfs://localhost:42796/user/hbase/test-data/ > > 160aeab5-6bca-9f87-465e-2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/a7599081e835440eb7bf0dd3ef4fd7a5_SeqId_205_ > length = 5100 aclEntries = null, xAttrs = null} doesnt match prior entry > org.apache.hadoop.tools.CopyListingFileStatus@243d544d{hdfs://localhost:42796/user/hbase/test-data/160aeab5-6bca-9f87-465e- > > 2517a0c43119/data/default/test-1539439707496/96b5a3613d52f4df1ba87a1cef20684c/f/394e6d39a9b94b148b9089c4fb967aad_SeqId_205_ > length = 5142 aclEntries = null, xAttrs = null} > at > org.apache.hadoop.tools.mapred.CopyCommitter.concatFileChunks(CopyCommitter.java:276) > at > org.apache.hadoop.tools.mapred.CopyCommitter.commitJob(CopyCommitter.java:100) > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:567) > {code} > Backup and Restore uses DistCp to transfer files between clusters. > Without the fix from HADOOP-15850, the transfer would fail. 
> This issue is to document the Hadoop versions that contain HADOOP-15850 so that users of the Backup and Restore feature know which Hadoop versions they can use. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670472#comment-16670472 ] Ted Yu commented on HBASE-21246: wal-providers.png is a diagram of the WALProvider-related classes. wal-factory-providers.png is a diagram of the WALProvider-related classes together with the WALFactory class. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, > wal-factory-providers.png, wal-providers.png > > > We are introducing the WALIdentity interface so that the WAL representation can be decoupled from the distributed filesystem. > The interface provides a getName method whose return value can represent a filename in a distributed filesystem environment or the name of the stream when the WAL is backed by a log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
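The decoupling described in the issue can be illustrated with a minimal sketch. Only the getName method comes from the issue text; the two implementing classes, their constructors, and the sample path and stream name are assumptions made up for illustration and are not taken from any of the attached patches.

```java
// Hypothetical sketch: only the getName() method is taken from the issue text.
interface WALIdentity {
    /** File name on a distributed filesystem, or stream name for a log stream. */
    String getName();
}

/** Assumed filesystem-backed identity; derives the name from the last path segment. */
final class FileWALIdentity implements WALIdentity {
    private final String path;

    FileWALIdentity(String path) {
        this.path = path;
    }

    @Override
    public String getName() {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? path : path.substring(slash + 1);
    }
}

/** Assumed stream-backed identity; the name is the stream name itself. */
final class StreamWALIdentity implements WALIdentity {
    private final String streamName;

    StreamWALIdentity(String streamName) {
        this.streamName = streamName;
    }

    @Override
    public String getName() {
        return streamName;
    }
}

class WALIdentityDemo {
    public static void main(String[] args) {
        // Both the path and the stream name below are invented sample values.
        WALIdentity[] ids = {
            new FileWALIdentity("/hbase/WALs/rs1/wal.1540000000000"),
            new StreamWALIdentity("rs1-log-stream")
        };
        // Callers see only names, not how the WAL is stored.
        for (WALIdentity id : ids) {
            System.out.println(id.getName());
        }
    }
}
```

Code that consumes a WALIdentity in this sketch never touches a filesystem type, which is the property the issue is after.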
[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21246: --- Attachment: wal-providers.png wal-factory-providers.png > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, > wal-factory-providers.png, wal-providers.png > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition in snapshot cache refreshing leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21387:
---
Labels: snapshot (was: )

> Race condition in snapshot cache refreshing leads to loss of snapshot files
> ---
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
> Issue Type: Bug
> Reporter: Ted Yu
> Assignee: Ted Yu
> Priority: Major
> Labels: snapshot
> Attachments: 21387.v1.txt
>
>
> A customer recently reported an ExportSnapshot failure:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table.
> {code}
> We found the following in the log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive
> {code}
> The root cause is a race condition surrounding SnapshotFileCache#refreshCache().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
> // if the snapshot directory wasn't modified since we last check, we are done
> if (dirStatus.getModificationTime() <= this.lastModifiedTime) return;
> // 1. update the modified time
> this.lastModifiedTime = dirStatus.getModificationTime();
> // 2.clear the cache
> this.cache.clear();
> {code}
> Suppose the RefreshCacheTask runs past the if check and sets this.lastModifiedTime.
> The cleaner then executes refreshCache and returns immediately, since this.lastModifiedTime matches the modification time of the directory.
> Now RefreshCacheTask clears the cache. By the time the cleaner performs its cache lookup, the cache is empty.
> Therefore the cleaner puts the file into unReferencedFiles, leading to data loss.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
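One way to close the window described above is to hold a single lock across the timestamp check, the clear, the reload, and the cleaner's lookup. The following is a minimal sketch under that assumption; it is not necessarily what 21387.v1.txt does. SnapshotCacheSketch, getUnreferencedFiles, and the parameters standing in for the directory status and the on-disk file listing are hypothetical names; only refreshCache, lastModifiedTime, and cache mirror the snippet quoted in the issue.

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified, hypothetical stand-in for SnapshotFileCache. */
final class SnapshotCacheSketch {
    private final Set<String> cache = new HashSet<>();
    private long lastModifiedTime = Long.MIN_VALUE;

    /**
     * The check, the timestamp update, the clear and the reload all happen
     * under one lock, so no other caller can observe the updated timestamp
     * together with an empty cache.
     */
    synchronized void refreshCache(long dirModificationTime, Set<String> filesOnDisk) {
        // Nothing changed since the last refresh: keep the current contents.
        if (dirModificationTime <= lastModifiedTime) {
            return;
        }
        lastModifiedTime = dirModificationTime;
        cache.clear();
        cache.addAll(filesOnDisk);
    }

    /**
     * Cleaner-side entry point: refresh and lookup share the same lock
     * (Java monitors are reentrant), so the lookup never runs between
     * another thread's timestamp update and its reload.
     */
    synchronized Set<String> getUnreferencedFiles(
            Set<String> candidates, long dirModificationTime, Set<String> filesOnDisk) {
        refreshCache(dirModificationTime, filesOnDisk);
        Set<String> unreferenced = new HashSet<>();
        for (String file : candidates) {
            if (!cache.contains(file)) {
                unreferenced.add(file);
            }
        }
        return unreferenced;
    }
}
```

The trade-off in this sketch is that a slow refresh blocks the cleaner for its full duration; that is usually preferable to deleting a file that a snapshot still references.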
[jira] [Commented] (HBASE-21407) Resolve NPE in backup Master UI
[ https://issues.apache.org/jira/browse/HBASE-21407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16669515#comment-16669515 ] Ted Yu commented on HBASE-21407: [~openinx]: Can you commit this patch? Thanks. > Resolve NPE in backup Master UI > > > Key: HBASE-21407 > URL: https://issues.apache.org/jira/browse/HBASE-21407 > Project: HBase > Issue Type: Bug > Components: UI >Affects Versions: 3.0.0, 2.1.0, 2.2.0 >Reporter: Jingyun Tian >Assignee: Jingyun Tian >Priority: Minor > Fix For: 3.0.0, 2.1.0, 2.2.0 > > Attachments: hbase-21407.master.001.patch > > > Since some pages of our UI are using jsp instead of jamon, the fix for HBASE-18263 is not enough. Added the fix for HBASE-18263 to header.jsp. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21407) Resolve NPE in backup Master UI
[ https://issues.apache.org/jira/browse/HBASE-21407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16669505#comment-16669505 ] Ted Yu commented on HBASE-21407: lgtm > Resolve NPE in backup Master UI > > > Key: HBASE-21407 > URL: https://issues.apache.org/jira/browse/HBASE-21407 > Project: HBase > Issue Type: Bug > Components: UI >Affects Versions: 3.0.0, 2.1.0, 2.2.0 >Reporter: Jingyun Tian >Assignee: Jingyun Tian >Priority: Minor > Fix For: 3.0.0, 2.1.0, 2.2.0 > > Attachments: hbase-21407.master.001.patch > > > Since some pages of our UI are using jsp instead of jamon, the fix of > HBASE-18263 is not enough. Added the fix of HBASE-18263 to the header.jsp. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (OMID-120) Utilize protobuf-maven-plugin for build
Ted Yu created OMID-120:
---
Summary: Utilize protobuf-maven-plugin for build
Key: OMID-120
URL: https://issues.apache.org/jira/browse/OMID-120
Project: Apache Omid
Issue Type: Improvement
Reporter: Ted Yu

Currently protoc is required during build:
{code}
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default-cli) on project omid-common: An Ant BuildException has occured: Execute failed: java.io.IOException: Cannot run program "protoc" (in directory "/omid/common"): error=2, No such file or directory
{code}
We should utilize protobuf-maven-plugin so that developers don't have to install protoc on the build machine.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668955#comment-16668955 ] Ted Yu commented on HBASE-21246: bq. Missed push-down into WALProvider The push-down is implemented in patch v23. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21246: --- Attachment: 21246.23.txt > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task > Reporter: Ted Yu > Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (RATIS-377) Tolerate partially written log header
[ https://issues.apache.org/jira/browse/RATIS-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668849#comment-16668849 ] Ted Yu commented on RATIS-377: -- bq. How could this be possible? May I know how the contents of HEADER_BYTES would be modified ? I was thinking that if such modification is possible, we should be cautious with the clone as well. bq. does not make sense to add Flue as a dependency of Ratis just for it. Agreed. Assumption is that under the same license, we can pull in that class instead of adding dependency on the project. > Tolerate partially written log header > - > > Key: RATIS-377 > URL: https://issues.apache.org/jira/browse/RATIS-377 > Project: Ratis > Issue Type: Bug >Affects Versions: 0.3.0 >Reporter: Nilotpal Nandi >Assignee: Tsz Wo Nicholas Sze >Priority: Blocker > Fix For: 0.3.0 > > Attachments: r377_20181028c.patch > > > steps taken : > -- > # wrote 5GB files through ozonefs > # stopped datanodes, scm , om. > # started all services. > # Tried to read the file. > One of the datanodes failed to start. 
Throwing > "java.lang.IllegalStateException: Corrupted log header" > > {noformat} > 2018-10-26 10:26:01,317 ERROR org.apache.ratis.server.storage.LogInputStream: > caught exception initializing log_inprogress_293 > java.lang.IllegalStateException: Corrupted log header: ^@^@^@^@^@^@^@^@ > at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60) > at > org.apache.ratis.server.storage.LogInputStream.init(LogInputStream.java:93) > at > org.apache.ratis.server.storage.LogInputStream.nextEntry(LogInputStream.java:120) > at > org.apache.ratis.server.storage.LogSegment.readSegmentFile(LogSegment.java:111) > at > org.apache.ratis.server.storage.LogSegment.loadSegment(LogSegment.java:133) > at > org.apache.ratis.server.storage.RaftLogCache.loadSegment(RaftLogCache.java:110) > at > org.apache.ratis.server.storage.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:151) > at > org.apache.ratis.server.storage.SegmentedRaftLog.open(SegmentedRaftLog.java:120) > at org.apache.ratis.server.impl.ServerState.initLog(ServerState.java:191) > at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:114) > at > org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:106) > at > org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:196) > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) > at > java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1582) > at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) > at > java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) > at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) > at > java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) > 2018-10-26 10:26:03,671 INFO > org.apache.hadoop.ozone.web.netty.ObjectStoreRestHttpServer: Listening HDDS > REST traffic on /0.0.0.0:9880 > 2018-10-26 10:26:03,672 INFO 
org.apache.hadoop.ozone.HddsDatanodeService: > Started plug-in org.apache.hadoop.ozone.web.OzoneHddsDatanodeService@1e411d81 > 2018-10-26 10:26:03,676 INFO > org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer: Attempting to > start container services. > 2018-10-26 10:26:03,676 INFO > org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: > Starting XceiverServerRatis 0d7f5327-df16-40fe-ac88-7ed06e76a20f at port 9858 > 2018-10-26 10:26:03,702 ERROR > org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: > Unable to start the DatanodeState Machine > java.io.IOException: java.lang.IllegalStateException: Corrupted log header: > ^@^@^@^@^@^@^@^@ > at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:51) > at > org.apache.ratis.server.storage.LogInputStream.nextEntry(LogInputStream.java:123) > at > org.apache.ratis.server.storage.LogSegment.readSegmentFile(LogSegment.java:111) > at > org.apache.ratis.server.storage.LogSegment.loadSegment(LogSegment.java:133) > at > org.apache.ratis.server.storage.RaftLogCache.loadSegment(RaftLogCache.java:110) > at > org.apache.ratis.server.storage.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:151) > at > org.apache.ratis.server.storage.SegmentedRaftLog.open(SegmentedRaftLog.java:120) > at org.apache.ratis.server.impl.ServerState.init
[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16668840#comment-16668840 ] Ted Yu commented on HBASE-21246: bq. This is extracting the creation time, right? The start time is extracted. Good catch, rewritten in patch v21 with {{walProvider.getWALStartTime}}. bq. to make sure we aren't breaking what he and Zach are doing Agreed. From my investigation so far, there is no conflict between WAL refactoring and what's in place on the master branch. With HBASE-20734, the location for WAL and recovered edits is unified. This actually benefits WAL refactoring - we can get FileSystem from WAL Identity (Path). > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)