[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles
[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064938#comment-17064938 ]

Hudson commented on HBASE-23633:
--------------------------------

Results for branch branch-2.3 [build #5 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/5/]: (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/5//General_Nightly_Build_Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/5//JDK8_Nightly_Build_Report_(Hadoop2)/]
(x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/5//JDK8_Nightly_Build_Report_(Hadoop3)/]
(x) {color:red}-1 jdk11 hadoop3 checks{color} -- For more information [see jdk11 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/5//JDK11_Nightly_Build_Report/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(x) {color:red}-1 client integration test{color} -- Failed when running client tests on top of Hadoop 2. [see log for details|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/5//artifact/output-integration/hadoop-2.log]. (Note that this means we didn't run on Hadoop 3.)

> Find a way to handle the corrupt recovered hfiles
> -------------------------------------------------
>
>                 Key: HBASE-23633
>                 URL: https://issues.apache.org/jira/browse/HBASE-23633
>             Project: HBase
>          Issue Type: Bug
>          Components: MTTR, wal
>    Affects Versions: 3.0.0, 2.3.0
>            Reporter: Guanghao Zhang
>            Assignee: Pankaj Kumar
>            Priority: Critical
>             Fix For: 3.0.0, 2.3.0, 2.4.0
>
> Copy the comment from PR review.
>
> If the file is a corrupt HFile, an exception will be thrown here, which will cause the region to fail to open.
> Maybe we can add a new parameter to control whether to skip the exception, similar to recovered edits, which have the parameter "hbase.hregion.edits.replay.skip.errors".
>
> Regions that can't be opened because of detached References or corrupt hfiles are a fact of life. We need to work on this issue. This will be a new variant of the problem -- i.e. bad recovered hfiles.
> On adding a config to ignore bad files and just open, that's a bit dangerous as per @infraio, as it could mean silent data loss.
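For illustration, a minimal sketch of the skip-errors guard discussed in the description. The property name is hypothetical (modeled on hbase.hregion.edits.replay.skip.errors) and the method is an editor's simplification, not the merged patch:

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.hbase.io.hfile.CorruptHFileException;
import org.apache.hadoop.hbase.regionserver.HStore;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class RecoveredHFileValidation {
  private static final Logger LOG = LoggerFactory.getLogger(RecoveredHFileValidation.class);

  // HYPOTHETICAL key, named after hbase.hregion.edits.replay.skip.errors.
  static final String SKIP_ERRORS_KEY = "hbase.hregion.recovered.hfiles.skip.errors";

  static void validate(Configuration conf, HStore store, FileStatus[] files)
      throws IOException {
    boolean skipErrors = conf.getBoolean(SKIP_ERRORS_KEY, false);
    for (FileStatus file : files) {
      try {
        // Throws CorruptHFileException when the hfile trailer cannot be read.
        store.assertBulkLoadHFileOk(file.getPath());
      } catch (CorruptHFileException e) {
        if (!skipErrors) {
          throw e;
        }
        // Dangerous, per the review: the edits in this file are silently dropped.
        LOG.warn("Skipping corrupt recovered hfile {}", file.getPath(), e);
      }
    }
  }
}
{code}

Defaulting the flag to false keeps today's fail-fast behavior; an operator would have to opt in to the data-loss risk explicitly, which is the same trade-off the recovered-edits flag makes.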
[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles
[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064518#comment-17064518 ]

Hudson commented on HBASE-23633:
--------------------------------

Results for branch branch-2 [build #2561 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2561/]: (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2561//General_Nightly_Build_Report/]
(x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2561//JDK8_Nightly_Build_Report_(Hadoop2)/]
(x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2561//JDK8_Nightly_Build_Report_(Hadoop3)/]
(x) {color:red}-1 jdk11 hadoop3 checks{color} -- For more information [see jdk11 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2561//JDK11_Nightly_Build_Report/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(x) {color:red}-1 client integration test{color} -- Failed when running client tests on top of Hadoop 2. [see log for details|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2561//artifact/output-integration/hadoop-2.log]. (Note that this means we didn't run on Hadoop 3.)
[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles
[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064379#comment-17064379 ]

Hudson commented on HBASE-23633:
--------------------------------

Results for branch master [build #1675 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/1675/]: (x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/1675//General_Nightly_Build_Report/]
(x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1675//JDK8_Nightly_Build_Report_(Hadoop2)/]
(x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1675//JDK8_Nightly_Build_Report_(Hadoop3)/]
(x) {color:red}-1 jdk11 hadoop3 checks{color} -- For more information [see jdk11 report|https://builds.apache.org/job/HBase%20Nightly/job/master/1675//JDK11_Nightly_Build_Report/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(x) {color:red}-1 client integration test{color} -- Failed when running client tests on top of Hadoop 2. [see log for details|https://builds.apache.org/job/HBase%20Nightly/job/master/1675//artifact/output-integration/hadoop-2.log]. (Note that this means we didn't run on Hadoop 3.)
[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles
[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064206#comment-17064206 ]

Guanghao Zhang commented on HBASE-23633:
----------------------------------------

[~pankajkumar] As the PR was not updated for a long time, I merged it and opened a new issue to add a UT.
[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles
[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029184#comment-17029184 ]

Pankaj Kumar commented on HBASE-23633:
--------------------------------------

Will raise a PR with a UT.
[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles
[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028252#comment-17028252 ]

Guanghao Zhang commented on HBASE-23633:
----------------------------------------

[~pankajkumar] Are you working on this now?
[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles
[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026669#comment-17026669 ]

Pankaj Kumar commented on HBASE-23633:
--------------------------------------

In my test scenario all corrupted hfiles were of zero length. We should check and delete zero-length files during the recovered-hfile bulk load, the same way it is handled while replaying edits:

{code}
private long loadRecoveredHFilesIfAny(Collection<HStore> stores) throws IOException {
  Path regionDir = getWALRegionDir();
  long maxSeqId = -1;
  for (HStore store : stores) {
    String familyName = store.getColumnFamilyName();
    FileStatus[] files =
        WALSplitUtil.getRecoveredHFiles(fs.getFileSystem(), regionDir, familyName);
    if (files != null && files.length != 0) {
      for (FileStatus file : files) {
        // Check and delete the zero length file
        if (isZeroLengthThenDelete(fs.getFileSystem(), file.getPath())) {
          continue;
        }
        store.assertBulkLoadHFileOk(file.getPath());
{code}
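For reference, a sketch of what such a zero-length check-and-delete helper could look like. This is an editor's simplification; the signature and placement of the actual HBase utility may differ:

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch: deletes the file if it is zero length and returns true, so the
 * caller can skip it; returns false if the file has data and should be kept.
 */
static boolean isZeroLengthThenDelete(FileSystem fs, Path path) throws IOException {
  FileStatus stat = fs.getFileStatus(path);
  if (stat.getLen() > 0) {
    return false; // file has data, keep it for bulk load
  }
  // Zero length: nothing recoverable, safe to remove before bulk load.
  fs.delete(path, false);
  return true;
}
{code}

In the real code path a FileStatus is already at hand for each file, so the extra getFileStatus call could be avoided by passing it in directly.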
[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles
[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025829#comment-17025829 ]

Pankaj Kumar commented on HBASE-23633:
--------------------------------------

I also observed this problem during testing: many regions *FAILED* to open due to CorruptHFileException.

{noformat}
2020-01-29 07:07:13,911 | INFO | RS_OPEN_REGION-RS-IP:RS-PORT-2 | Validating hfile at hdfs://cluster/hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793 for inclusion in store family region usertable01,user35466,1580220595485.a2f0e8b46399ce55e864d4ee7311c845. | org.apache.hadoop.hbase.regionserver.HStore.assertBulkLoadHFileOk(HStore.java:730)
2020-01-29 07:07:13,930 | ERROR | RS_OPEN_REGION-RS-IP:RS-PORT-2 | Failed open of region=usertable01,user35466,1580220595485.a2f0e8b46399ce55e864d4ee7311c845., starting to roll back the global memstore size. | org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:386)
org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file hdfs://cluster/hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793
	at org.apache.hadoop.hbase.io.hfile.HFile.openReader(HFile.java:503)
	at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:562)
	at org.apache.hadoop.hbase.regionserver.HStore.assertBulkLoadHFileOk(HStore.java:732)
	at org.apache.hadoop.hbase.regionserver.HRegion.loadRecoveredHFilesIfAny(HRegion.java:4905)
	at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:863)
	at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:824)
	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7023)
{noformat}

After digging more into the log, I observed that this problem occurred when a "split-log-closeStream" thread was splitting a WAL into hfiles and the Region Server aborted, so the "split-log-closeStream" thread was interrupted and left the recovered hfile in an intermediate state.

{noformat}
2020-01-28 23:01:04,962 | WARN | RS_LOG_REPLAY_OPS-8-5-179-5:RS-PORT-0 | log splitting of WALs/RS-IP,RS-PORT,1580220469213-splitting/RS-IP%2CRS-PORT%2C1580220469213.1580222580793 interrupted, resigning | org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
java.io.InterruptedIOException
	at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.writeRemainingEntryBuffers(BoundedRecoveredHFilesOutputSink.java:186)
	at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.close(BoundedRecoveredHFilesOutputSink.java:155)
	at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:404)
	at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:225)
	at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:105)
	at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
	at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
	at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.writeRemainingEntryBuffers(BoundedRecoveredHFilesOutputSink.java:179)
	... 9 more
{noformat}

Further, I checked and confirmed from the NN audit log that the file was not written completely before the RS went down:

{noformat}
2020-01-28 23:01:04,946 | INFO | IPC Server handler 125 on 25000 | BLOCK* allocate blk_1092127264_18392260, replicas=DN-IP1:DN-PORT, DN-IP2:DN-PORT, DN-IP3:DN-PORT for /hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793 | FSDirWriteFileOp.java:856
2020-01-29 00:01:04,956 | INFO | org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@862fb5 | Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-1098699935_1, pending creates: 21], src=/hbase/dat
{noformat}
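The "Problem reading HFile Trailer" error above comes from opening a reader on the file, which forces the fixed-size trailer at the end of the hfile to be read; a partially written file fails exactly there. A sketch of probing for this up front, an editorial illustration assuming the public HBase 2.x reader API:

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.CorruptHFileException;
import org.apache.hadoop.hbase.io.hfile.HFile;

/** Returns true if the hfile's trailer is readable, false if it is corrupt. */
static boolean isReadableHFile(FileSystem fs, Path path, Configuration conf)
    throws IOException {
  try {
    // Creating a reader reads and validates the trailer; close immediately,
    // since we only care whether the open succeeds.
    HFile.createReader(fs, path, new CacheConfig(conf), true, conf).close();
    return true;
  } catch (CorruptHFileException e) {
    return false;
  }
}
{code}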
[jira] [Commented] (HBASE-23633) Find a way to handle the corrupt recovered hfiles
[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007690#comment-17007690 ]

Michael Stack commented on HBASE-23633:
---------------------------------------

What I want is an indication in the Master log that a Region is not opening because hfiles are corrupt or we have dangling references. Currently it just says the open failed, but not why (smile).