[ https://issues.apache.org/jira/browse/HBASE-23633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025829#comment-17025829 ]
Pankaj Kumar commented on HBASE-23633:
--------------------------------------

I also observed this problem during testing: many regions *FAILED* to open due to CorruptHFileException.

{noformat}
2020-01-29 07:07:13,911 | INFO | RS_OPEN_REGION-RS-IP:RS-PORT-2 | Validating hfile at hdfs://cluster/hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/0000000000000000290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793 for inclusion in store family region usertable01,user35466,1580220595485.a2f0e8b46399ce55e864d4ee7311c845. | org.apache.hadoop.hbase.regionserver.HStore.assertBulkLoadHFileOk(HStore.java:730)
2020-01-29 07:07:13,930 | ERROR | RS_OPEN_REGION-RS-IP:RS-PORT-2 | Failed open of region=usertable01,user35466,1580220595485.a2f0e8b46399ce55e864d4ee7311c845., starting to roll back the global memstore size. | org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:386)
org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file hdfs://cluster/hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/0000000000000000290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793
	at org.apache.hadoop.hbase.io.hfile.HFile.openReader(HFile.java:503)
	at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:562)
	at org.apache.hadoop.hbase.regionserver.HStore.assertBulkLoadHFileOk(HStore.java:732)
	at org.apache.hadoop.hbase.regionserver.HRegion.loadRecoveredHFilesIfAny(HRegion.java:4905)
	at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:863)
	at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:824)
	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:7023)
{noformat}

Digging further into the log, I observed that this problem occurred while the "split-log-closeStream" thread was splitting a WAL into hfiles and the Region Server aborted.
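To make the failure mode concrete: the hfile writer puts a fixed-size trailer at the very end of the file, so a writer that dies mid-flush leaves a file whose tail cannot be parsed as a trailer, and the reader surfaces that as CorruptHFileException. The toy sketch below uses a simplified format of my own, *not* the real HFile trailer layout ("TRAILER!" is a made-up magic), but it shows why a truncated file can never pass the check:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Toy illustration only: a fixed-size trailer with a magic marker is
// written LAST, so a file cut short by an interrupted writer can never
// end with a well-formed trailer. The real HFile trailer layout differs.
public class TrailerCheck {
    // Hypothetical magic; the real HFile trailer magic is different.
    static final byte[] MAGIC = "TRAILER!".getBytes(StandardCharsets.US_ASCII);
    static final int TRAILER_SIZE = MAGIC.length + 8; // magic + 8 payload bytes

    /** Returns true only if the file bytes end with a well-formed trailer. */
    static boolean looksComplete(byte[] file) {
        if (file.length < TRAILER_SIZE) {
            return false; // shorter than any valid trailer: writer died mid-flush
        }
        byte[] tail = Arrays.copyOfRange(file, file.length - TRAILER_SIZE,
                                         file.length - TRAILER_SIZE + MAGIC.length);
        return Arrays.equals(tail, MAGIC);
    }

    /** Appends a well-formed trailer, as a completed write would. */
    static byte[] completeFile(byte[] payload) {
        byte[] out = new byte[payload.length + TRAILER_SIZE];
        System.arraycopy(payload, 0, out, 0, payload.length);
        System.arraycopy(MAGIC, 0, out, payload.length, MAGIC.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] ok = completeFile("kv-data".getBytes(StandardCharsets.US_ASCII));
        System.out.println("complete file valid:  " + looksComplete(ok));
        // Simulate the interrupted "split-log-closeStream" writer: cut the file short.
        byte[] truncated = Arrays.copyOfRange(ok, 0, ok.length / 2);
        System.out.println("truncated file valid: " + looksComplete(truncated));
    }
}
```

In the scenario above the cut is even more extreme: lease recovery reports newlength=0, so the recovered hfile is shorter than any valid trailer and fails validation immediately on region open.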
So the "split-log-closeStream" thread was interrupted and left the recovered hfile in an intermediate state.

{noformat}
2020-01-28 23:01:04,962 | WARN | RS_LOG_REPLAY_OPS-8-5-179-5:RS-PORT-0 | log splitting of WALs/RS-IP,RS-PORT,1580220469213-splitting/RS-IP%2CRS-PORT%2C1580220469213.1580222580793 interrupted, resigning | org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
java.io.InterruptedIOException
	at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.writeRemainingEntryBuffers(BoundedRecoveredHFilesOutputSink.java:186)
	at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.close(BoundedRecoveredHFilesOutputSink.java:155)
	at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:404)
	at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:225)
	at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:105)
	at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
	at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
	at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.writeRemainingEntryBuffers(BoundedRecoveredHFilesOutputSink.java:179)
	... 9 more
{noformat}

Further, I checked and confirmed from the NN audit log that the file was not written completely before the RS went down:

{noformat}
2020-01-28 23:01:04,946 | INFO | IPC Server handler 125 on 25000 | BLOCK* allocate blk_1092127264_18392260, replicas=DN-IP1:DN-PORT, DN-IP2:DN-PORT, DN-IP3:DN-PORT for /hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/0000000000000000290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793 | FSDirWriteFileOp.java:856
----
2020-01-29 00:01:04,956 | INFO | org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@862fb5 | Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-1098699935_1, pending creates: 21], src=/hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/0000000000000000290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793 | FSNamesystem.java:3344
2020-01-29 00:01:04,957 | WARN | org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@862fb5 | DIR* NameSystem.internalReleaseLease: File /hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/0000000000000000290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793 has not been closed. Lease recovery is in progress. RecoveryId = 18395023 for block blk_1092127264_18392260 | FSNamesystem.java:3470
2020-01-29 00:01:14,504 | INFO | IPC Server handler 0 on 25006 | commitBlockSynchronization(oldBlock=BP-2062589142-192.168.250.11-1574429102552:blk_1092127264_18392260, file=/hbase/data/default/usertable01/a2f0e8b46399ce55e864d4ee7311c845/family/recovered.hfiles/0000000000000000290-RS-IP%2CRS-PORT%2C1580220469213.1580222580793, newgenerationstamp=18395023, newlength=0, newtargets=[]) successful | FSNamesystem.java:3748
{noformat}

Since the WAL split was interrupted, HMaster will recover by resubmitting the WAL split task, so there should be no data loss IMO. We should clean up such corrupted hfiles. What do you think [~zghao] [~stack] sir?
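Since the data is safe to regenerate from the resubmitted WAL split task, one possible cleanup would be to sideline corrupt recovered hfiles during region open instead of failing the open. The sketch below is only an illustration of that suggestion, not an actual HBase patch: the "corrupt.hfiles" directory name and the isCorrupt predicate are hypothetical, with the predicate standing in for the real trailer validation that throws CorruptHFileException.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Predicate;

// Hedged sketch: move aside every recovered hfile that fails validation
// so the region can open; the resubmitted WAL split regenerates the data.
public class SidelineCorruptHFiles {

    /**
     * Scans a recovered.hfiles directory and moves every file that fails
     * validation into a sibling directory. Returns how many files were moved.
     */
    static int sidelineCorrupt(Path recoveredDir, Predicate<Path> isCorrupt) throws IOException {
        Path corruptDir = recoveredDir.resolveSibling("corrupt.hfiles"); // hypothetical location
        int moved = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(recoveredDir)) {
            for (Path f : files) {
                if (isCorrupt.test(f)) {
                    Files.createDirectories(corruptDir);
                    Files.move(f, corruptDir.resolve(f.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                    moved++;
                }
            }
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("recovered.hfiles");
        Files.write(dir.resolve("good"), new byte[]{1, 2, 3});
        // Zero-length file, like the newlength=0 result of lease recovery above.
        Files.write(dir.resolve("bad"), new byte[0]);
        // For the demo, treat empty files as corrupt; the real check would
        // attempt to read the HFile trailer instead.
        int moved = sidelineCorrupt(dir, p -> {
            try { return Files.size(p) == 0; } catch (IOException e) { return true; }
        });
        System.out.println("sidelined " + moved + " corrupt file(s)");
    }
}
```

Sidelining rather than deleting keeps the bad file around for inspection, which matters given the concern about silently dropping data.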
> Find a way to handle the corrupt recovered hfiles
> -------------------------------------------------
>
>                 Key: HBASE-23633
>                 URL: https://issues.apache.org/jira/browse/HBASE-23633
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Guanghao Zhang
>            Priority: Major
>
> Copy the comment from PR review.
>
> If the file is a corrupt HFile, an exception will be thrown here, which will cause the region to fail to open.
> Maybe we can add a new parameter to control whether to skip the exception, similar to recovered edits, which has the parameter "hbase.hregion.edits.replay.skip.errors".
>
> Regions that can't be opened because of detached References or corrupt hfiles are a fact of life. We need to work on this issue. This will be a new variant on the problem, i.e. bad recovered hfiles.
> On adding a config to ignore bad files and just open, that's a bit dangerous as per @infraio, as it could mean silent data loss.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)