[ https://issues.apache.org/jira/browse/HBASE-25887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reid Chan resolved HBASE-25887. ------------------------------- Hadoop Flags: Reviewed Resolution: Fixed > Corrupt wal while region server is aborting. > -------------------------------------------- > > Key: HBASE-25887 > URL: https://issues.apache.org/jira/browse/HBASE-25887 > Project: HBase > Issue Type: Improvement > Components: regionserver, wal > Affects Versions: 1.6.0 > Reporter: Rushabh Shah > Assignee: Rushabh Shah > Priority: Major > Fix For: 1.7.0 > > > We have seen a case in our production cluster where we ended up in corrupt > wal. WALSplitter logged the below error > {noformat} > 2021-05-12 00:42:46,786 FATAL [XXXX:60020-1] regionserver.HRegionServer - > ABORTING region server HOST-B,60020,16207794418 > 88: Caught throwable while processing event RS_LOG_REPLAY > java.lang.NullPointerException > at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:411) > at > org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEditFamily(WALEdit.java:145) > at > org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEdit(WALEdit.java:150) > at > org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:408) > at > org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:261) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:105) > at > org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {noformat} > Looking at the raw wal file, we could see that the last WALEdit contains the > region id, tablename and sequence number but cells were not persisted. > Looking at the logs of the RS that generated that corrupt wal file, > {noformat} > 2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL writer > in /hbase/WALs/HOST-A,60020,1620774393046 > 2021-05-11 23:29:22,196 DEBUG [/HOST-A:60020] ipc.AbstractRpcClient - > Stopping rpc client > 2021-05-11 23:29:22,198 INFO [/HOST-A:60020] regionserver.Leases - > regionserver/HOST-A/:60020 closing leases > 2021-05-11 23:29:22,198 INFO [/HOST-A:60020] regionserver.Leases - > regionserver/HOST-A:/HOST-A:60020 closed leases > 2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append > sequenceId=7147823, requesting roll of WAL > java.nio.channels.ClosedChannelException > at > org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:331) > at > org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:151) > at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105) > at > org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58) > at java.io.DataOutputStream.write(DataOutputStream.java:107) > at org.apache.hadoop.hbase.KeyValue.write(KeyValue.java:2543) > at > org.apache.phoenix.hbase.index.wal.KeyValueCodec.write(KeyValueCodec.java:104) > at > org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueEncoder.write(IndexedWALEditCodec.java:218) > at > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:128) > at > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:2083) > at > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1941) > at > org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857) > at > com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {noformat} > These 2 lines are interesting. > {quote}2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL > writer in /hbase/WALs/HOST-A,60020,1620774393046 > .... > .... > 2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append > sequenceId=7147823, requesting roll of WAL > java.nio.channels.ClosedChannelException > {quote} > The append thread encountered java.nio.channels.ClosedChannelException while > writing to wal file because the wal file was already closed. > This is the sequence of shutting down of threads when RS aborts. > {noformat} > // With disruptor down, this is safe to let go. > if (this.appendExecutor != null) this.appendExecutor.shutdown(); > // Tell our listeners that the log is closing > ... > if (LOG.isDebugEnabled()) { > LOG.debug("Closing WAL writer in " + FSUtils.getPath(fullPathLogDir)); > } > if (this.writer != null) { > this.writer.close(); > this.writer = null; > } > } > {noformat} > 1. Shutdown the appendExecutor (but we *don't* wait for the currently running > tasks to complete) > 2. Close the writer. This will in turn close the wal log file. > We need to wait for appendExecutor to terminate properly and then only close > the writer. > Code has been changed a lot in master branch. Need to study whether this bug > exists in master/branch-2 -- This message was sent by Atlassian Jira (v8.3.4#803005)