Rushabh Shah created HBASE-25887: ------------------------------------ Summary: Corrupt wal while region server is aborting. Key: HBASE-25887 URL: https://issues.apache.org/jira/browse/HBASE-25887 Project: HBase Issue Type: Improvement Components: regionserver, wal Affects Versions: 1.6.0 Reporter: Rushabh Shah Assignee: Rushabh Shah
We have seen a case in our production cluster where we ended up in corrupt wal. WALSplitter logged the below error {noformat} 2021-05-12 00:42:46,786 FATAL [XXXX:60020-1] regionserver.HRegionServer - ABORTING region server HOST-B,60020,16207794418 88: Caught throwable while processing event RS_LOG_REPLAY java.lang.NullPointerException at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:411) at org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEditFamily(WALEdit.java:145) at org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEdit(WALEdit.java:150) at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:408) at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:261) at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:105) at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} Looking at the raw wal file, we could see that the last WALEdit contains the region id, tablename and sequence number but cells were not persisted. Looking at the logs of the RS that generated that corrupt wal file, {noformat} 2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL writer in /hbase/WALs/HOST-A,60020,1620774393046 2021-05-11 23:29:22,196 DEBUG [/HOST-A:60020] ipc.AbstractRpcClient - Stopping rpc client 2021-05-11 23:29:22,198 INFO [/HOST-A:60020] regionserver.Leases - regionserver/HOST-A/:60020 closing leases 2021-05-11 23:29:22,198 INFO [/HOST-A:60020] regionserver.Leases - regionserver/HOST-A:/HOST-A:60020 closed leases 2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append sequenceId=7147823, requesting roll of WAL java.nio.channels.ClosedChannelException at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:331) at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:151) at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58) at java.io.DataOutputStream.write(DataOutputStream.java:107) at org.apache.hadoop.hbase.KeyValue.write(KeyValue.java:2543) at org.apache.phoenix.hbase.index.wal.KeyValueCodec.write(KeyValueCodec.java:104) at org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueEncoder.write(IndexedWALEditCodec.java:218) at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:128) at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:2083) at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1941) at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857) at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} These 2 lines are interesting. {quote}2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL writer in /hbase/WALs/HOST-A,60020,1620774393046 .... .... 2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append sequenceId=7147823, requesting roll of WAL java.nio.channels.ClosedChannelException {quote} The append thread encountered java.nio.channels.ClosedChannelException while writing to wal file because the wal file was already closed. This is the sequence of shutting down of threads when RS aborts. {noformat} // With disruptor down, this is safe to let go. if (this.appendExecutor != null) this.appendExecutor.shutdown(); // Tell our listeners that the log is closing ... if (LOG.isDebugEnabled()) { LOG.debug("Closing WAL writer in " + FSUtils.getPath(fullPathLogDir)); } if (this.writer != null) { this.writer.close(); this.writer = null; } } {noformat} 1. Shutdown the appendExecutor (but we *don't* wait for the currently running tasks to complete) 2. Close the writer. This will in turn close the wal log file. We need to wait for appendExecutor to terminate properly and then only close the writer. Code has been changed a lot in master branch. Need to study whether this bug exists in master/branch-2 -- This message was sent by Atlassian Jira (v8.3.4#803005)