[
https://issues.apache.org/jira/browse/HBASE-25887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Reid Chan resolved HBASE-25887.
-------------------------------
Hadoop Flags: Reviewed
Resolution: Fixed
> Corrupt wal while region server is aborting.
> --------------------------------------------
>
> Key: HBASE-25887
> URL: https://issues.apache.org/jira/browse/HBASE-25887
> Project: HBase
> Issue Type: Improvement
> Components: regionserver, wal
> Affects Versions: 1.6.0
> Reporter: Rushabh Shah
> Assignee: Rushabh Shah
> Priority: Major
> Fix For: 1.7.0
>
>
> We have seen a case in our production cluster where we ended up in corrupt
> wal. WALSplitter logged the below error
> {noformat}
> 2021-05-12 00:42:46,786 FATAL [XXXX:60020-1] regionserver.HRegionServer -
> ABORTING region server HOST-B,60020,16207794418
> 88: Caught throwable while processing event RS_LOG_REPLAY
> java.lang.NullPointerException
> at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:411)
> at
> org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEditFamily(WALEdit.java:145)
> at
> org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEdit(WALEdit.java:150)
> at
> org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:408)
> at
> org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:261)
> at
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:105)
> at
> org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
> at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Looking at the raw wal file, we could see that the last WALEdit contains the
> region id, tablename and sequence number but cells were not persisted.
> Looking at the logs of the RS that generated that corrupt wal file,
> {noformat}
> 2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL writer
> in /hbase/WALs/HOST-A,60020,1620774393046
> 2021-05-11 23:29:22,196 DEBUG [/HOST-A:60020] ipc.AbstractRpcClient -
> Stopping rpc client
> 2021-05-11 23:29:22,198 INFO [/HOST-A:60020] regionserver.Leases -
> regionserver/HOST-A/:60020 closing leases
> 2021-05-11 23:29:22,198 INFO [/HOST-A:60020] regionserver.Leases -
> regionserver/HOST-A:/HOST-A:60020 closed leases
> 2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append
> sequenceId=7147823, requesting roll of WAL
> java.nio.channels.ClosedChannelException
> at
> org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:331)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:151)
> at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105)
> at
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> at org.apache.hadoop.hbase.KeyValue.write(KeyValue.java:2543)
> at
> org.apache.phoenix.hbase.index.wal.KeyValueCodec.write(KeyValueCodec.java:104)
> at
> org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueEncoder.write(IndexedWALEditCodec.java:218)
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:128)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:2083)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1941)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> at
> com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> These 2 lines are interesting.
> {quote}2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL
> writer in /hbase/WALs/HOST-A,60020,1620774393046
> ....
> ....
> 2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append
> sequenceId=7147823, requesting roll of WAL
> java.nio.channels.ClosedChannelException
> {quote}
> The append thread encountered java.nio.channels.ClosedChannelException while
> writing to wal file because the wal file was already closed.
> This is the sequence of shutting down of threads when RS aborts.
> {noformat}
> // With disruptor down, this is safe to let go.
> if (this.appendExecutor != null) this.appendExecutor.shutdown();
> // Tell our listeners that the log is closing
> ...
> if (LOG.isDebugEnabled()) {
> LOG.debug("Closing WAL writer in " + FSUtils.getPath(fullPathLogDir));
> }
> if (this.writer != null) {
> this.writer.close();
> this.writer = null;
> }
> }
> {noformat}
> 1. Shutdown the appendExecutor (but we *don't* wait for the currently running
> tasks to complete)
> 2. Close the writer. This will in turn close the wal log file.
> We need to wait for appendExecutor to terminate properly and then only close
> the writer.
> Code has been changed a lot in master branch. Need to study whether this bug
> exists in master/branch-2
--
This message was sent by Atlassian Jira
(v8.3.4#803005)