Rushabh Shah created HBASE-25887:
------------------------------------

             Summary: Corrupt wal while region server is aborting.
                 Key: HBASE-25887
                 URL: https://issues.apache.org/jira/browse/HBASE-25887
             Project: HBase
          Issue Type: Improvement
          Components: regionserver, wal
    Affects Versions: 1.6.0
            Reporter: Rushabh Shah
            Assignee: Rushabh Shah


We have seen a case in our production cluster where we ended up in corrupt wal. 
WALSplitter logged the below error
{noformat}
2021-05-12 00:42:46,786 FATAL [XXXX:60020-1] regionserver.HRegionServer - 
ABORTING region server HOST-B,60020,16207794418
88: Caught throwable while processing event RS_LOG_REPLAY
java.lang.NullPointerException
        at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:411)
        at 
org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEditFamily(WALEdit.java:145)
        at 
org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEdit(WALEdit.java:150)
        at 
org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:408)
        at 
org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:261)
        at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:105)
        at 
org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
        at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
Looking at the raw wal file, we could see that the last WALEdit contains the 
region id, tablename and sequence number but cells were not persisted.
 Looking at the logs of the RS that generated that corrupt wal file,
{noformat}
2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL writer 
in /hbase/WALs/HOST-A,60020,1620774393046
2021-05-11 23:29:22,196 DEBUG [/HOST-A:60020] ipc.AbstractRpcClient - Stopping 
rpc client
2021-05-11 23:29:22,198 INFO  [/HOST-A:60020] regionserver.Leases - 
regionserver/HOST-A/:60020 closing leases
2021-05-11 23:29:22,198 INFO  [/HOST-A:60020] regionserver.Leases - 
regionserver/HOST-A:/HOST-A:60020 closed leases
2021-05-11 23:29:22,198 WARN  [0020.append-pool8-t1] wal.FSHLog - Append 
sequenceId=7147823, requesting roll of WAL
java.nio.channels.ClosedChannelException
        at 
org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:331)
        at 
org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:151)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105)
        at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.hadoop.hbase.KeyValue.write(KeyValue.java:2543)
        at 
org.apache.phoenix.hbase.index.wal.KeyValueCodec.write(KeyValueCodec.java:104)
        at 
org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueEncoder.write(IndexedWALEditCodec.java:218)
        at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:128)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:2083)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1941)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
        at 
com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
These 2 lines are interesting.
{quote}2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL 
writer in /hbase/WALs/HOST-A,60020,1620774393046
 ....
 ....
 2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append 
sequenceId=7147823, requesting roll of WAL
 java.nio.channels.ClosedChannelException
{quote}
The append thread encountered java.nio.channels.ClosedChannelException while 
writing to wal file because the wal file was already closed.

This is the sequence of shutting down of threads when RS aborts.
{noformat}
      // With disruptor down, this is safe to let go.
      if (this.appendExecutor !=  null) this.appendExecutor.shutdown();

      // Tell our listeners that the log is closing
       ...
      if (LOG.isDebugEnabled()) {
        LOG.debug("Closing WAL writer in " + FSUtils.getPath(fullPathLogDir));
      }
      if (this.writer != null) {
        this.writer.close();
        this.writer = null;
      }
    }
{noformat}
1. Shutdown the appendExecutor (but we *don't* wait for the currently running 
tasks to complete)
 2. Close the writer. This will in turn close the wal log file.

We need to wait for appendExecutor to terminate properly and then only close 
the writer.

Code has been changed a lot in master branch. Need to study whether this bug 
exists in master/branch-2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to