Rushabh Shah created HBASE-25536:
------------------------------------

             Summary: Remove 0 length wal file from queue if it belongs to old sources.
                 Key: HBASE-25536
                 URL: https://issues.apache.org/jira/browse/HBASE-25536
             Project: HBase
          Issue Type: Improvement
          Components: Replication
    Affects Versions: 1.6.0
            Reporter: Rushabh Shah
            Assignee: Rushabh Shah
             Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.3.5, 2.4.2


In our production clusters, we found a case where the RS does not remove a
0-length WAL file from the replication queue (the in-memory queue, not the ZK
replication queue) when the logQueue size is 1.
Stack trace below:
{noformat}
2021-01-28 14:44:18,434 ERROR [,60020,1609950703085] regionserver.ReplicationSourceWALReaderThread - Failed to read stream of replication entries
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: java.io.EOFException: hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a SequenceFile
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:147)
Caused by: java.io.EOFException: hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a SequenceFile
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
        at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
        at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:338)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:304)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:295)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:198)
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:108)
        ... 1 more
{noformat}
The WAL in question has length 0 (verified via the hadoop ls command) and comes
from a recovered source. There is just 1 log file in the queue (verified via a
heap dump).

We have logic to remove a 0-length log file from the queue when we encounter an
EOFException and logQueue#size is greater than 1. Code snippet below.
{code:java|title=ReplicationSourceWALReader.java|borderStyle=solid}
  // if we get an EOF due to a zero-length log, and there are other logs in queue
  // (highly likely we've closed the current log), we've hit the max retries, and
  // autorecovery is enabled, then dump the log
  private void handleEofException(IOException e) {
    if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
       logQueue.size() > 1 && this.eofAutoRecovery) {
      try {
        if (fs.getFileStatus(logQueue.peek()).getLen() == 0) {
          LOG.warn("Forcing removal of 0 length log in queue: " + logQueue.peek());
          logQueue.remove();
          currentPosition = 0;
        }
      } catch (IOException ioe) {
        LOG.warn("Couldn't get file length information about log " + logQueue.peek());
      }
    }
  }
{code}
This size check is valid for active sources, where the queue must hold at least
one WAL file (the current WAL file). For recovered sources, where the current
WAL file is never added to the queue, we can skip the logQueue#size check.
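
A minimal, self-contained sketch of the proposed behaviour, not the actual
patch. It does not use any HBase classes: the class name EofHandlingSketch, the
"recovered" flag, and the in-memory map that stands in for
fs.getFileStatus(path).getLen() are all assumptions made for illustration.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Hypothetical stand-in for the WAL reader; only the queue handling is modelled.
class EofHandlingSketch {
  private final Deque<String> logQueue = new ArrayDeque<>();
  // Assumption: maps a WAL path to its length, replacing the file system lookup.
  private final Map<String, Long> fileLengths;
  // Assumption: true when this reader serves a recovered (old) source.
  private final boolean recovered;
  private final boolean eofAutoRecovery;
  long currentPosition;

  EofHandlingSketch(boolean recovered, boolean eofAutoRecovery,
      Map<String, Long> fileLengths) {
    this.recovered = recovered;
    this.eofAutoRecovery = eofAutoRecovery;
    this.fileLengths = fileLengths;
  }

  void enqueue(String walPath) {
    logQueue.add(walPath);
  }

  int queueSize() {
    return logQueue.size();
  }

  // Returns true if the head of the queue was a 0-length WAL and was dropped.
  boolean handleEofException(IOException e) {
    boolean isEof = e instanceof EOFException || e.getCause() instanceof EOFException;
    // Active sources keep the current WAL in the queue, so they need more than
    // one entry before the head may be dropped. Recovered sources never hold a
    // current WAL, so a single entry is enough.
    boolean queueAllowsRemoval = recovered ? !logQueue.isEmpty() : logQueue.size() > 1;
    if (!isEof || !eofAutoRecovery || !queueAllowsRemoval) {
      return false;
    }
    Long length = fileLengths.get(logQueue.peek());
    if (length != null && length == 0L) {
      logQueue.remove();
      currentPosition = 0;
      return true;
    }
    return false;
  }
}
```

With a single 0-length WAL in the queue, this sketch drops the file for a
recovered source and keeps it for an active source, which is the distinction
described above.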



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
