Rushabh Shah created HBASE-25536:
------------------------------------
Summary: Remove 0 length wal file from queue if it belongs to old sources.
Key: HBASE-25536
URL: https://issues.apache.org/jira/browse/HBASE-25536
Project: HBase
Issue Type: Improvement
Components: Replication
Affects Versions: 1.6.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah
Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.3.5, 2.4.2
In our production clusters, we found a case where a RegionServer does not remove a 0-length WAL file from the replication queue (the in-memory queue, not the ZK replication queue) when the logQueue size is 1.
Stack trace below:
{noformat}
2021-01-28 14:44:18,434 ERROR [,60020,1609950703085] regionserver.ReplicationSourceWALReaderThread - Failed to read stream of replication entries
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: java.io.EOFException: hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a SequenceFile
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:147)
Caused by: java.io.EOFException: hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a SequenceFile
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:338)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:304)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:295)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:198)
	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:108)
	... 1 more
{noformat}
The WAL in question has length 0 (verified via the hadoop ls command) and belongs to a recovered source. There is just one log file in the queue (verified via a heap dump).
We already have logic to remove a 0-length log file from the queue when we encounter an EOFException and logQueue#size is greater than 1. Code snippet below:
{code:java|title=ReplicationSourceWALReader.java|borderStyle=solid}
  // if we get an EOF due to a zero-length log, and there are other logs in queue
  // (highly likely we've closed the current log), we've hit the max retries, and
  // autorecovery is enabled, then dump the log
  private void handleEofException(IOException e) {
    if ((e instanceof EOFException || e.getCause() instanceof EOFException)
        && logQueue.size() > 1 && this.eofAutoRecovery) {
      try {
        if (fs.getFileStatus(logQueue.peek()).getLen() == 0) {
          LOG.warn("Forcing removal of 0 length log in queue: " + logQueue.peek());
          logQueue.remove();
          currentPosition = 0;
        }
      } catch (IOException ioe) {
        LOG.warn("Couldn't get file length information about log " + logQueue.peek());
      }
    }
  }
{code}
This size check is valid for active sources, which must retain at least one WAL file in the queue (the current WAL being written). For recovered sources, however, the current WAL file is never added to the queue, so the logQueue#size check can be skipped.
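A minimal, standalone sketch of the relaxed guard (names such as {{shouldDropHeadWal}}, the {{recoveredSource}} flag, and the in-memory length map are illustrative assumptions, not the actual HBase patch): for a recovered source the queue never holds the WAL currently being written, so a lone 0-length file may be dropped.

{code:java|title=EofRecoverySketch.java (illustrative)|borderStyle=solid}
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Hypothetical model of the proposed guard; not HBase API.
public class EofRecoverySketch {

  /** Drops the head-of-queue WAL if it is 0 length and safe to remove. */
  static boolean shouldDropHeadWal(IOException e, Deque<String> logQueue,
      Map<String, Long> walLengths, boolean recoveredSource, boolean eofAutoRecovery) {
    boolean eof = e instanceof EOFException || e.getCause() instanceof EOFException;
    // Active sources must keep at least one WAL (the current one), so the
    // size > 1 guard stays; a recovered source has no current WAL in its
    // queue, so a single 0-length file may legitimately be dropped.
    boolean sizeOk = logQueue.size() > 1 || recoveredSource;
    String head = logQueue.peek();
    if (eof && sizeOk && eofAutoRecovery && head != null
        && walLengths.getOrDefault(head, -1L) == 0L) {
      logQueue.remove();
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    Deque<String> queue = new ArrayDeque<>();
    queue.add("oldWALs/rs%2C60020.1606852981112");
    Map<String, Long> lengths = Map.of("oldWALs/rs%2C60020.1606852981112", 0L);
    // Recovered source with a single 0-length WAL: now removable.
    System.out.println(shouldDropHeadWal(new EOFException("not a SequenceFile"),
        queue, lengths, true, true)); // prints "true"
  }
}
{code}

With {{recoveredSource}} set to false and a single-element queue, the sketch keeps today's behavior and leaves the file in place.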
--
This message was sent by Atlassian Jira
(v8.3.4#803005)