[jira] [Updated] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-09 Thread Bharath Vissapragada (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Vissapragada updated HBASE-26075:
-
Release Note: Handles 0-length WAL files moved to the oldWALs directory so that they do not block the replication queue.
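
For anyone hitting this before picking up the fix, a quick way to confirm the condition is to look for 0-length files under the oldWALs directory and compare them against the head of the stuck replication queue. The standalone check below is only illustrative (the class name and default path are examples, not part of HBase); it uses the plain Hadoop FileSystem API.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FindEmptyOldWals {
  public static void main(String[] args) throws Exception {
    // Example location; adjust to the cluster's hbase.rootdir layout.
    Path oldWals = new Path(args.length > 0 ? args[0] : "/hbase/oldWALs");
    FileSystem fs = FileSystem.get(new Configuration());
    // A 0-length WAL sitting at the head of a replication queue is what
    // blocks shipping until it is skipped or removed.
    for (FileStatus status : fs.listStatus(oldWals)) {
      if (status.isFile() && status.getLen() == 0) {
        System.out.println("Zero length WAL: " + status.getPath());
      }
    }
  }
}
{code}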

> Replication is stuck due to zero length wal file in oldWALs directory.
> --
>
> Key: HBASE-26075
> URL: https://issues.apache.org/jira/browse/HBASE-26075
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Critical
>
> Recently we encountered a case where the size of the log queue was increasing to around 300 on a few region servers in our production environment.
> There were 295 WALs in the oldWALs directory for that region server, and the *first file* was a 0-length file.
> Replication was throwing the following error.
> {noformat}
> 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] regionserver.ReplicationSourceWALReaderThread - Failed to read stream of replication entries
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: java.io.EOFException: hdfs:///hbase/oldWALs/ not a SequenceFile
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156)
> Caused by: java.io.EOFException: hdfs:///hbase/oldWALs/ not a SequenceFile
>   at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
>   at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
>   at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
>   at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
>   at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
>   at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
>   ... 1 more
> {noformat}
> We fixed a similar error via HBASE-25536, but in that case the zero-length file was in the recovered sources.
> There were more logs after the above stack trace.
> {noformat}
> 2021-07-05 03:06:32,757 WARN  [20%2C1625185107182,1] regionserver.ReplicationSourceWALReaderThread - Couldn't get file length information about log hdfs:///hbase/WALs/
> 2021-07-05 03:06:32,754 INFO  [20%2C1625185107182,1] regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ was moved to hdfs:///hbase/oldWALs/
> {noformat}
> There is special logic in the ReplicationSourceWALReader thread to handle 0-length files, but it only looks in the WALs directory and not in the oldWALs directory.
> {code}
>   private boolean handleEofException(Exception e, WALEntryBatch batch) {
>     PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
>     // Dump the log even if logQueue size is 1 if the source is from recovered Source
>     // since we don't add current log to recovered source queue so it is safe to remove.
>     if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
>       (source.isRecovered() || queue.size() > 1) && this.eofAutoRecovery) {

[jira] [Updated] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-08 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HBASE-26075:
-
Description: 
Recently we encountered a case where the size of the log queue was increasing to around 300 on a few region servers in our production environment.

There were 295 WALs in the oldWALs directory for that region server, and the *first file* was a 0-length file.

Replication was throwing the following error.

{noformat}
2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] regionserver.ReplicationSourceWALReaderThread - Failed to read stream of replication entries
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: java.io.EOFException: hdfs:///hbase/oldWALs/ not a SequenceFile
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112)
  at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156)
Caused by: java.io.EOFException: hdfs:///hbase/oldWALs/ not a SequenceFile
  at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
  at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
  at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
  at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
  at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
  at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
  at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
  at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
  at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
  at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
  at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
  at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352)
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341)
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359)
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316)
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306)
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207)
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
  ... 1 more
{noformat}

We fixed a similar error via HBASE-25536, but in that case the zero-length file was in the recovered sources.

There were more logs after the above stack trace.

{noformat}
2021-07-05 03:06:32,757 WARN  [20%2C1625185107182,1] regionserver.ReplicationSourceWALReaderThread - Couldn't get file length information about log hdfs:///hbase/WALs/
2021-07-05 03:06:32,754 INFO  [20%2C1625185107182,1] regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ was moved to hdfs:///hbase/oldWALs/
{noformat}


There is special logic in the ReplicationSourceWALReader thread to handle 0-length files, but it only looks in the WALs directory and not in the oldWALs directory.

{code}
  private boolean handleEofException(Exception e, WALEntryBatch batch) {
    PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
    // Dump the log even if logQueue size is 1 if the source is from recovered Source
    // since we don't add current log to recovered source queue so it is safe to remove.
    if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
      (source.isRecovered() || queue.size() > 1) && this.eofAutoRecovery) {
      Path head = queue.peek();
      try {
        if (fs.getFileStatus(head).getLen() == 0) {
          // head of the queue is an empty log file
          LOG.warn("Forcing removal of 0 length log in queue: {}", head);
          logQueue.remove(walGroupId);
          currentPosition = 0;
          if (batch != null) {
            // After we removed the WAL from the queue, we should try shipping the existing batch of entries
            addBatchToShippingQueue(batch);
          }
          return true;
        }
      } catch (IOException ioe) {
        LOG.warn("Couldn't get file length information about log " + queue.peek(), ioe);
      } catch (InterruptedException ie) {
        LOG.trace("Interrupted while adding WAL batch to ship queue");
      }
    }
    return false;
  }
{code}

[jira] [Updated] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-08 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HBASE-26075:
-
Affects Version/s: 3.0.0-alpha-1
   2.3.5
   2.4.4

> Replication is stuck due to zero length wal file in oldWALs directory.
> --
>
> Key: HBASE-26075
> URL: https://issues.apache.org/jira/browse/HBASE-26075
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Critical
>
> Recently we encountered a case where the size of the log queue was increasing to around 300 on a few region servers in our production environment.
> There were 295 WALs in the oldWALs directory for that region server, and the *first file* was a 0-length file.
> Replication was throwing the following error.
> {noformat}
> 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] regionserver.ReplicationSourceWALReaderThread - Failed to read stream of replication entries
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: java.io.EOFException: hdfs:///hbase/oldWALs/ not a SequenceFile
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156)
> Caused by: java.io.EOFException: hdfs:///hbase/oldWALs/ not a SequenceFile
>   at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
>   at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
>   at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
>   at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
>   at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
>   at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
>   at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
>   ... 1 more
> {noformat}
> We fixed a similar error via HBASE-25536, but in that case the zero-length file was in the recovered sources.
> There were more logs after the above stack trace.
> {noformat}
> 2021-07-05 03:06:32,757 WARN  [20%2C1625185107182,1] regionserver.ReplicationSourceWALReaderThread - Couldn't get file length information about log hdfs:///hbase/WALs/
> 2021-07-05 03:06:32,754 INFO  [20%2C1625185107182,1] regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ was moved to hdfs:///hbase/oldWALs/
> {noformat}
> There is special logic in the ReplicationSourceWALReader thread to handle 0-length files, but it only looks in the WALs directory and not in the oldWALs directory.
> {code}
>   private boolean handleEofException(Exception e, WALEntryBatch batch) {
>     PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
>     // Dump the log even if logQueue size is 1 if the source is from recovered Source
>     // since we don't add current log to recovered source queue so it is safe to remove.
>     if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
>       (source.isRecovered() || queue.size() > 1) && this.eofAutoRecovery) {
>       Path head = queue.peek();
>   try