[ https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378487#comment-17378487 ]
Hudson commented on HBASE-26075:
--------------------------------

Results for branch branch-1
	[build #148 on builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148/]: (x) *{color:red}-1 overall{color}*
----
details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk7 checks{color}
-- For more information [see jdk7 report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148//JDK7_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> Replication is stuck due to zero length wal file in oldWALs directory.
> ----------------------------------------------------------------------
>
>                 Key: HBASE-26075
>                 URL: https://issues.apache.org/jira/browse/HBASE-26075
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication, wal
>    Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Critical
>
> Recently we encountered a case where the size of the log queue was increasing to around 300 on a few region servers in our production environment.
> There were 295 WALs in the oldWALs directory for that region server, and the *first file* was a 0-length file.
> Replication was throwing the following error.
> {noformat}
> 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] regionserver.ReplicationSourceWALReaderThread - Failed to read stream of replication entries
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: java.io.EOFException: hdfs://<cluster-name>/hbase/oldWALs/<walName> not a SequenceFile
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156)
> Caused by: java.io.EOFException: hdfs://<cluster-name>/hbase/oldWALs/<walName> not a SequenceFile
>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
>         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
>         at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
>         ... 1 more
> {noformat}
> We fixed a similar error via HBASE-25536, but there the zero-length file was in recovered sources.
> There were more logs after the above stack trace.
> {noformat}
> 2021-07-05 03:06:32,757 WARN [20%2C1625185107182,1] regionserver.ReplicationSourceWALReaderThread - Couldn't get file length information about log hdfs://<cluster-name>/hbase/WALs/<walName>
> 2021-07-05 03:06:32,754 INFO [20%2C1625185107182,1] regionserver.WALEntryStream - Log hdfs://<cluster-name>/hbase/WALs/<walName> was moved to hdfs://<cluster-name>/hbase/oldWALs/<walName>
> {noformat}
> There is special logic in the ReplicationSourceWALReader thread to handle 0-length files, but it only looks in the WALs directory and not in the oldWALs directory.
> {code}
>   private boolean handleEofException(Exception e, WALEntryBatch batch) {
>     PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
>     // Dump the log even if logQueue size is 1 if the source is from recovered Source
>     // since we don't add current log to recovered source queue so it is safe to remove.
>     if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
>       (source.isRecovered() || queue.size() > 1) && this.eofAutoRecovery) {
>       Path head = queue.peek();
>       try {
>         if (fs.getFileStatus(head).getLen() == 0) {
>           // head of the queue is an empty log file
>           LOG.warn("Forcing removal of 0 length log in queue: {}", head);
>           logQueue.remove(walGroupId);
>           currentPosition = 0;
>           if (batch != null) {
>             // After we removed the WAL from the queue, we should try shipping the existing batch
>             // of entries
>             addBatchToShippingQueue(batch);
>           }
>           return true;
>         }
>       } catch (IOException ioe) {
>         LOG.warn("Couldn't get file length information about log " + queue.peek(), ioe);
>       } catch (InterruptedException ie) {
>         LOG.trace("Interrupted while adding WAL batch to ship queue");
>         Thread.currentThread().interrupt();
>       }
>     }
>     return false;
>   }
> {code}
> We need to add logic to look for such a 0-length file in the oldWALs directory as well, as sketched below.
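>
> To make the 0-length handling also cover archived files, the fs.getFileStatus(head) call above would need a fallback that resolves the head of the queue against the oldWALs directory before giving up. A minimal sketch of such a lookup follows; it is not the actual HBASE-26075 patch, and the helper name getQueuedWALStatus and the oldWALsDir parameter are illustrative (a real fix would likely derive the archive dir from the cluster root dir, e.g. HConstants.HREGION_OLDLOGDIR_NAME under it).
> {code}
> import java.io.FileNotFoundException;
> import java.io.IOException;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
>   /**
>    * Hypothetical helper, not the HBASE-26075 patch: resolve the status of the
>    * WAL at the head of the log queue, falling back to the archive directory
>    * when the file has already been moved out of the WALs directory.
>    */
>   private FileStatus getQueuedWALStatus(FileSystem fs, Path queuedWal, Path oldWALsDir)
>       throws IOException {
>     try {
>       // Normal case: the WAL still lives under /hbase/WALs/<server>/.
>       return fs.getFileStatus(queuedWal);
>     } catch (FileNotFoundException fnfe) {
>       // The WAL was archived; WAL file names are unique, so look it up under
>       // /hbase/oldWALs/ by name instead of failing the length check.
>       return fs.getFileStatus(new Path(oldWALsDir, queuedWal.getName()));
>     }
>   }
> {code}
> With a helper like this, handleEofException could check getQueuedWALStatus(fs, head, oldWALsDir).getLen() == 0 instead of fs.getFileStatus(head).getLen() == 0, so the forced removal of the empty log also fires after the WAL has been archived.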
>
> How are these 0-length files created?
> When the 0-length WAL was created, I saw the following log messages.
> There were some datanode problems while creating the WAL, but since the pre-sync failure is treated as only an optimization we chugged along, and then rolled the WAL again once the FSHLog#append call failed.
> {noformat}
> 2021-07-02 00:18:35,069 WARN [Thread-17] hdfs.DataStreamer - Could not get block locations. Source file "/hbase/WALs/<walName>" - Aborting...
> 2021-07-02 00:18:35,072 WARN [10.180.160.244:60020] hdfs.DFSClient - Error while syncing
> java.io.IOException: Could not get block locations. Source file "/hbase/WALs/<walName>" - Aborting...
>         at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
>         at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
>         at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
> 2021-07-02 00:18:35,074 WARN [10.180.160.244:60020] wal.FSHLog - pre-sync failed but an optimization so keep going
> java.io.IOException: Could not get block locations. Source file "/hbase/WALs/<walName>" - Aborting...
>         at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
>         at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
>         at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
> ....
> ....
> 2021-07-02 00:18:57,303 WARN [0020.append-pool8-t1] wal.FSHLog - Append sequenceId=311389, requesting roll of WAL
> java.io.IOException: Could not get block locations. Source file "/hbase/WALs/<walName>" - Aborting...
>         at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
>         at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
>         at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
>
> 2021-07-02 00:18:57,473 WARN [.244:60020.logRoller] wal.ProtobufLogWriter - Failed to write trailer, non-fatal, continuing...
> java.io.IOException: Could not get block locations. Source file "/hbase/WALs/<walName>" - Aborting...
>         at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1466)
>         at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1251)
>         at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:670)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)