[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.
[ https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381772#comment-17381772 ]

Bharath Vissapragada commented on HBASE-26075:
----------------------------------------------

[~shahrs87] Can you please submit the PRs for master/branch-2 as part of the child task? Marking this resolved for 1.7.1; the release work is in progress.

> Replication is stuck due to zero length wal file in oldWALs directory.
> ----------------------------------------------------------------------
>
>                 Key: HBASE-26075
>                 URL: https://issues.apache.org/jira/browse/HBASE-26075
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication, wal
>    Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Critical
>
> Recently we encountered a case where the size of the log queue was increasing to
> around 300 on a few region servers in our production environment.
> There were 295 WALs in the oldWALs directory for that region server, and the
> *first file* was a 0 length file.
> Replication was throwing the following error.
> {noformat}
> 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] regionserver.ReplicationSourceWALReaderThread - Failed to read stream of replication entries
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: java.io.EOFException: hdfs:///hbase/oldWALs/ not a SequenceFile
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156)
> Caused by: java.io.EOFException: hdfs:///hbase/oldWALs/ not a SequenceFile
>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
>         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
>         at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
>         at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
>         ... 1 more
> {noformat}
> We fixed a similar error via HBASE-25536, but there the zero length file was in
> recovered sources.
> There were more logs after the above stack trace.
> {noformat}
> 2021-07-05 03:06:32,757 WARN [20%2C1625185107182,1] regionserver.ReplicationSourceWALReaderThread - Couldn't get file length information about log hdfs:///hbase/WALs/
> 2021-07-05 03:06:32,754 INFO [20%2C1625185107182,1] regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ was moved to hdfs:///hbase/oldWALs/
> {noformat}
> There is special logic in the ReplicationSourceWALReader thread to handle 0
> length files, but we only look in the WALs directory and not in the oldWALs directory.
> {code}
> private boolean handleEofException(Exception e, WALEntryBatch batch) {
>   PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
>   // Dump the log even if logQueue size is 1 if the source is from recovered Source
>   // since we don't add current log to recovered source queue so it is safe to remove.
>   if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
>       (source.isRecovered() || queue.size() > 1) &&
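The description points out that the zero-length-file handling only consults the WALs directory, while the stuck file had already been archived to oldWALs. A minimal sketch of the idea behind the fix follows — the class and method names are hypothetical (not the actual HBase patch), and java.nio local-filesystem paths stand in for HDFS paths: resolve the WAL against both the active directory and the archive before deciding whether it is an empty file that replication can safely skip.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical sketch: when the WAL reader hits an EOFException, look for the
 * offending file in BOTH the active WALs directory and the oldWALs archive.
 * A zero-length file carries no entries, so it is safe for replication to skip.
 */
class ZeroLengthWalCheck {

    /** Returns true if the named WAL exists in either directory and is empty. */
    static boolean isZeroLengthWal(Path walsDir, Path oldWalsDir, String walName)
            throws IOException {
        // Check the active directory first; fall back to the archive,
        // mirroring how a WAL is moved from WALs/ to oldWALs/.
        for (Path dir : new Path[] { walsDir, oldWalsDir }) {
            Path candidate = dir.resolve(walName);
            if (Files.exists(candidate)) {
                return Files.size(candidate) == 0;
            }
        }
        return false; // not found in either directory
    }
}
```

With a check like this, the EOF handler would no longer conclude "file not found" merely because the empty WAL had been archived.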
[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.
[ https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378487#comment-17378487 ]

Hudson commented on HBASE-26075:
--------------------------------

Results for branch branch-1 [build #148 on builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148/]: (x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk7 checks{color}
-- For more information [see jdk7 report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148//JDK7_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.
[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.
[ https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378373#comment-17378373 ]

Rushabh Shah commented on HBASE-26075:
--------------------------------------

Thank you [~bharathv] for the review and commit. Thank you [~gjacoby] for the review.

> Rushabh Shah Please submit a patch for master once you are freed up, merged
> this in branch-1 for now. Thanks.

Yes, will do early next week.
[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.
[ https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378324#comment-17378324 ]

Bharath Vissapragada commented on HBASE-26075:
----------------------------------------------

[~shahrs87] Please submit a patch for master once you are freed up; merged this in branch-1 for now. Thanks.
[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.
[ https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377535#comment-17377535 ]

Rushabh Shah commented on HBASE-26075:
--------------------------------------

I am working on the patch now. I will create a PR for branch-1 first and then for the master branches. Will update here ASAP.
[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.
[ https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377533#comment-17377533 ]

Bharath Vissapragada commented on HBASE-26075:
----------------------------------------------

Hoping to get an RC out before the end of this week (unless there are any unforeseen delays in the process). Any chance this can be committed before then? I can help review the patch.
[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.
[ https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377509#comment-17377509 ]

Rushabh Shah commented on HBASE-26075:
--------------------------------------

[~bharathv] I know you are almost finished with the 1.7.1 release process. I think this is a critical issue which will block replication for days. Do you think we can add this to the 1.7.1 release?