[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-15 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381772#comment-17381772
 ] 

Bharath Vissapragada commented on HBASE-26075:
--

[~shahrs87] Can you please submit the PRs for master/branch-2 as part of the 
child task? Marking this resolved for 1.7.1; the release work is in progress.

> Replication is stuck due to zero length wal file in oldWALs directory.
> --
>
> Key: HBASE-26075
> URL: https://issues.apache.org/jira/browse/HBASE-26075
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.3.5, 2.4.4
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Critical
>
> Recently we encountered a case where the size of the log queue was increasing to 
> around 300 on a few region servers in our production environment.
> There were 295 WALs in the oldWALs directory for that region server and the 
> *first file* was a 0 length file.
> Replication was throwing the following error.
> {noformat}
> 2021-07-05 03:06:32,757 ERROR [20%2C1625185107182,1] 
> regionserver.ReplicationSourceWALReaderThread - Failed to read stream of 
> replication entries
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException:
>  java.io.EOFException: hdfs:///hbase/oldWALs/ not a 
> SequenceFile
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:112)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:156)
> Caused by: java.io.EOFException: 
> hdfs:///hbase/oldWALs/ not a SequenceFile
> at 
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
> at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
> at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
> at 
> org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:352)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:341)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:359)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:316)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:306)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:207)
> at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
> ... 1 more
> {noformat}
> We fixed a similar error via HBASE-25536, but the zero length file was in 
> recovered sources.
> There were more logs after the above stack trace.
> {noformat}
> 2021-07-05 03:06:32,757 WARN  [20%2C1625185107182,1] 
> regionserver.ReplicationSourceWALReaderThread - Couldn't get file length 
> information about log 
> hdfs:///hbase/WALs/
> 2021-07-05 03:06:32,754 INFO  [20%2C1625185107182,1] 
> regionserver.WALEntryStream - Log hdfs:///hbase/WALs/ 
> was moved to hdfs:///hbase/oldWALs/
> {noformat}
> There is special logic in the ReplicationSourceWALReader thread to handle 0 
> length files, but we only look in the WALs directory and not in the oldWALs directory.
> {code}
>   private boolean handleEofException(Exception e, WALEntryBatch batch) {
>     PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId);
>     // Dump the log even if logQueue size is 1 if the source is from recovered Source
>     // since we don't add current log to recovered source queue so it is safe to remove.
>     if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
>       (source.isRecovered() || queue.size() > 1) &&
> {code}
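
As context for the behavior described above, here is a minimal, illustrative sketch (not the actual HBase patch) of the kind of length probe the description implies: before deciding that a 0-length WAL at the head of the replication queue can be skipped, look for the file under both the active WALs directory and the oldWALs archive directory. The class, method, and parameter names below (WalLengthProbe, walsDir, oldWalsDir) are hypothetical.

{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustrative helper only: probes a WAL's length in both the active WALs
 * directory and the oldWALs archive directory. Names and directory layout are
 * assumptions for this sketch, not HBase's actual implementation.
 */
public final class WalLengthProbe {

  private WalLengthProbe() {
  }

  /**
   * Returns the length of the named WAL if it exists under walsDir or, failing
   * that, under oldWalsDir; returns -1 if it is missing from both.
   */
  public static long getWalLength(FileSystem fs, Path walsDir, Path oldWalsDir,
      String walName) throws IOException {
    Path active = new Path(walsDir, walName);
    if (fs.exists(active)) {
      FileStatus status = fs.getFileStatus(active);
      return status.getLen();
    }
    // The WAL may already have been archived (moved to oldWALs), as in the
    // "was moved to" log line quoted above.
    Path archived = new Path(oldWalsDir, walName);
    if (fs.exists(archived)) {
      FileStatus status = fs.getFileStatus(archived);
      return status.getLen();
    }
    return -1L;
  }

  /**
   * A 0-length WAL cannot contain replication entries, so it is a candidate
   * for being skipped/removed from the head of the log queue.
   */
  public static boolean isEmptyWal(FileSystem fs, Path walsDir, Path oldWalsDir,
      String walName) throws IOException {
    return getWalLength(fs, walsDir, oldWalsDir, walName) == 0L;
  }
}
{code}

In the real code path the lookup goes through the WALEntryStream/ReplicationSource abstractions; the sketch only shows the two-directory probe that the description says is missing for oldWALs.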

[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-10 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378487#comment-17378487
 ] 

Hudson commented on HBASE-26075:


Results for branch branch-1
[build #148 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148/]:
 (x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148//General_Nightly_Build_Report/]


(x) {color:red}-1 jdk7 checks{color}
-- For more information [see jdk7 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148//JDK7_Nightly_Build_Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/148//JDK8_Nightly_Build_Report_(Hadoop2)/]




(/) {color:green}+1 source release artifact{color}
-- See build output for details.


[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-09 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378373#comment-17378373
 ] 

Rushabh Shah commented on HBASE-26075:
--

Thank you [~bharathv] for the review and commit. Thank you [~gjacoby] for the 
review.

> Rushabh Shah Please submit a patch for master once you are freed up, merged 
> this in branch-1 for now. Thanks.
Yes, will do early next week.


[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-09 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378324#comment-17378324
 ] 

Bharath Vissapragada commented on HBASE-26075:
--

[~shahrs87] Please submit a patch for master once you are freed up; I merged this 
into branch-1 for now. Thanks.

[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-08 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377535#comment-17377535
 ] 

Rushabh Shah commented on HBASE-26075:
--

I am working on the patch now. I will create a PR for branch-1 first and then for the 
master branches. Will update here ASAP.

[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-08 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377533#comment-17377533
 ] 

Bharath Vissapragada commented on HBASE-26075:
--

Hoping to get an RC out before the end of this week (unless there are any 
unforeseen delays in the process). Any chance this can be committed before 
then? I can help review the patch.

[jira] [Commented] (HBASE-26075) Replication is stuck due to zero length wal file in oldWALs directory.

2021-07-08 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377509#comment-17377509
 ] 

Rushabh Shah commented on HBASE-26075:
--

[~bharathv] I know you are almost finished with the 1.7.1 release process. I think 
this is a critical issue which will block replication for days. Do you think we 
can add this to the 1.7.1 release?
