Junegunn Choi created HBASE-29480:
-------------------------------------
Summary: testCleanClosedWALs fails on branch-2.5
Key: HBASE-29480
URL: https://issues.apache.org/jira/browse/HBASE-29480
Project: HBase
Issue Type: Bug
Reporter: Junegunn Choi
h2. Problem
I'm unable to make {{testCleanClosedWALs}} pass on branch-2.5.
{noformat}
[ERROR] Failures:
[ERROR]
org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamAsyncFSWAL.testCleanClosedWALs
[ERROR] Run 1:
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
expected:<0> but was:<1>
[ERROR] Run 2:
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
expected:<0> but was:<1>
[ERROR] Run 3:
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
expected:<0> but was:<1>
[ERROR] Run 4:
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
expected:<0> but was:<1>
[INFO]
[ERROR]
org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testCleanClosedWALs
[ERROR] Run 1:
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727
expected:<0> but was:<1>
[ERROR] Run 2:
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727
expected:<0> but was:<1>
[ERROR] Run 3:
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727
expected:<0> but was:<1>
[ERROR] Run 4:
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727
expected:<0> but was:<1>
{noformat}
h2. Workaround
If I add a short sleep after {{rollWriter}} (as suggested in HBASE-25932), the
tests pass.
{code:java}
diff --git
a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
index a997cd0a9d..7c47732259 100644
---
a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
+++
b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
@@ -722,6 +722,7 @@ public abstract class TestBasicWALEntryStream extends
WALEntryStreamTestBase {
appendToLogAndSync();
assertNotNull(entryStream.next());
log.rollWriter();
+ Thread.sleep(10);
appendToLogAndSync();
assertNotNull(entryStream.next());
assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
{code}
h2. Analysis
When the test fails, the following log appears:
{noformat}
2025-07-29T14:21:12,524 DEBUG [Time-limited test] wal.ProtobufLogReader(447):
Encountered a malformed edit, seeking back to last good position in file, from
218 to 210
java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210,
currentPosition=218
at
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376)
~[classes/:?]
at
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104)
~[classes/:?]
at
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92)
~[classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259)
~[classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181)
~[classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102)
~[classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111)
~[classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118)
~[classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82)
~[test-classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95)
~[test-classes/:?]
at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184)
~[test-classes/:?]
at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135)
~[test-classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94)
~[test-classes/:?]
at
org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:727)
~[test-classes/:?]
{noformat}
Which led me to this code:
[https://github.com/apache/hbase/blob/c319c6f0a316446342fd9a03ba72d14e8dc7cb36/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L372-L376]
The logic here seems to contradict the comment ("only rethrow EOF if it
indicates an EOF") and throw EOF even when it has reached the WAL trailer. (Or
maybe I misread the comment?)
If I change the code to _not_ throw the exception when the WAL trailer is
found, the test passes.
While I have a limited understanding of this area, this behavior seems
reasonable: if we have reached the trailer, there's no more entry, so we can
just return {{{}false{}}}.
Any thoughts?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)