[ 
https://issues.apache.org/jira/browse/HBASE-26843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507953#comment-17507953
 ] 

tianhang tang commented on HBASE-26843:
---------------------------------------

If you use WAL Compression, I think this issue might be related to 
[HBASE-26849|https://issues.apache.org/jira/browse/HBASE-26849].

When you get a NullPointerException or IndexOutOfBoundsException, 
ProtobufLogReader will wrap it as an EOFException and try to roll back the 
position and try again.
But the core problem is that your LRUCache has been polluted, and retrying will 
not solve the problem, you will only get a constant backlog of logQueue.
In other words, you can also try throwing the original Exception directly, so 
that ReplicationSourceWALReaderThread will catch a 
WALEntryStreamRuntimeException and try again. At this layer the reader will be 
recreated (you will get a clean dict), and then you may find that there is no 
more log backlog.

> When encounters EOF exception, replication log will not be dequeued
> -------------------------------------------------------------------
>
>                 Key: HBASE-26843
>                 URL: https://issues.apache.org/jira/browse/HBASE-26843
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.4.10
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>             Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3, 2.4.12
>
>
> We met this problem on our production clusters, and in the pre-commit tests, 
> it also occured. 
> We can see this test logs,
>  
> {code:java}
> 2022-03-14T15:31:16,225 INFO  [Time-limited test] wal.AbstractFSWAL(840): 
> Rolled WAL 
> /home/jenkins/jenkins-home/workspace/Base-PreCommit-GitHub-PR_PR-4219/yetus-jdk8-hadoop3-check/src/hbase-server/target/test-data/b3808e70-5b17-10bb-3bd3-fdd825fd8fab/WALs/testSizeOfLogQueue/testSizeOfLogQueue.1647271876165
>  with entries=1, filesize=210 B; new WAL 
> /home/jenkins/jenkins-home/workspace/Base-PreCommit-GitHub-PR_PR-4219/yetus-jdk8-hadoop3-check/src/hbase-server/target/test-data/b3808e70-5b17-10bb-3bd3-fdd825fd8fab/WALs/testSizeOfLogQueue/testSizeOfLogQueue.1647271876191
> 2022-03-14T15:31:16,225 DEBUG [Time-limited test] wal.AbstractFSWAL(927): 
> Create new AsyncFSWAL writer with pipeline: 
> [DatanodeInfoWithStorage[127.0.0.1:34961,DS-77496ab4-d8f2-4bee-af67-6f1c3e08de20,DISK],
>  
> DatanodeInfoWithStorage[127.0.0.1:38455,DS-50a46a7f-6202-4177-a91f-7da535dbf07f,DISK],
>  
> DatanodeInfoWithStorage[127.0.0.1:40111,DS-d41b66a9-f9d5-4a57-9969-ac93cae6c9ac,DISK]]
> 2022-03-14T15:31:16,338 WARN  [Close-WAL-Writer-0] 
> asyncfs.FanOutOneBlockAsyncDFSOutputHelper(593): complete file 
> /home/jenkins/jenkins-home/workspace/Base-PreCommit-GitHub-PR_PR-4219/yetus-jdk8-hadoop3-check/src/hbase-server/target/test-data/b3808e70-5b17-10bb-3bd3-fdd825fd8fab/WALs/testSizeOfLogQueue/testSizeOfLogQueue.1647271876165
>  not finished, retry = 0
> 2022-03-14T15:31:16,357 DEBUG [Time-limited test] wal.ProtobufLogReader(456): 
> Encountered a malformed edit, seeking back to last good position in file, 
> from 211 to 210
> java.io.EOFException: Partial PB while reading WAL, probably an unexpected 
> EOF, ignoring. current offset=211
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:399)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:105) 
> ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:93) 
> ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:264)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:178)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:103)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testSizeOfLogQueue(TestBasicWALEntryStream.java:698)
>  ~[test-classes/:?]
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_282]
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_282]
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_282]
>       at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_282]
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>  ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>  ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
> ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>  ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) 
> ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>  ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>  ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) 
> ~[junit-4.13.2.jar:4.13.2]
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) 
> ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.apache.hadoop.hbase.SystemExitRule$1.evaluate(SystemExitRule.java:38) 
> ~[hbase-common-3.0.0-alpha-3-SNAPSHOT-tests.jar:3.0.0-alpha-3-SNAPSHOT]
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
>  ~[junit-4.13.2.jar:4.13.2]
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
>  ~[junit-4.13.2.jar:4.13.2]
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> ~[?:1.8.0_282]
>       at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_282]
> 2022-03-14T15:31:16,511 DEBUG [Time-limited test] wal.AbstractFSWAL(1050): 
> Moved 2 WAL file(s) to 
> /home/jenkins/jenkins-home/workspace/Base-PreCommit-GitHub-PR_PR-4219/yetus-jdk8-hadoop3-check/src/hbase-server/target/test-data/b3808e70-5b17-10bb-3bd3-fdd825fd8fab/oldWALs
> 2022-03-14T15:31:16,511 INFO  [Time-limited test] wal.AbstractFSWAL(1053): 
> Closed WAL: AsyncFSWAL testSizeOfLogQueue:(num 1647271876191)
>  {code}
> And this is the failure info,
>  
>  
> {code:java}
> Failedprecommit checks / yetus jdk8 Hadoop3 checks / 
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testSizeOfLogQueueFailing
>  for the past 1 build (Since #1 )运行时间:0.41 secError Messageexpected:<1> but 
> was:<2>Stacktracejava.lang.AssertionError: expected:<1> but was:<2>
>       at org.junit.Assert.fail(Assert.java:89)
>       at org.junit.Assert.failNotEquals(Assert.java:835)
>       at org.junit.Assert.assertEquals(Assert.java:647)
>       at org.junit.Assert.assertEquals(Assert.java:633)
>       at 
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testSizeOfLogQueue(TestBasicWALEntryStream.java:701)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>       at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
>       at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>  {code}
> The link is 
> [https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4219/1/testReport/org.apache.hadoop.hbase.replication.regionserver/TestBasicWALEntryStreamFSHLog/precommit_checks___yetus_jdk8_Hadoop3_checks___testSizeOfLogQueue/]
>  
>  
> I can not figure out what the problem is yet, but we can change the UT to 
> find more info.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to