[jira] [Commented] (HBASE-26849) NPE caused by WAL Compression and Replication

Bryan Beaudreault (Jira) Fri, 09 Feb 2024 12:25:04 -0800


    [ https://issues.apache.org/jira/browse/HBASE-26849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816207#comment-17816207 ]


Bryan Beaudreault commented on HBASE-26849:
-------------------------------------------

[~tangtianhang] I have been looking at this again. I don't think this bug 
applies to branch-2 today, at least not in the way you describe. Back in 
HBASE-27632, Duo did a bunch of refactoring. There is a new flow now: when we 
set the position back to 0, we also set the state to 
ERROR_AND_RESET_COMPRESSION. That state is handled by calling reader.resetTo 
with a boolean (true when the above state is set), which ensures that the 
CompressionContext is cleared.
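
The flow described above can be sketched roughly as follows. This is a toy 
model and not the branch-2 code: only the names ERROR_AND_RESET_COMPRESSION, 
resetTo and CompressionContext come from this comment. FlowSketch, ToyReader, 
ToyCompressionContext, handle, the resumePosition parameter and the 
RESET_KEEP_CONTEXT state are hypothetical names invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these classes are hypothetical and are not HBase's implementation.
final class FlowSketch {
    enum State {
        NORMAL,
        RESET_KEEP_CONTEXT,          // hypothetical: a reset that keeps the dictionary
        ERROR_AND_RESET_COMPRESSION  // name taken from the comment above
    }

    /** Stand-in for a CompressionContext: just a dictionary of index -> value. */
    static final class ToyCompressionContext {
        final Map<Integer, String> dict = new HashMap<>();

        void clear() {
            dict.clear();
        }
    }

    static final class ToyReader {
        final ToyCompressionContext context = new ToyCompressionContext();
        long position;

        /** Seek to newPosition; drop the dictionary first when resetCompression is true. */
        void resetTo(long newPosition, boolean resetCompression) {
            if (resetCompression) {
                context.clear();
            }
            position = newPosition;
        }
    }

    /** Dispatch on the state; only the compression-reset state rewinds to 0 and clears. */
    static void handle(State state, ToyReader reader, long resumePosition) {
        switch (state) {
            case ERROR_AND_RESET_COMPRESSION:
                reader.resetTo(0L, true);
                break;
            case RESET_KEEP_CONTEXT:
                reader.resetTo(resumePosition, false);
                break;
            default:
                break; // NORMAL: nothing to reset
        }
    }

    public static void main(String[] args) {
        ToyReader reader = new ToyReader();
        reader.context.dict.put(0, "row-a");
        reader.position = 128L;

        handle(State.ERROR_AND_RESET_COMPRESSION, reader, 64L);
        System.out.println(reader.position + " " + reader.context.dict.isEmpty());
    }
}
```

The point of the sketch is that rewinding to 0 and clearing the dictionary 
happen in the same call, so a re-read cannot start with a stale dictionary.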

I have not run this myself yet, but hope to work through it soon. I think we 
should probably remove the warning from our guide.

> NPE caused by WAL Compression and Replication
> ---------------------------------------------
>
>                 Key: HBASE-26849
>                 URL: https://issues.apache.org/jira/browse/HBASE-26849
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication, wal
>    Affects Versions: 1.7.1, 3.0.0-alpha-2, 2.4.11
>            Reporter: tianhang tang
>            Assignee: tianhang tang
>            Priority: Critical
>         Attachments: image-2022-03-16-14-25-49-276.png, 
> image-2022-03-16-14-30-15-247.png
>
>
> My cluster runs HBase 1.4.12 with WAL compression and replication enabled.
> I found that the replication sizeOfLogQueue was backlogged and, after some 
> debugging, found an NPE thrown by 
> [https://github.com/apache/hbase/blob/branch-1/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/LRUDictionary.java#L109]:
> !image-2022-03-16-14-25-49-276.png!
>  
> The root cause for this problem is:
> WALEntryStream#checkAllBytesParsed:
> !image-2022-03-16-14-30-15-247.png!
> resetReader does not create a new reader, so the original CompressionContext 
> and the dictionary in it are retained.
> However, the position is reset to 0 at this point, which means the HLog is 
> read again from the beginning while the uncleared cache is still in use. This 
> causes problems: data that is already in the LRUCache is added to the cache 
> again.
> Recreating the reader here solves the problem.
> I will open a PR later. However, there are other places in the current code 
> that call resetReader or seekOnFs. I suspect this code does not take the 
> WAL compression case into account at all...
>  
> In theory, whenever the file is read again, the LRUCache should also be 
> rolled back; otherwise the READ and WRITE paths will behave inconsistently.
> But while the position can be rolled back to any intermediate position at 
> will, the LRUCache cannot...
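
To make the root cause in the quoted description concrete, here is a small 
self-contained sketch of a dictionary-compressed log being re-read from 
position 0 with and without clearing the reader's dictionary. Everything in 
it is an assumption for illustration: StaleDictionaryDemo, ToyDictionary, 
Entry, Reader, rewind and rereadAfterRewind are hypothetical names, and the 
dictionary evicts round-robin rather than by true LRU. It also shows a 
silently wrong decode rather than the NPE reported in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these classes are HBase's; eviction is round-robin, not LRU.
final class StaleDictionaryDemo {

    /** Bounded dictionary: each new value overwrites the next slot in turn. */
    static final class ToyDictionary {
        private final String[] slots;
        private int next = 0;

        ToyDictionary(int capacity) {
            slots = new String[capacity];
        }

        void add(String value) {
            slots[next] = value;
            next = (next + 1) % slots.length;
        }

        int find(String value) {
            return Arrays.asList(slots).indexOf(value);
        }

        String get(int index) {
            return slots[index];
        }

        void clear() {
            Arrays.fill(slots, null);
            next = 0;
        }
    }

    /** A log entry is either a literal (ref == -1) or a reference to a dictionary slot. */
    static final class Entry {
        final String literal;
        final int ref;

        Entry(String literal, int ref) {
            this.literal = literal;
            this.ref = ref;
        }
    }

    /** Writer side: emit a literal the first time a value is seen, a slot reference after. */
    static List<Entry> encode(List<String> values, int capacity) {
        ToyDictionary dict = new ToyDictionary(capacity);
        List<Entry> log = new ArrayList<>();
        for (String v : values) {
            int idx = dict.find(v);
            if (idx >= 0) {
                log.add(new Entry(null, idx));
            } else {
                dict.add(v);
                log.add(new Entry(v, -1));
            }
        }
        return log;
    }

    /** Reader side: the dictionary is rebuilt from literals as entries are decoded. */
    static final class Reader {
        private final List<Entry> log;
        private final ToyDictionary dict;
        private int position = 0;

        Reader(List<Entry> log, int capacity) {
            this.log = log;
            this.dict = new ToyDictionary(capacity);
        }

        String next() {
            Entry e = log.get(position++);
            if (e.ref >= 0) {
                return dict.get(e.ref);
            }
            dict.add(e.literal);
            return e.literal;
        }

        /** Go back to position 0, optionally dropping the dictionary built so far. */
        void rewind(boolean clearDictionary) {
            if (clearDictionary) {
                dict.clear();
            }
            position = 0;
        }

        List<String> readAll() {
            List<String> out = new ArrayList<>();
            while (position < log.size()) {
                out.add(next());
            }
            return out;
        }
    }

    /** Read one entry, rewind to 0, then decode the whole log. */
    static List<String> rereadAfterRewind(boolean clearDictionary) {
        List<Entry> log = encode(Arrays.asList("a", "b", "a"), 2);
        Reader reader = new Reader(log, 2);
        reader.next();
        reader.rewind(clearDictionary);
        return reader.readAll();
    }

    public static void main(String[] args) {
        System.out.println(rereadAfterRewind(true));  // prints [a, b, a]
        System.out.println(rereadAfterRewind(false)); // prints [a, b, b]
    }
}
```

With the stale dictionary, the literal "a" is added a second time and lands 
in a different slot than the writer used, so the later slot reference 
resolves to the wrong value. Clearing the dictionary on rewind keeps the 
reader in step with the writer.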



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
