[ https://issues.apache.org/jira/browse/HBASE-26849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816207#comment-17816207 ]
Bryan Beaudreault commented on HBASE-26849:
-------------------------------------------

[~tangtianhang] I have been looking at this again. I don't think this bug applies to branch-2 today, at least not in the way you describe. Back in HBASE-27632, Duo did a bunch of refactoring. There is a new flow now: when the position is set back to 0, we set the state to ERROR_AND_RESET_COMPRESSION. That state is handled by a call to reader.resetTo, which takes a boolean (true when the above state is set) that ensures the CompressionContext is cleared.

I have not run this myself yet, but hope to work through it soon. I think we should probably remove the warning from our guide.

> NPE caused by WAL Compression and Replication
> ---------------------------------------------
>
>                 Key: HBASE-26849
>                 URL: https://issues.apache.org/jira/browse/HBASE-26849
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication, wal
>    Affects Versions: 1.7.1, 3.0.0-alpha-2, 2.4.11
>            Reporter: tianhang tang
>            Assignee: tianhang tang
>            Priority: Critical
>         Attachments: image-2022-03-16-14-25-49-276.png,
> image-2022-03-16-14-30-15-247.png
>
>
> My cluster runs HBase 1.4.12 with WAL compression and replication enabled.
> I found a replication sizeOfLogQueue backlog and, after some debugging,
> found that the NPE is thrown by
> [https://github.com/apache/hbase/blob/branch-1/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/LRUDictionary.java#L109]:
> !image-2022-03-16-14-25-49-276.png!
>
> The root cause of this problem is in
> WALEntryStream#checkAllBytesParsed:
> !image-2022-03-16-14-30-15-247.png!
> resetReader does not create a new reader, so the original CompressionContext
> and the dict in it are retained.
> However, at this point the position is reset to 0, which means that the HLog
> has to be read from the beginning, while the cache that has not been cleared
> is still in use. This causes problems: the same data is already in the
> LRUCache, and it is added to the cache again.
> Creating a new reader here solves the problem.
> I will open a PR later. However, there are other places in the current code
> that call resetReader or seekOnFs. I guess this code does not take the
> WAL compression case into account at all...
>
> In theory, whenever the file is read again, the LRUCache should also be
> rolled back; otherwise the READ and WRITE paths will behave inconsistently.
> However, the position can be rolled back to any intermediate position at
> will, while the LRUCache cannot...

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
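
For illustration, the failure mode from the report and the effect of the reset flag from the comment can be sketched with a toy model. This is not HBase code: WalDictionarySketch, Entry, ToyReader, and the helper method are hypothetical names, and the dictionary here is a plain append-only list rather than LRUDictionary. Only the idea of a resetTo call carrying a boolean that clears the compression state is taken from the comment. In this toy the symptom is a wrong decode, not the NPE seen in the report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WalDictionarySketch {

  /** Toy WAL entry: either a literal (appended to the dictionary) or a dictionary index. */
  static final class Entry {
    final String literal; // non-null means "new value, add it to the dictionary"
    final int index;      // only meaningful when literal is null

    private Entry(String literal, int index) {
      this.literal = literal;
      this.index = index;
    }

    static Entry literal(String value) { return new Entry(value, -1); }
    static Entry ref(int index) { return new Entry(null, index); }
  }

  /** Toy reader whose dictionary must stay in step with the read position. */
  static final class ToyReader {
    private final List<Entry> log;
    private final List<String> dictionary = new ArrayList<>();
    private int position = 0;

    ToyReader(List<Entry> log) { this.log = log; }

    String next() {
      Entry e = log.get(position++);
      if (e.literal != null) {
        dictionary.add(e.literal); // mirrors what the writer did at this point
        return e.literal;
      }
      return dictionary.get(e.index);
    }

    /** Assumed shape of the reset: the flag decides whether the dictionary is cleared. */
    void resetTo(int newPosition, boolean resetCompression) {
      position = newPosition;
      if (resetCompression) {
        dictionary.clear();
      }
    }
  }

  /** Reads one entry, rewinds to position 0, then decodes the whole toy log. */
  public static List<String> readOneThenRewindAndReadAll(boolean resetCompression) {
    List<Entry> log = Arrays.asList(Entry.literal("a"), Entry.literal("b"), Entry.ref(1));
    ToyReader reader = new ToyReader(log);
    reader.next();                       // dictionary now holds "a"
    reader.resetTo(0, resetCompression); // rewind to the start of the log
    List<String> decoded = new ArrayList<>();
    for (int i = 0; i < log.size(); i++) {
      decoded.add(reader.next());
    }
    return decoded;
  }

  public static void main(String[] args) {
    System.out.println(readOneThenRewindAndReadAll(true));  // [a, b, b]
    System.out.println(readOneThenRewindAndReadAll(false)); // [a, b, a]
  }
}
```

Without clearing, the rewound reader adds "a" a second time, so index 1 now points at the duplicate "a" instead of "b". Clearing the dictionary on a rewind to position 0 keeps the reader's indexes aligned with the writer's. As the report notes, a rewind to an intermediate position cannot be handled this way, because the dictionary cannot be rolled back to an arbitrary point.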