BTW, I think this issue might also be helpful for HBASE-15983.

唐天航 <tangtianhang...@gmail.com> wrote on Thu, Mar 17, 2022 at 10:33:
> Hi xiaolin,
>
> I have read HBASE-26843 just now.
>
> If you use WAL Compression, I think this issue might be related to
> HBASE-26849.
> When you get a NullPointerException or IndexOutOfBoundsException from
> LRUDictionary, ProtobufLogReader will wrap it as an EOFException, roll
> back the position, and try again.
> But the core problem is that your LRUCache has been polluted, so retrying
> will not solve the problem; you will only get an ever-increasing queue
> overstock.
> In other words, you can also try throwing the original exception directly,
> so that ReplicationSourceWALReaderThread will catch a
> WALEntryStreamRuntimeException and try again. At this layer the reader will
> be recreated (you will get a clean dict), and then you may find that there
> is no more log overstock.
>
>
> 唐天航 <tangtianhang...@gmail.com> wrote on Wed, Mar 16, 2022 at 23:10:
>
>> No problem, I will read the details about HBASE-26843 tomorrow.
>> And I think you could try HBASE-26849, but after all, my suggestion is
>> to disable WAL Compression if you need to use replication...
>>
>> Xiaolin Ha <summer.he...@gmail.com> wrote on Wed, Mar 16, 2022 at 22:40:
>>
>>> We have also used replication + WAL compression in our production clusters.
>>> It really seems unstable when replicating compressed cells.
>>> I'm digging into the log queue overstock issues, and pointed out one in
>>> issue HBASE-26843, which might be related to HBASE-26849.
>>> We can discuss more details if you are interested in this problem.
>>> Thanks.
>>>
>>> Regards
>>>
>>> 张铎(Duo Zhang) <palomino...@gmail.com> wrote on Wed, Mar 16, 2022 at 18:36:
>>>
>>> > +1 on updating the doc first. You can file an issue for the documentation
>>> > change, and let's also send a NOTICE email to both the dev and user lists
>>> > to warn our users about this.
>>> >
>>> > 唐天航 <tangtianhang...@gmail.com> wrote on Wed, Mar 16, 2022 at 18:08:
>>> >
>>> > > If we only reset the position to the head, yes we can fix it.
>>> > > In fact, 26849 is to fix the problem in this scenario.
>>> > > But unfortunately, we have some other scenarios where we roll back the
>>> > > position to some intermediate position, such as ProtobufLogReader.java#L381
>>> > > <https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L381>
>>> > > I think we cannot roll back the LRUCache either...
>>> > > While my cluster works fine after 26849, the fix is still theoretically
>>> > > incomplete.
>>> > >
>>> > > 张铎(Duo Zhang) <palomino...@gmail.com> wrote on Wed, Mar 16, 2022 at 17:59:
>>> > >
>>> > > > The old WAL compression implementation is buggy when used together with
>>> > > > replication, that's true...
>>> > > >
>>> > > > But in general I think it is fixable, the dict is per file IIRC, so I
>>> > > > think clearing the LRUCache when resetting to the head of the file can
>>> > > > fix the problem?
>>> > > >
>>> > > > Maybe we need to do some refactoring...
>>> > > >
>>> > > > Thanks.
>>> > > >
>>> > > > 唐天航 <tangtianhang...@gmail.com> wrote on Wed, Mar 16, 2022 at 16:20:
>>> > > >
>>> > > > > Hi masters,
>>> > > > >
>>> > > > > I have created an issue HBASE-26849
>>> > > > > <https://issues.apache.org/jira/browse/HBASE-26849> about an NPE caused
>>> > > > > by WAL Compression and Replication.
>>> > > > >
>>> > > > > For this problem, I tried reopening the WAL reader when we reset the
>>> > > > > position to 0, and it looks like it's working well. But it didn't
>>> > > > > fundamentally solve the problem.
>>> > > > >
>>> > > > > Since we have the WAL Compression feature, Replication has introduced a
>>> > > > > lot of new code, and there are many places that reset the HLog position,
>>> > > > > such as seekOnFs to originalPosition. I guess none of this code considers
>>> > > > > compatibility with WAL Compression. Theoretically we can roll back
>>> > > > > the position to any position at any time, but the LRUCache in the
>>> > > > > corresponding LRUDictionary should also be rolled back; otherwise the
>>> > > > > read and write paths may behave inconsistently. But LRUCache can't roll
>>> > > > > back at all...
>>> > > > >
>>> > > > > So my thought is: open another issue and add a note to the doc that
>>> > > > > WAL Compression and Replication are not compatible.
>>> > > > >
>>> > > > > What do you think?
>>> > > > >
>>> > > > > Thank you. Regards
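
The rollback problem discussed in this thread can be illustrated with a toy sketch. Everything below is hypothetical: `DictRollbackDemo`, its `Reader`, and the `L:`/`R:` token format are invented for illustration and are not HBase's LRUDictionary or ProtobufLogReader (the toy dictionary is append-only and has no eviction). It only shows the general effect: when a reader seeks back to an intermediate position but keeps its dictionary, the reader's dictionary no longer matches the writer's, and later references decode to the wrong value.

```java
import java.util.ArrayList;
import java.util.List;

class DictRollbackDemo {
    // Writer side: emit "L:<word>" the first time a word is seen (and add it
    // to the dictionary), or "R:<index>" when the word is already known.
    static List<String> encode(List<String> words) {
        List<String> dict = new ArrayList<>();
        List<String> out = new ArrayList<>();
        for (String w : words) {
            int idx = dict.indexOf(w);
            if (idx >= 0) {
                out.add("R:" + idx);
            } else {
                dict.add(w);
                out.add("L:" + w);
            }
        }
        return out;
    }

    // Reader side: mirrors the writer's dictionary while decoding.
    static class Reader {
        private final List<String> stream;
        private final List<String> dict = new ArrayList<>();
        private int pos = 0;

        Reader(List<String> stream) { this.stream = stream; }

        // Moves the read position only; the dictionary is NOT rolled back.
        void seek(int newPos) { pos = newPos; }

        boolean hasNext() { return pos < stream.size(); }

        String next() {
            String tok = stream.get(pos++);
            String payload = tok.substring(2);
            if (tok.startsWith("L:")) {
                dict.add(payload);
                return payload;
            }
            return dict.get(Integer.parseInt(payload));
        }
    }

    // Read `readCount` tokens, seek back to `rollbackTo`, then read to the end
    // with the same (now polluted) dictionary.
    static List<String> decodeWithRollback(List<String> stream, int readCount, int rollbackTo) {
        Reader r = new Reader(stream);
        for (int i = 0; i < readCount; i++) r.next();
        r.seek(rollbackTo);
        List<String> out = new ArrayList<>();
        while (r.hasNext()) out.add(r.next());
        return out;
    }

    // Recreate the reader (clean dictionary) and read from the head.
    static List<String> decodeFresh(List<String> stream) {
        Reader r = new Reader(stream);
        List<String> out = new ArrayList<>();
        while (r.hasNext()) out.add(r.next());
        return out;
    }

    public static void main(String[] args) {
        List<String> stream = encode(List.of("a", "b", "a", "c", "c"));
        System.out.println(stream);                           // [L:a, L:b, R:0, L:c, R:2]
        System.out.println(decodeFresh(stream));              // [a, b, a, c, c]
        // Re-reading from position 1 should yield [b, a, c, c], but the
        // duplicate "b" in the dictionary shifts the index of "c":
        System.out.println(decodeWithRollback(stream, 2, 1)); // [b, a, c, b]
    }
}
```

In this toy model, retrying from the same intermediate position can never succeed because each retry adds more stale entries, while recreating the reader gives a clean dictionary again. With other inputs the stale dictionary can also produce an out-of-range index instead of a silently wrong value.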