No problem, I will read the details of HBASE-26843 tomorrow. I think you could try HBASE-26849, but my suggestion is still to disable WAL Compression if you need to use replication...
Xiaolin Ha <summer.he...@gmail.com> wrote on Wed, Mar 16, 2022 at 22:40:

> We have also used replication + WAL compression in our production clusters.
> It really seems unstable when replicating compressed cells.
> I'm digging into the log queue overstock issues, and pointed out one in
> issue HBASE-26843, which might be related to HBASE-26849.
> We can discuss more details if you are interested in this problem.
> Thanks.
>
> Regards
>
> 张铎(Duo Zhang) <palomino...@gmail.com> wrote on Wed, Mar 16, 2022 at 18:36:
>
> > +1 on updating the doc first. You can file an issue for the documentation
> > change, and let's also send a NOTICE email to both the dev and user lists to
> > warn our users about this.
> >
> > 唐天航 <tangtianhang...@gmail.com> wrote on Wed, Mar 16, 2022 at 18:08:
> >
> > > If we only reset the position to the head, yes, we can fix it.
> > > In fact, 26849 is meant to fix the problem in this scenario.
> > > But unfortunately, we have some other scenarios where we roll back the
> > > position to some intermediate position, such as ProtobufLogReader.java#L381
> > > <https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L381>
> > >
> > > I think we cannot roll back the LRUCache either...
> > > While my cluster works fine after 26849, the fix is still theoretically
> > > incomplete.
> > >
> > > 张铎(Duo Zhang) <palomino...@gmail.com> wrote on Wed, Mar 16, 2022 at 17:59:
> > >
> > > > The old WAL compression implementation is buggy when used together with
> > > > replication, that's true...
> > > >
> > > > But in general I think it is fixable. The dict is per file IIRC, so I think
> > > > clearing the LRUCache when resetting to the head of the file can fix the
> > > > problem?
> > > >
> > > > Maybe we need to do some refactoring...
> > > >
> > > > Thanks.
> > > >
> > > > 唐天航 <tangtianhang...@gmail.com> wrote on Wed, Mar 16, 2022 at 16:20:
> > > >
> > > > > Hi masters,
> > > > >
> > > > > I have created an issue, HBASE-26849
> > > > > <https://issues.apache.org/jira/browse/HBASE-26849>, about an NPE caused by
> > > > > WAL Compression and Replication.
> > > > >
> > > > > For this problem, I tried reopening the WAL reader when we reset the position
> > > > > to 0, and it looks like it's working well. But it didn't fundamentally solve
> > > > > the problem.
> > > > >
> > > > > Since the WAL Compression feature was added, Replication has introduced a lot
> > > > > of new code, and there are many places that reset the HLog position, such
> > > > > as seekOnFs to originalPosition. I guess none of this code considers
> > > > > compatibility with WAL Compression. Theoretically we can roll back
> > > > > the position to any position at any time, but the LRUCache in the
> > > > > corresponding LRUDictionary should also be rolled back, otherwise the read
> > > > > and write paths may behave inconsistently. But LRUCache can't roll back
> > > > > at all...
> > > > >
> > > > > So my thought is to open another issue and add a note to the doc that
> > > > > WAL Compression and Replication are not compatible.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Thank you. Regards
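
The rollback problem described in the original message can be illustrated with a toy model. The sketch below is not HBase code: `ToyDict`, `Entry`, `encode`, and `decode` are hypothetical names, and round-robin slot reuse stands in for the eviction done by the real LRUDictionary. It only aims to show why a reader that seeks back to an intermediate position while keeping its already-mutated dictionary can disagree with the writer, while a reset to the head with a fresh dictionary decodes correctly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WalDictRewindSketch {

    /** Toy bounded dictionary. Round-robin slot reuse is an assumed stand-in for LRU eviction. */
    static final class ToyDict {
        private final String[] slots;
        private int next = 0;

        ToyDict(int capacity) { slots = new String[capacity]; }

        int find(String value) {
            for (int i = 0; i < slots.length; i++) {
                if (value.equals(slots[i])) return i;
            }
            return -1;
        }

        void add(String value) {
            slots[next] = value;               // overwrites (evicts) the previous entry
            next = (next + 1) % slots.length;
        }

        String get(int index) { return slots[index]; }
    }

    /** One encoded cell: a literal (index == -1) or a reference into the dictionary. */
    static final class Entry {
        final int index;
        final String literal;
        Entry(int index, String literal) { this.index = index; this.literal = literal; }
    }

    /** Writer side: emit a reference if the value is cached, otherwise a literal. */
    static List<Entry> encode(List<String> values, ToyDict dict) {
        List<Entry> out = new ArrayList<>();
        for (String v : values) {
            int idx = dict.find(v);
            if (idx >= 0) {
                out.add(new Entry(idx, null));
            } else {
                dict.add(v);
                out.add(new Entry(-1, v));
            }
        }
        return out;
    }

    /** Reader side: decode entries [from, to), reading and mutating the given dictionary. */
    static List<String> decode(List<Entry> log, int from, int to, ToyDict dict) {
        List<String> out = new ArrayList<>();
        for (int i = from; i < to; i++) {
            Entry e = log.get(i);
            if (e.index >= 0) {
                out.add(dict.get(e.index));
            } else {
                dict.add(e.literal);
                out.add(e.literal);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> log = encode(Arrays.asList("a", "b", "a", "c", "b"), new ToyDict(2));

        ToyDict readerDict = new ToyDict(2);
        System.out.println(decode(log, 0, 4, readerDict));      // [a, b, a, c]
        // Seek back to entry 2 but keep the dictionary as it is now.
        System.out.println(decode(log, 2, 5, readerDict));      // [c, c, c], expected [a, c, b]
        // Reset to the head and start over with a fresh dictionary.
        System.out.println(decode(log, 0, 5, new ToyDict(2)));  // [a, b, a, c, b]
    }
}
```

In this toy model the stale dictionary yields wrong values rather than an NPE, but the cause is the same kind of reader/writer state divergence: the dictionary state at entry 2 cannot be recovered without replaying from the head of the file.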