Hi xiaolin,

I have read HBASE-26843 just now.

If you use WAL Compression, I think this issue might be related to
HBASE-26849.
When LRUDictionary throws a NullPointerException or
IndexOutOfBoundsException, ProtobufLogReader wraps it as an EOFException,
rolls back the position, and tries again.
But the core problem is that your LRUCache has been polluted, so retrying
will not solve the problem; you will only get an ever-growing queue
overstock.
Alternatively, you can try throwing the original exception directly,
so that ReplicationSourceWALReaderThread catches a
WALEntryStreamRuntimeException and retries. At this layer the reader is
recreated (you get a clean dict), and you may then find that the log
overstock is gone.
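
To illustrate why retrying on the same reader cannot help, here is a minimal
sketch of the effect. It is a toy model, not HBase code: ToyDict, Reader,
encode, readFresh and readWithRollback are hypothetical names, and the
fixed-size dictionary only stands in for LRUDictionary. The writer replaces a
repeated word with its dictionary index; the reader rebuilds the same
dictionary while decoding. If the reader seeks back without also rolling back
its dictionary, later index lookups resolve against the wrong entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DictRollbackDemo {

    /** Toy fixed-size dictionary; a stand-in for LRUDictionary, not its real API. */
    static final class ToyDict {
        private final String[] slots;
        private int added; // number of entries added so far

        ToyDict(int capacity) { slots = new String[capacity]; }

        int find(String word) {
            for (int i = 0; i < slots.length; i++) {
                if (word.equals(slots[i])) return i;
            }
            return -1;
        }

        void add(String word) {
            slots[added % slots.length] = word; // overwrites the oldest slot when full
            added++;
        }

        String get(int index) { return slots[index]; }
    }

    /** Writer side: "+word" is a literal (added to the dict), "#i" is a dict reference. */
    public static List<String> encode(List<String> words, int capacity) {
        ToyDict dict = new ToyDict(capacity);
        List<String> tokens = new ArrayList<>();
        for (String word : words) {
            int index = dict.find(word);
            if (index >= 0) {
                tokens.add("#" + index);
            } else {
                dict.add(word);
                tokens.add("+" + word);
            }
        }
        return tokens;
    }

    /** Reader side: decoding mutates the dictionary, but seek() only moves the position. */
    static final class Reader {
        private final List<String> tokens;
        private final ToyDict dict;
        private int position;

        Reader(List<String> tokens, int capacity) {
            this.tokens = tokens;
            this.dict = new ToyDict(capacity);
        }

        boolean hasNext() { return position < tokens.size(); }

        String next() {
            String token = tokens.get(position++);
            String payload = token.substring(1);
            if (token.charAt(0) == '+') {
                dict.add(payload);
                return payload;
            }
            return dict.get(Integer.parseInt(payload));
        }

        void seek(int newPosition) { position = newPosition; } // dict is NOT rolled back
    }

    /** A brand-new reader (clean dict) reading from the head of the file. */
    public static List<String> readFresh(List<String> tokens, int capacity) {
        Reader reader = new Reader(tokens, capacity);
        List<String> out = new ArrayList<>();
        while (reader.hasNext()) out.add(reader.next());
        return out;
    }

    /** Same reader, but it reads two entries, seeks back to position 1, and retries. */
    public static List<String> readWithRollback(List<String> tokens, int capacity) {
        Reader reader = new Reader(tokens, capacity);
        List<String> out = new ArrayList<>();
        out.add(reader.next());
        reader.next();   // this entry is discarded by the retry below
        reader.seek(1);  // position rolled back, dictionary left as it was
        while (reader.hasNext()) out.add(reader.next());
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("A", "B", "A", "C", "A");
        List<String> tokens = encode(words, 3);
        System.out.println("tokens:        " + tokens);
        System.out.println("fresh reader:  " + readFresh(tokens, 3));
        System.out.println("after seek(1): " + readWithRollback(tokens, 3));
    }
}
```

In this toy the fresh reader returns A, B, A, C, A, while the rolled-back
reader returns A, B, A, C, C: the last reference resolves to the wrong entry.
Here the symptom is a silently wrong value rather than the
NullPointerException or IndexOutOfBoundsException mentioned above, but the
cause is the same: the position moved back and the dictionary did not.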


唐天航 <tangtianhang...@gmail.com> wrote on Wed, Mar 16, 2022 at 23:10:

> No problem, I will read the details of HBASE-26843 tomorrow.
> I think you could try HBASE-26849, but my suggestion is still to
> disable WAL Compression if you need to use replication...
>
> Xiaolin Ha <summer.he...@gmail.com> wrote on Wed, Mar 16, 2022 at 22:40:
>
>> We also use replication + WAL compression in our production clusters.
>> Replicating compressed cells really does seem unstable.
>> I'm digging into the log queue overstock issues and have described one in
>> HBASE-26843, which might be related to HBASE-26849.
>> We can discuss more details if you are interested in this problem.
>> Thanks.
>>
>> Regards
>>
>> 张铎(Duo Zhang) <palomino...@gmail.com> wrote on Wed, Mar 16, 2022 at 18:36:
>>
>> > +1 on updating the doc first. You can file an issue for the documentation
>> > change, and let's also send a NOTICE email to both the dev and user lists
>> > to warn our users about this.
>> >
>> > 唐天航 <tangtianhang...@gmail.com> wrote on Wed, Mar 16, 2022 at 18:08:
>> >
>> > > If we only reset the position to the head, yes we can fix it.
>> > > In fact, 26849 is to fix the problem in this scenario.
>> > > But unfortunately, we have some other scenarios where we roll back the
>> > > position to some intermediate position, such as
>> > > ProtobufLogReader.java#L381
>> > > <https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L381>.
>> > > I don't think we can roll back the LRUCache along with the position...
>> > > While my cluster works fine after 26849, the fix is still
>> > > theoretically incomplete.
>> > >
>> > > 张铎(Duo Zhang) <palomino...@gmail.com> wrote on Wed, Mar 16, 2022 at 17:59:
>> > >
>> > > > The old WAL compression implementation is buggy when used together
>> > > > with replication, that's true...
>> > > >
>> > > > But in general I think it is fixable. The dict is per file IIRC, so
>> > > > I think clearing the LRUCache when resetting to the head of the
>> > > > file can fix the problem?
>> > > >
>> > > > Maybe we need to do some refactoring...
>> > > >
>> > > > Thanks.
>> > > >
>> > > > 唐天航 <tangtianhang...@gmail.com> wrote on Wed, Mar 16, 2022 at 16:20:
>> > > >
>> > > > > Hi masters,
>> > > > >
>> > > > > I have created an issue, HBASE-26849
>> > > > > <https://issues.apache.org/jira/browse/HBASE-26849>, about an NPE
>> > > > > caused by WAL Compression and Replication.
>> > > > >
>> > > > > For this problem, I tried reopening the WAL reader when we reset
>> > > > > the position to 0, and it looks like it's working well. But it
>> > > > > doesn't fundamentally solve the problem.
>> > > > >
>> > > > > Since the WAL Compression feature was added, Replication has
>> > > > > introduced a lot of new code, and there are many places that reset
>> > > > > the HLog position, such as seekOnFs to originalPosition. I guess
>> > > > > none of this code considers compatibility with WAL Compression.
>> > > > > Theoretically we can roll back the position to any position at any
>> > > > > time, but the LRUCache in the corresponding LRUDictionary should
>> > > > > also be rolled back; otherwise the read path and the write path
>> > > > > may behave inconsistently. But the LRUCache can't roll back at
>> > > > > all...
>> > > > >
>> > > > > So my thought is to open another issue and add a note to the doc
>> > > > > saying that WAL Compression and Replication are not compatible.
>> > > > >
>> > > > > What do you think?
>> > > > >
>> > > > > Thank you. Regards
>> > > > >
>> > > >
>> > >
>> >
>>
>
