Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

唐天航 Wed, 16 Mar 2022 20:07:12 -0700

BTW, I think this issue might also be helpful for HBASE-15983.

唐天航 <tangtianhang...@gmail.com> wrote on Thu, Mar 17, 2022 at 10:33:


> Hi Xiaolin,
>
> I have just read HBASE-26843.
>
> If you use WAL Compression, I think this issue might be related to
> HBASE-26849.
> When LRUDictionary throws a NullPointerException or
> IndexOutOfBoundsException, ProtobufLogReader wraps it as an EOFException,
> rolls back the position, and tries again.
> But the core problem is that your LRUCache has been polluted, so retrying
> will not help; the log queue backlog will only keep growing.
> So you could also try throwing the original exception directly, so that
> ReplicationSourceWALReaderThread catches a WALEntryStreamRuntimeException
> and retries. At that layer the reader is recreated (so you get a clean
> dict), and you may then find that the log backlog is gone.
>
>
> 唐天航 <tangtianhang...@gmail.com> wrote on Wed, Mar 16, 2022 at 23:10:
>
>> No problem, I will read the details of HBASE-26843 tomorrow.
>> I think you could try HBASE-26849, but my suggestion is still to
>> disable WAL Compression if you need to use replication...
>>
>> Xiaolin Ha <summer.he...@gmail.com> wrote on Wed, Mar 16, 2022 at 22:40:
>>
>>> We also use replication with WAL compression in our production clusters.
>>> Replication really does seem unstable when replicating compressed cells.
>>> I'm digging into the log queue backlog issues and have described one in
>>> HBASE-26843, which might be related to HBASE-26849.
>>> We can discuss more details if you are interested in this problem.
>>> Thanks.
>>>
>>> Regards
>>>
>>> 张铎(Duo Zhang) <palomino...@gmail.com> wrote on Wed, Mar 16, 2022 at 18:36:
>>>
>>> > +1 on updating the doc first. You can file an issue for the documentation
>>> > change, and let's also send a NOTICE email to both the dev and user
>>> > lists to warn our users about this.
>>> >
>>> > 唐天航 <tangtianhang...@gmail.com> wrote on Wed, Mar 16, 2022 at 18:08:
>>> >
>>> > > If we only reset the position to the head, yes, we can fix it.
>>> > > In fact, HBASE-26849 fixes the problem in exactly this scenario.
>>> > > But unfortunately, there are other scenarios where we roll back the
>>> > > position to some intermediate position, such as
>>> > > ProtobufLogReader.java#L381
>>> > > <https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L381>
>>> > >
>>> > > I don't think we can roll back the LRUCache along with it...
>>> > > While my cluster works fine after HBASE-26849, the fix is still
>>> > > theoretically incomplete.
>>> > >
>>> > > 张铎(Duo Zhang) <palomino...@gmail.com> wrote on Wed, Mar 16, 2022 at 17:59:
>>> > >
>>> > > > The old WAL compression implementation is buggy when used together
>>> > > > with replication, that's true...
>>> > > >
>>> > > > But in general I think it is fixable. The dict is per file, IIRC, so
>>> > > > I think clearing the LRUCache when resetting to the head of the file
>>> > > > can fix the problem?
>>> > > >
>>> > > > Maybe we need to do some refactoring...
>>> > > >
>>> > > > Thanks.
>>> > > >
>>> > > > 唐天航 <tangtianhang...@gmail.com> wrote on Wed, Mar 16, 2022 at 16:20:
>>> > > >
>>> > > > > Hi masters,
>>> > > > >
>>> > > > > I have created an issue, HBASE-26849
>>> > > > > <https://issues.apache.org/jira/browse/HBASE-26849>, about an NPE
>>> > > > > caused by WAL Compression and Replication.
>>> > > > >
>>> > > > > For this problem, I tried reopening the WAL reader whenever we
>>> > > > > reset the position to 0, and it seems to work well. But it doesn't
>>> > > > > fundamentally solve the problem.
>>> > > > >
>>> > > > > Since the WAL Compression feature was added, Replication has
>>> > > > > introduced a lot of new code, and there are many places that reset
>>> > > > > the HLog position, such as seekOnFs to originalPosition. I guess
>>> > > > > none of this code considers compatibility with WAL Compression.
>>> > > > > In theory we can roll the position back to any point at any time,
>>> > > > > but then the LRUCache in the corresponding LRUDictionary must be
>>> > > > > rolled back as well; otherwise the read path and the write path
>>> > > > > may behave inconsistently. But the LRUCache can't be rolled back
>>> > > > > at all...
>>> > > > >
>>> > > > > So my suggestion is to open another issue and add a note to the
>>> > > > > doc stating that WAL Compression and Replication are not
>>> > > > > compatible.
>>> > > > >
>>> > > > > What do you think?
>>> > > > >
>>> > > > > Thank you. Regards
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
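
The rollback problem discussed in this thread can be illustrated with a toy dictionary codec. This is only a sketch under stated assumptions, not HBase's LRUDictionary or ProtobufLogReader: every name below (DictRollbackSketch, Token, Reader, readWithRollback) is hypothetical, and the dictionary is an unbounded list with no LRU eviction to keep the example short. It shows three cases: seeking back to the head with a stale dictionary, seeking back to the head with a fresh dictionary, and seeking to an intermediate position with a fresh dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these classes are HBase APIs.
final class DictRollbackSketch {

    /** A token is either a literal (ref == -1) or an index into the dictionary. */
    static final class Token {
        final String literal;
        final int ref;
        Token(String literal, int ref) { this.literal = literal; this.ref = ref; }
    }

    /** Writer side: a first occurrence is a literal, repeats become dictionary indexes. */
    static List<Token> encode(List<String> values) {
        List<String> dict = new ArrayList<>();
        List<Token> out = new ArrayList<>();
        for (String v : values) {
            int idx = dict.indexOf(v);
            if (idx >= 0) {
                out.add(new Token(null, idx));
            } else {
                dict.add(v);
                out.add(new Token(v, -1));
            }
        }
        return out;
    }

    /** Reader side: rebuilds the dictionary as a side effect of reading. */
    static final class Reader {
        private final List<Token> tokens;
        private final List<String> dict = new ArrayList<>();
        private int pos = 0;

        Reader(List<Token> tokens) { this.tokens = tokens; }

        String next() {
            Token t = tokens.get(pos++);
            if (t.literal != null) {
                dict.add(t.literal);   // every literal read grows the dictionary
                return t.literal;
            }
            return dict.get(t.ref);    // wrong value or exception if dict is out of sync
        }

        /** Moves the read position; the dictionary is cleared only on request. */
        void seek(int newPos, boolean clearDict) {
            pos = newPos;
            if (clearDict) {
                dict.clear();
            }
        }

        List<String> readToEnd() {
            List<String> out = new ArrayList<>();
            while (pos < tokens.size()) {
                out.add(next());
            }
            return out;
        }
    }

    /** Reads readFirst tokens, seeks back to seekTo, then reads to the end. */
    static List<String> readWithRollback(List<Token> tokens, int readFirst,
                                         int seekTo, boolean clearDict) {
        Reader r = new Reader(tokens);
        for (int i = 0; i < readFirst; i++) {
            r.next();
        }
        r.seek(seekTo, clearDict);
        return r.readToEnd();
    }

    public static void main(String[] args) {
        List<Token> wal = encode(Arrays.asList("a", "b", "a", "c", "c"));
        // Head reset, stale dictionary: prints [a, b, a, c, a] (last entry is wrong).
        System.out.println(readWithRollback(wal, 3, 0, false));
        // Head reset, fresh dictionary: prints [a, b, a, c, c].
        System.out.println(readWithRollback(wal, 3, 0, true));
        // Intermediate reset, fresh dictionary: the entries from the skipped prefix are missing.
        try {
            System.out.println(readWithRollback(wal, 3, 1, true));
        } catch (IndexOutOfBoundsException e) {
            System.out.println("intermediate rollback failed: dictionary out of sync");
        }
    }
}
```

In this toy model, clearing the dictionary (or recreating the reader) is enough when the position goes back to the head, but a rollback to an intermediate position would need the dictionary state as it was at that position, which the reader does not keep.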
