Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

唐天航 Wed, 16 Mar 2022 08:10:43 -0700

No problem, I will read the details of HBASE-26843 tomorrow.
And I think you could try the HBASE-26849 fix, but all in all, my suggestion is
to disable WAL Compression if you need to use replication...


Xiaolin Ha <[email protected]> wrote on Wed, Mar 16, 2022 at 22:40:

> We have also used replication+wal compression in our production clusters.
> It really seems unstable when replicating compressed cells.
> I'm digging into the log queue backlog issues, and pointed one out in
> HBASE-26843, which might be related to HBASE-26849.
> We can discuss more details if you are interested in this problem.
> Thanks.
>
> Regards
>
> 张铎(Duo Zhang) <[email protected]> wrote on Wed, Mar 16, 2022 at 18:36:
>
> > +1 on updating the doc first. You can file an issue for the documentation
> > change, and let's also send a NOTICE email to both the dev and user lists
> > to warn our users about this.
> >
> > 唐天航 <[email protected]> wrote on Wed, Mar 16, 2022 at 18:08:
> >
> > > If we only reset the position to the head, yes we can fix it.
> > > In fact, HBASE-26849 fixes the problem in exactly this scenario.
> > > But unfortunately, we have some other scenarios where we roll back the
> > > position to some intermediate position, such as
> > > ProtobufLogReader.java#L381
> > > <https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L381>
> > > I think we cannot roll back the LRUCache either...
> > > While my cluster works fine after 26849, the fix is still theoretically
> > > incomplete.
> > >
> > > 张铎(Duo Zhang) <[email protected]> wrote on Wed, Mar 16, 2022 at 17:59:
> > >
> > > > The old WAL compression implementation is buggy when used together with
> > > > replication, that's true...
> > > >
> > > > But in general I think it is fixable, the dict is per file IIRC, so I
> > > > think clearing the LRUCache when resetting to the head of the file can
> > > > fix the problem?
> > > >
> > > > Maybe we need to do some refactoring...
> > > >
> > > > Thanks.
> > > >
> > > > 唐天航 <[email protected]> wrote on Wed, Mar 16, 2022 at 16:20:
> > > >
> > > > > Hi masters,
> > > > >
> > > > > I have created an issue HBASE-26849
> > > > > <https://issues.apache.org/jira/browse/HBASE-26849> about NPE caused
> > > > > by WAL Compression and Replication.
> > > > >
> > > > > For this problem, I tried reopening the WAL reader when we reset the
> > > > > position to 0, and it looks like it works well. But it doesn't
> > > > > fundamentally solve the problem.
> > > > >
> > > > > Since the WAL Compression feature was introduced, Replication has
> > > > > gained a lot of new code, and there are many places that reset the
> > > > > HLog position, such as seekOnFs to originalPosition. I guess none of
> > > > > this code considers compatibility with WAL Compression.
> > > > > Theoretically we can roll the position back to any position at any
> > > > > time, but then the LRUCache in the corresponding LRUDictionary should
> > > > > also be rolled back; otherwise the read path and the write path may
> > > > > behave inconsistently. But LRUCache can't roll back at all...
> > > > >
> > > > > So my thought is to open another issue and add a note to the doc
> > > > > that WAL Compression and Replication are not compatible.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Thank you. Regards
> > > > >
> > > >
> > >
> >
>
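
To illustrate the rollback problem described in the thread, here is a minimal toy sketch in Java. It is not HBase code: the names DictRewindSketch and ToyDict, the "L:"/"R:" record format, and the round-robin slot reuse (a stand-in for LRU eviction) are all assumptions made for illustration. It only shows the general idea that a reader-side dictionary is a function of everything read so far, so seeking back to an intermediate position with the dictionary left as-is can decode a reference against the wrong entry, while reopening from the head with a fresh dictionary does not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictRewindSketch {

    /** Hypothetical bounded dictionary; a full dictionary reuses slots round-robin. */
    static final class ToyDict {
        private final String[] slots;
        private final Map<String, Integer> index = new HashMap<>();
        private int next = 0;

        ToyDict(int capacity) {
            slots = new String[capacity];
        }

        /** Returns the slot holding s, or -1 if s is not in the dictionary. */
        int find(String s) {
            return index.getOrDefault(s, -1);
        }

        /** Stores s in the next slot, evicting whatever was there, and returns the slot. */
        int add(String s) {
            int slot = next;
            next = (next + 1) % slots.length;
            if (slots[slot] != null) {
                index.remove(slots[slot], slot); // drop the evicted entry's mapping
            }
            slots[slot] = s;
            index.put(s, slot);
            return slot;
        }

        String get(int slot) {
            return slots[slot];
        }
    }

    /** Writer: emits "L:value" for a new value (and learns it) or "R:slot" for a known one. */
    static List<String> write(List<String> values, int capacity) {
        ToyDict dict = new ToyDict(capacity);
        List<String> records = new ArrayList<>();
        for (String v : values) {
            int slot = dict.find(v);
            if (slot >= 0) {
                records.add("R:" + slot);
            } else {
                dict.add(v);
                records.add("L:" + v);
            }
        }
        return records;
    }

    /** Reader: decodes records starting at index 'from', mutating the given dictionary. */
    static List<String> read(List<String> records, int from, ToyDict dict) {
        List<String> out = new ArrayList<>();
        for (String rec : records.subList(from, records.size())) {
            String payload = rec.substring(2);
            if (rec.startsWith("L:")) {
                dict.add(payload); // literals rebuild the writer's dictionary
                out.add(payload);
            } else {
                out.add(dict.get(Integer.parseInt(payload))); // may be stale or null
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> wal = write(List.of("a", "b", "a", "c", "a"), 2);
        ToyDict dict = new ToyDict(2);
        System.out.println("first pass:        " + read(wal, 0, dict));
        // Seek back to record 2 but keep the dictionary state from the end of the file.
        System.out.println("rewind, same dict: " + read(wal, 2, dict));
        // Reopen: fresh dictionary, replay from the head, then skip to record 2.
        System.out.println("reopen from head:  " + read(wal, 0, new ToyDict(2)).subList(2, 5));
    }
}
```

In this toy, the rewind with the stale dictionary decodes records 2..4 as [c, c, a] instead of [a, c, a], because slot 0 was overwritten later in the file. A reference into a slot that was never filled would come back as null, which is the kind of value that can later surface as an NPE.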
