[DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-16 Thread 唐天航
Hi masters, I have created an issue, HBASE-26849, about an NPE caused by WAL Compression and Replication. For this problem, I tried reopening the WAL reader whenever we reset the position to 0, and it seems to work well. But it didn't fundamentally

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-16 Thread Duo Zhang
The old WAL compression implementation is buggy when used together with replication, that's true. But in general I think it is fixable: the dictionary is per file IIRC, so clearing the LRUCache when resetting to the head of the file should fix the problem? Maybe we need to do some refactoring...
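A minimal sketch of the point above, assuming a toy per-file dictionary in which each new value takes the next index. `ToyDictionary`, `ToyReader`, and `resetToHead` are hypothetical names for illustration and are not HBase APIs. It only covers the reset-to-head case, and it shows why a rewind without clearing the dictionary can resolve an index to the wrong value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these types exist in HBase.
final class DictionaryResetSketch {

    /** Toy per-file dictionary: each new value gets the next index, in arrival order. */
    static final class ToyDictionary {
        private final List<String> entries = new ArrayList<>();
        void add(String value) { entries.add(value); }
        String get(int index) { return entries.get(index); }
        void clear() { entries.clear(); }
    }

    /** A record either carries a literal (added to the dictionary) or an index into it. */
    static final class Record {
        final String literal;
        final int index;
        private Record(String literal, int index) { this.literal = literal; this.index = index; }
        static Record literal(String value) { return new Record(value, -1); }
        static Record reference(int index) { return new Record(null, index); }
    }

    static final class ToyReader {
        private final List<Record> file;
        private final ToyDictionary dictionary = new ToyDictionary();
        private int position = 0;

        ToyReader(List<Record> file) { this.file = file; }

        String next() {
            Record record = file.get(position++);
            if (record.literal != null) {
                dictionary.add(record.literal);
                return record.literal;
            }
            return dictionary.get(record.index);
        }

        void resetToHead(boolean clearDictionary) {
            position = 0;
            if (clearDictionary) {
                dictionary.clear();
            }
        }
    }

    /** Reads one record, rewinds to the head, then decodes all three records and returns the last. */
    static String decodeAfterRewind(boolean clearDictionary) {
        List<Record> file = Arrays.asList(
            Record.literal("row-a"), Record.literal("row-b"), Record.reference(1));
        ToyReader reader = new ToyReader(file);
        reader.next();                       // partial read: dictionary now holds "row-a"
        reader.resetToHead(clearDictionary); // rewind to the start of the file
        reader.next();
        reader.next();
        return reader.next();                // the reference to index 1
    }

    public static void main(String[] args) {
        System.out.println("cleared:     " + decodeAfterRewind(true));
        System.out.println("not cleared: " + decodeAfterRewind(false));
    }
}
```

In this toy model the stale entry left over from the partial read shifts every later index by one, so the reference decodes to the wrong literal unless the dictionary is cleared on rewind.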

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-16 Thread 唐天航
If we only reset the position to the head, yes, we can fix it. In fact, HBASE-26849 fixes the problem in exactly this scenario. But unfortunately, there are other scenarios where we roll back the position to some intermediate position, such as ProtobufLogReader.java#L381

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-16 Thread 唐天航
> Maybe we need to do some refactoring...
I cannot agree more. But before that, I think we should point out this compatibility issue explicitly in our doc. 唐天航 wrote on Wed, Mar 16, 2022 at 18:08: > If we only reset the position to the head, yes we can fix it. > In fact, 26849 is to fix the problem i

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-16 Thread Duo Zhang
+1 on updating the doc first. You can file an issue for the documentation change, and let's also send a NOTICE email to both the dev and user lists to warn our users about this. 唐天航 wrote on Wed, Mar 16, 2022 at 18:08: > If we only reset the position to the head, yes we can fix it. > In fact, 26849 is to fix the prob

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-16 Thread Xiaolin Ha
We have also used replication with WAL compression in our production clusters. It really seems unstable when replicating compressed cells. I'm digging into the log queue backlog issues and reported one in HBASE-26843, which might be related to HBASE-26849. We can discuss more details if you

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-16 Thread 唐天航
No problem, I will read the details of HBASE-26843 tomorrow. I think you could try HBASE-26849, but my suggestion is still to disable WAL Compression if you need to use replication... Xiaolin Ha wrote on Wed, Mar 16, 2022 at 22:40: > We have also used replication+wal compression in our production clust

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-16 Thread Andrew Purtell
When I did the WAL value compression work (HBASE-25869) I tested the resulting code with an integration scenario with replication active. I initially had an issue with unrecoverable errors during rewind, but it was due to an error in my implementation and was corrected (HBASE-25994). After that the

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-16 Thread Andrew Purtell
I wish email had an edit button. Let me rephrase: I think what we need is an intermediate buffer that the reader fills with the WALEdit contents. We read from the WAL stream into the buffer. If the reader is reset, the buffer is reset. Once the reader fully reads in a WALEdit, it would pass the com
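A rough sketch of the buffering idea described above, under stated assumptions: `EditBuffer`, `append`, `reset`, and `takeIfComplete` are hypothetical names, cells are modeled as plain strings, and the number of cells in an edit is assumed to be known up front. It is not taken from the HBase code base and says nothing about how the dictionary itself would be updated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; none of these types exist in HBase.
final class BufferedEditSketch {

    /** Collects the cells of one edit and releases them only once all of them have arrived. */
    static final class EditBuffer {
        private final int expectedCells;
        private final List<String> cells = new ArrayList<>();

        EditBuffer(int expectedCells) { this.expectedCells = expectedCells; }

        /** Called for every cell read from the stream. */
        void append(String cell) { cells.add(cell); }

        /** Called when the reader is reset: partial content is dropped. */
        void reset() { cells.clear(); }

        /** Returns the full edit if every expected cell is buffered, otherwise empty. */
        Optional<List<String>> takeIfComplete() {
            if (cells.size() < expectedCells) {
                return Optional.empty();
            }
            List<String> complete = new ArrayList<>(cells);
            cells.clear();
            return Optional.of(complete);
        }
    }

    /** Simulates a reader that is reset after two of three cells, then re-reads the whole edit. */
    static List<String> readWithOneReset() {
        EditBuffer buffer = new EditBuffer(3);
        buffer.append("cell-1");
        buffer.append("cell-2");
        buffer.reset(); // reader was reset before the edit was complete
        for (String cell : Arrays.asList("cell-1", "cell-2", "cell-3")) {
            buffer.append(cell);
        }
        return buffer.takeIfComplete().orElse(null);
    }

    public static void main(String[] args) {
        System.out.println(readWithOneReset());
    }
}
```

The point of the design is that nothing downstream sees a half-read edit, so a reset only has to discard the buffer.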

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-16 Thread 唐天航
Hi Xiaolin, I have read HBASE-26843 just now. If you use WAL Compression, I think this issue might be related to HBASE-26849. When you get a NullPointerException or IndexOutOfBoundsException from LRUDictionary, ProtobufLogReader will wrap it as an EOFException and try to roll back the position an
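To illustrate the wrapping behavior mentioned above, here is a small sketch, not the actual ProtobufLogReader code: a dictionary lookup failure is rethrown as an `EOFException`, with the original exception kept as the cause so that a caller could tell it apart from a genuine end of file. `lookup` and `classify` are hypothetical helpers, and the dictionary is modeled as a plain list.

```java
import java.io.EOFException;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not how HBase structures its reader.
final class WrappedEofSketch {

    /** Resolves an index against the dictionary, rethrowing lookup failures as EOFException. */
    static String lookup(List<String> dictionary, int index) throws EOFException {
        try {
            return dictionary.get(index);
        } catch (NullPointerException | IndexOutOfBoundsException e) {
            // EOFException has no (message, cause) constructor, so attach the cause explicitly.
            EOFException eof = new EOFException("dictionary lookup failed at index " + index);
            eof.initCause(e);
            throw eof;
        }
    }

    /** Reports whether a lookup succeeded or failed because of the dictionary. */
    static String classify(List<String> dictionary, int index) {
        try {
            return "value:" + lookup(dictionary, index);
        } catch (EOFException e) {
            return e.getCause() instanceof RuntimeException ? "dictionary-failure" : "end-of-file";
        }
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("row-a", "row-b");
        System.out.println(classify(dictionary, 1));
        System.out.println(classify(dictionary, 5));
    }
}
```

Without the preserved cause, a caller that only sees an `EOFException` cannot distinguish a dictionary that is out of sync from a file that simply has no more data yet.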

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-16 Thread 唐天航
BTW, I think this issue might also be helpful for HBASE-15983. 唐天航 wrote on Thu, Mar 17, 2022 at 10:33: > Hi xiaolin, > > I have read HBASE-26843 just now. > > If you use WAL Compression, I think this issue might be related to > HBASE-26849. > When you get a NullPointerException or IndexOutOfBoundsException f

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-16 Thread 唐天航
Hi Andrew, Thank you for sharing your detailed views. Originally, my idea was to attempt a complete fix for this issue based on the existing implementation after working through this phase. But as we discussed before, LRUCache cannot be rolled back, so this may come at the cost of performance (ca

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-17 Thread 唐天航
Hi Duo, I have submitted a PR for the doc. Please help me review it when it is convenient for you. It may need some polish. Thank you, Regards 张铎(Duo Zhang) wrote on Wed, Mar 16, 2022 at 18:36: > +1 on updating doc first. You can file an issue for the documentat

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-17 Thread Duo Zhang
I agree with Andrew that without a 'versioned' LRUCache, it is not easy to implement things correctly. And yes, it will impact performance if we implement the 'versioned' logic, for example, using a buffer. But considering the real scenario, we do not always need to support rolling back the LRUCache. When

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-17 Thread 唐天航
If we can accept the performance degradation in the Replication process, then I have a solution that can completely solve this problem based on the existing implementation. See ReaderBase#seek: @Override public void seek(long pos) throws IOException { if (compressionContext != null && emptyCompr

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-17 Thread Duo Zhang
So I think the first step is to introduce different implementations for different scenarios. And for replication, we could use your above approach to solve the problem first, and then we could try to use Andrew's approach to optimize. WDYT? 唐天航 wrote on Fri, Mar 18, 2022 at 11:50: > If we can accept the per

Re: [DISCUSS] HBASE-26849 NPE caused by WAL Compression and Replication

2022-03-17 Thread 唐天航
Agreed. Then I will modify HBASE-26849 according to this plan, and if it goes well, it will be submitted to branch-1 first next week. Also, if Andrew doesn't mind, I'd be more than happy to be involved in the follow-up rework. Maybe we can discuss the details later. Thank you. 张铎(Duo Zhang) on