[ 
https://issues.apache.org/jira/browse/HBASE-27621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17686493#comment-17686493
 ] 

Duo Zhang commented on HBASE-27621:
-----------------------------------

Copy the comment on github issue here

{quote}
I tried to refactoring a bit but the implementation of ProtobufLogReader is too 
complicated. I think we'd better abstract two types of WAL.Reader for reading 
WAL file.
One is StreamingReader, which is used in most cases, for example, WAL 
splitting, WAL printing, etc, where we only need to read the file once and 
usually for closed WAL files. There is no need to support reset and seek.
The other is TailingReader, which is used by Replication, where we need to 
support reset and seek, and also we need to tell the upper layer whether we 
need to reset the compress context when calling reseting. The logic will be 
more complicated as we need to consider the requirements for tailing a WAL file 
which is currently being written.
The refactoring will be a bit big so I do not think we should apply it to 
branch-2.5 and branch-2.4. So let's apply the simple fix here and file another 
issue to implement the big refactoring.

Thanks.
{quote}

Let's use this issue to simply fix the issue first, which is less efficient, 
where we need to clear the compression context every time when resetting.

Thanks.

> Always use findEntry to fill the Dictionary when reading compressed WAL file
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-27621
>                 URL: https://issues.apache.org/jira/browse/HBASE-27621
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication, wal
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Critical
>             Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
>
> After trying several times, now I can reproduce a critical problem when 
> reading compressed WAL file in replication.
> The problem is about how we construct the LRUDictionary when reset the 
> WALEntryStream. In the current design, we will not reconstruct the 
> LRUDictionary when reseting, but when reading again, we will call addEntry 
> directly to add 'new' word into the dict, which will mess up the dict and 
> cause data corruption.
> I've implemented a UT to simulate reading partial WAL entry in replication, 
> with the current code base, after reseting and reading again, we will stuck 
> there for ever.
> The fix is to always use findEntry when constructing the dict when reading, 
> so we will not mess things up.
> Another possible fix is to always reconstruct the dict after reseting, we 
> will also clear the dict and reconstruct it again. But it is less efficient 
> as we need to read from the beginning to the position we want to seek to, 
> instead of seek to the position directly, especially when tailing the WAL 
> file which is currently being written.
> And notice that, the UT can only reproduce the problem in local file system, 
> on HDFS, the available method is implemented so if there is not enough data, 
> we will throw EOFException earlier before parsing cells with the compression 
> decoder, so we will not add  duplicated word to dict. But in real world, it 
> is possible that even if there are enough data to read, we could hit an 
> IOException while reading and lead to the same problem described above.
> And while fixing, I also found another problem that in TagConressionContext 
> and CompressionContext, we use the result of InputStream incorrectly, as we 
> just cast it to byte and test whether it is -1 to determine whether the field 
> is in the dict. The return value of InputStream.read is an int, and it will 
> return -1 if reaches EOF, but here we will consider it as not in dict... We 
> should throw EOFException instead.
> I'm not sure whether fix this can also fix HBASE-27073 but let's have a try 
> first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to