[jira] [Updated] (HBASE-22539) WAL corruption due to early DBBs re-use when Durability.ASYNC_WAL is used

Duo Zhang (JIRA) Mon, 05 Aug 2019 02:57:21 -0700


     [ https://issues.apache.org/jira/browse/HBASE-22539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Duo Zhang updated HBASE-22539:
------------------------------
    Release Note: 
We found a critical bug which can lead to WAL corruption when 
Durability.ASYNC_WAL is used. The cause is that we release a ByteBuffer before 
its content has actually been persisted to the WAL file.
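
As a rough illustration of this failure mode, here is a minimal, 
single-threaded sketch. It is not HBase code: the Pool class, the 
EarlyReleaseSketch class and the "row-A"/"row-B" payloads are all hypothetical, 
and the asynchronous WAL write is simulated by a Runnable that is run later. 
It only shows how returning a pooled buffer before a deferred write has 
consumed it lets a second request overwrite the bytes that get persisted.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class EarlyReleaseSketch {

    /** Hypothetical one-buffer pool standing in for a pool of reusable direct buffers. */
    static final class Pool {
        private final Deque<ByteBuffer> free = new ArrayDeque<>();

        Pool(int capacity) {
            free.push(ByteBuffer.allocateDirect(capacity));
        }

        ByteBuffer take() {
            ByteBuffer b = free.pop();
            b.clear(); // reset position/limit; the old bytes are NOT erased
            return b;
        }

        void release(ByteBuffer b) {
            free.push(b);
        }
    }

    /** Takes a buffer from the pool and fills it with the payload of one request. */
    static ByteBuffer fill(Pool pool, String payload) {
        ByteBuffer buf = pool.take();
        buf.put(payload.getBytes(StandardCharsets.UTF_8));
        buf.flip();
        return buf;
    }

    /** Copies the readable bytes of the buffer into an on-heap array. */
    static String read(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out);
        return new String(out, StandardCharsets.UTF_8);
    }

    /** Returns what the deferred (simulated async) write persists for the first request. */
    static String persisted(boolean releaseBeforeWrite) {
        Pool pool = new Pool(64);
        List<String> log = new ArrayList<>();
        ByteBuffer first = fill(pool, "row-A");
        Runnable deferredWrite = () -> log.add(read(first));

        if (releaseBeforeWrite) {
            pool.release(first);   // the bug: released before the write ran
            fill(pool, "row-B");   // a second request reuses the same memory
            deferredWrite.run();   // the write now sees the second request's bytes
        } else {
            deferredWrite.run();   // write first ...
            pool.release(first);   // ... then release
            fill(pool, "row-B");
        }
        return log.get(0);
    }

    public static void main(String[] args) {
        System.out.println("early release persisted: " + persisted(true));
        System.out.println("late release persisted:  " + persisted(false));
    }
}
```

In this sketch the early-release path persists the second request's bytes in 
place of the first request's, while releasing only after the write keeps the 
log entry intact.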

The problem may lead to several errors, for example, an 
ArrayIndexOutOfBoundsException when replaying the WAL. This is because the 
released ByteBuffer is reused by others.

ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
java.lang.ArrayIndexOutOfBoundsException: 18056
        at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
        at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
        at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
        at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
        at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
        at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)

It may even cause a segmentation fault and crash the JVM directly. In that 
case you will see a hs_err_pidXXX.log file, and the reported signal is usually 
SIGSEGV.

The problem has been reported several times in the past. This time 
Wellington Ramos Chevreuil provided the full logs and analyzed them in depth, 
so that we could find the root cause, and Lijin Bin figured out that the 
problem may only happen when Durability.ASYNC_WAL is used. Thanks to them.

The problem only affects the 2.x releases. All users are highly recommended to 
upgrade to a release which includes this fix, especially if you use 
Durability.ASYNC_WAL.

  was:
We found a critical bug which can lead to WAL corruption when 
Durability.ASYNC_WAL is used. The cause is that we release a ByteBuffer before 
its content has actually been persisted to the WAL file.

The problem may lead to several errors, for example, an 
ArrayIndexOutOfBoundsException when replaying the WAL.

ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
java.lang.ArrayIndexOutOfBoundsException: 18056
        at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
        at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
        at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
        at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
        at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
        at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)

It may even cause a segmentation fault and crash the JVM directly. In that 
case you will see a hs_err_pidXXX.log file, and the reported signal is usually 
SIGSEGV.

The problem has been reported several times in the past. This time 
Wellington Ramos Chevreuil provided the full logs and analyzed them in depth, 
so that we could find the root cause, and Lijin Bin figured out that the 
problem may only happen when Durability.ASYNC_WAL is used. Thanks to them.

The problem only affects the 2.x releases. All users are highly recommended to 
upgrade to a release which includes this fix, especially if you use 
Durability.ASYNC_WAL.


> WAL corruption due to early DBBs re-use when Durability.ASYNC_WAL is used
> -------------------------------------------------------------------------
>
>                 Key: HBASE-22539
>                 URL: https://issues.apache.org/jira/browse/HBASE-22539
>             Project: HBase
>          Issue Type: Bug
>          Components: rpc, wal
>    Affects Versions: 2.2.0, 2.0.5, 2.1.5
>            Reporter: Wellington Chevreuil
>            Assignee: Duo Zhang
>            Priority: Blocker
>             Fix For: 3.0.0, 2.3.0, 2.0.6, 2.2.1, 2.1.6
>
>         Attachments: HBASE-22539-UT.patch, HBASE-22539.branch-2.001.patch
>
>
> Summary
> We had been chasing a WAL corruption issue reported on one of our customers' 
> deployments running release 2.1.1 (CDH 6.1.0). After providing a custom 
> modified jar with the extra sanity checks implemented by HBASE-21401 applied 
> at some code points, plus additional debugging messages, we believe it is 
> related to DirectByteBuffer usage and the Unsafe copy from off-heap memory to 
> an on-heap array triggered 
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java#L1157],
>  such as when writing into a non-ByteBufferWriter type, as done 
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/io/ByteBufferWriterOutputStream.java#L84].
> More details in the following comment.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
