[ https://issues.apache.org/jira/browse/HBASE-22539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang updated HBASE-22539:
------------------------------
    Release Note:
We found a critical bug which can lead to WAL corruption when Durability.ASYNC_WAL is used. The cause is that we release a ByteBuffer before its content has actually been persisted to the WAL file.

The problem may lead to several errors, for example an ArrayIndexOutOfBoundsException when replaying the WAL, because the released ByteBuffer is reused by others:

ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
java.lang.ArrayIndexOutOfBoundsException: 18056
    at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
    at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
    at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
    at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
    at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
    at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
    at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
    at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)

It may even cause a segmentation fault and crash the JVM directly. In that case you will see an hs_err_pidXXX.log file, and the reported problem is usually SIGSEGV.

The problem has been reported several times in the past. This time Wellington Ramos Chevreuil provided the full logs and analyzed them in depth, which let us find the root cause, and Lijin Bin figured out that the problem may only happen when Durability.ASYNC_WAL is used. Thanks to them.

The problem only affects the 2.x releases. All users are strongly recommended to upgrade to a release that includes this fix, especially if you use Durability.ASYNC_WAL.
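The failure mode described in the release note can be illustrated with a minimal, self-contained sketch. This is not HBase code: the class EarlyReleaseSketch, the one-slot BufferPool, and the cell payloads are all hypothetical stand-ins, and the asynchronous WAL append is modelled simply as a read that happens later. It only shows why returning a pooled direct buffer before the pending write has consumed it lets another request overwrite the bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

final class EarlyReleaseSketch {

    /** Hypothetical pool standing in for the reusable direct buffers of the RPC layer. */
    static final class BufferPool {
        private final Deque<ByteBuffer> free = new ArrayDeque<>();

        BufferPool(int count, int capacity) {
            for (int i = 0; i < count; i++) {
                free.push(ByteBuffer.allocateDirect(capacity));
            }
        }

        ByteBuffer acquire() {
            ByteBuffer b = free.pop();
            b.clear(); // reset position/limit; the old bytes stay in memory
            return b;
        }

        void release(ByteBuffer b) {
            free.push(b);
        }
    }

    /** Writes the payload into the buffer and prepares it for reading. */
    static ByteBuffer fill(ByteBuffer buf, String payload) {
        buf.put(payload.getBytes(StandardCharsets.UTF_8));
        buf.flip();
        return buf;
    }

    /** Stands in for "persist to the WAL": copies the bytes the view currently covers. */
    static String read(ByteBuffer view) {
        byte[] out = new byte[view.remaining()];
        view.duplicate().get(out);
        return new String(out, StandardCharsets.UTF_8);
    }

    /** Buggy order: the buffer is released and reused before the pending write runs. */
    static String releaseBeforePersist() {
        BufferPool pool = new BufferPool(1, 64);
        ByteBuffer edit = fill(pool.acquire(), "row1/cf:a=1");
        ByteBuffer pending = edit.duplicate();  // queued write still points at the same memory
        pool.release(edit);                     // released too early
        fill(pool.acquire(), "row2/cf:b=9");    // another request reuses the buffer
        return read(pending);                   // deferred write now sees foreign bytes
    }

    /** Safe order: the pending write finishes before the buffer goes back to the pool. */
    static String persistBeforeRelease() {
        BufferPool pool = new BufferPool(1, 64);
        ByteBuffer edit = fill(pool.acquire(), "row1/cf:a=1");
        ByteBuffer pending = edit.duplicate();
        String persisted = read(pending);       // write completes first
        pool.release(edit);
        fill(pool.acquire(), "row2/cf:b=9");
        return persisted;
    }

    public static void main(String[] args) {
        System.out.println("release before persist: " + releaseBeforePersist());
        System.out.println("persist before release: " + persistBeforeRelease());
    }
}
```

In this sketch the first method returns the second request's bytes instead of the original edit, while the second returns the original edit intact. In a real WAL the overwritten bytes would include length fields, which is consistent with the out-of-bounds index seen during replay.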
> WAL corruption due to early DBBs re-use when Durability.ASYNC_WAL is used
> -------------------------------------------------------------------------
>
>                 Key: HBASE-22539
>                 URL: https://issues.apache.org/jira/browse/HBASE-22539
>             Project: HBase
>          Issue Type: Bug
>          Components: rpc, wal
>   Affects Versions: 2.2.0, 2.0.5, 2.1.5
>            Reporter: Wellington Chevreuil
>            Assignee: Duo Zhang
>            Priority: Blocker
>             Fix For: 3.0.0, 2.3.0, 2.0.6, 2.2.1, 2.1.6
>
>         Attachments: HBASE-22539-UT.patch, HBASE-22539.branch-2.001.patch
>
>
> Summary
> We had been chasing a WAL corruption issue reported on one of our customer's
> deployments running release 2.1.1 (CDH 6.1.0). After providing a custom
> modified jar with the extra sanity checks implemented by HBASE-21401 applied
> at some code points, plus additional debugging messages, we believe it is
> related to DirectByteBuffer usage, and to the Unsafe copy from off-heap memory
> to an on-heap array triggered
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java#L1157],
> such as when writing into a non ByteBufferWriter type, as done
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/io/ByteBufferWriterOutputStream.java#L84].
> More details in the following comment.
>

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)