[ https://issues.apache.org/jira/browse/HBASE-22539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856060#comment-16856060 ]
Wellington Chevreuil commented on HBASE-22539:
----------------------------------------------

{quote}Oh, good, this is a nasty bug and seems we are about to reach the root cause. So the solution is to switch back from NettyRpcServer to SimpleRpcServer? So we will not pass cells with DBB?
{quote}
Yeah, that was the immediate workaround in this customer cluster.
{quote}And could you reproduce the problem?
{quote}
We couldn't, unfortunately. Not on any of our test clusters, nor even on the customer's staging environment. This is currently only seen in their production cluster, consistently, once we switch back to Netty.
{quote}Could you please try moving the copy of the buf before the this.os.write(this.buf, 0, bytesToCopy);? It is also possible that something wrong inside the OutputStream implementation where changes the array?
{quote}
So you suspect the *FSDataOutputStream* implementation might be changing the source *this.buf*? I guess that would still cause this type of corruption even when not using Netty, right?
{quote}And in general, both SimpleRpcServer and NettyRpcServer will use DBB, the difference is that, in SimpleRpcServer, the DBB is allocated by our own while in NettyRpcServer, it is a netty ByteBuf...
{quote}
Indeed, although there are conditions where a BB will be allocated [here|https://github.com/wchevreuil/hbase/blob/HBASE-22539/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleServerRpcConnection.java#L269], I guess most of the time we'll be reaching [this condition|https://github.com/wchevreuil/hbase/blob/HBASE-22539/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleServerRpcConnection.java#L271], which gives us a DBB, as you mentioned. I'm not sure how differently Netty allocates a DBB instance, to the point that an *Unsafe.copyMemory* call could damage the copy.
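To make the suggested check concrete, here is a minimal sketch of taking the copy of the buffer before the write and comparing afterwards. It is not the actual ByteBufferWriterOutputStream code: the class name, method name, and exception messages are hypothetical, and it uses a plain ByteBuffer bulk get rather than the Unsafe-based copy in ByteBufferUtils.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical debugging helper; an illustration only, not HBase code. */
final class DirectBufferCopyCheck {

  /**
   * Writes len bytes of src (heap or direct), starting at absolute offset off,
   * to os in chunks that go through an on-heap array. Each chunk is snapshotted
   * BEFORE the write, so a stream that mutates the array can be detected.
   */
  static void writeChecked(OutputStream os, ByteBuffer src, int off, int len, int chunkSize)
      throws IOException {
    byte[] buf = new byte[chunkSize];
    // duplicate() shares the bytes but has its own position and limit.
    ByteBuffer view = src.duplicate();
    view.clear(); // position = 0, limit = capacity; the contents are untouched
    int pos = off;
    int remaining = len;
    while (remaining > 0) {
      int bytesToCopy = Math.min(remaining, chunkSize);
      view.position(pos);
      view.get(buf, 0, bytesToCopy); // off-heap (or heap) -> on-heap copy

      // Snapshot taken before handing buf to the stream.
      byte[] snapshot = Arrays.copyOf(buf, bytesToCopy);
      os.write(buf, 0, bytesToCopy);

      // If this fails, the OutputStream implementation changed the array.
      if (!Arrays.equals(snapshot, Arrays.copyOf(buf, bytesToCopy))) {
        throw new IllegalStateException("buf was modified during write at offset " + pos);
      }
      pos += bytesToCopy;
      remaining -= bytesToCopy;
    }
  }
}
```

If the exception never fires on the affected cluster, that would point away from the stream changing the array and back toward the copy out of the DBB itself.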
> Potential WAL corruption due to Unsafe.copyMemory usage when DBB are in place
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-22539
>                 URL: https://issues.apache.org/jira/browse/HBASE-22539
>             Project: HBase
>          Issue Type: Bug
>          Components: rpc, wal
>    Affects Versions: 2.1.1
>            Reporter: Wellington Chevreuil
>            Priority: Blocker
>
> Summary
> We had been chasing a WAL corruption issue reported on one of our customers'
> deployments running release 2.1.1 (CDH 6.1.0). After providing a custom
> modified jar with the extra sanity checks implemented by HBASE-21401 applied
> on some code points, plus additional debugging messages, we believe it is
> related to DirectByteBuffer usage, and Unsafe copy from offheap memory to
> on-heap array triggered
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java#L1157],
> such as when writing into a non ByteBufferWriter type, as done
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/io/ByteBufferWriterOutputStream.java#L84].
> More details on the following comment.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)