[jira] [Commented] (HBASE-22539) Potential WAL corruption due to Unsafe.copyMemory usage when DBB are in place

Wellington Chevreuil (JIRA) Tue, 04 Jun 2019 12:37:25 -0700


    [ https://issues.apache.org/jira/browse/HBASE-22539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856060#comment-16856060 ]


Wellington Chevreuil commented on HBASE-22539:
----------------------------------------------

{quote}Oh, good, this is a nasty bug and it seems we are about to reach the root
cause. So the solution is to switch back from NettyRpcServer to
SimpleRpcServer? So we will not pass cells with DBB?
{quote}
Yeah, that was the immediate workaround in this customer cluster.
{quote}And could you reproduce the problem?
{quote}
We couldn't, unfortunately: not on any of our test clusters, nor even in the
customer's staging environment. So far it has only been seen in their production
cluster, where it happens consistently whenever we switch back to Netty.
{quote}Could you please try moving the copy of the buf before the
this.os.write(this.buf, 0, bytesToCopy); call? It is also possible that something
is wrong inside the OutputStream implementation that changes the array?
{quote}
So you suspect the *FSDataOutputStream* implementation might be changing the source
*this.buf*? I guess that would still cause this type of corruption even when
not using Netty, right?
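One way to test the suspicion that the OutputStream changes the source array is to snapshot the bytes before the write and compare them after the call returns. The wrapper below is a minimal, hypothetical debugging sketch (MutationCheckingOutputStream is not an HBase or Hadoop class); it assumes the stream under suspicion can be wrapped at the point where this.os.write(this.buf, 0, bytesToCopy) is called, and it only detects changes made during the write call itself, not later or concurrent ones.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical debugging aid, not part of HBase or Hadoop.
final class MutationCheckingOutputStream extends FilterOutputStream {

  MutationCheckingOutputStream(OutputStream delegate) {
    super(delegate);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    // Snapshot the range the caller hands over, before the delegate sees it.
    byte[] before = Arrays.copyOfRange(b, off, off + len);
    out.write(b, off, len);
    // Re-read the same range once the delegate has returned.
    byte[] after = Arrays.copyOfRange(b, off, off + len);
    if (!Arrays.equals(before, after)) {
      throw new IOException("delegate modified the source array during write(b, "
          + off + ", " + len + ")");
    }
  }
}
```

If the check never fires in the failing cluster, the corruption is more likely already present in the array when it is handed to the stream, which points back at the off-heap copy rather than at the stream.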
{quote}And in general, both SimpleRpcServer and NettyRpcServer will use DBB;
the difference is that in SimpleRpcServer the DBB is allocated by ourselves,
while in NettyRpcServer it is a Netty ByteBuf...
{quote}
Indeed. Although there are conditions where a BB will be allocated
[here|https://github.com/wchevreuil/hbase/blob/HBASE-22539/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleServerRpcConnection.java#L269],
 I guess most of the time we'll be reaching [this
condition|https://github.com/wchevreuil/hbase/blob/HBASE-22539/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/SimpleServerRpcConnection.java#L271],
 which gives us a DBB, as you mentioned. I'm not sure how differently Netty
allocates a DBB instance, to the point where an *Unsafe.copyMemory* call could
damage the copy.
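To take the Unsafe path out of the picture while debugging, the copy from a heap or direct buffer into an on-heap array can be done with bounds-checked java.nio calls only. The helper below is a minimal sketch under that assumption: BufferCopy.copyToArray is a hypothetical name, not HBase's ByteBufferUtils API, and it only handles java.nio.ByteBuffer, not a Netty ByteBuf directly.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not HBase's ByteBufferUtils.
final class BufferCopy {
  private BufferCopy() {}

  /**
   * Copies len bytes starting at absolute index srcOff of src into dst[dstOff..].
   * The position and limit of src are left unchanged.
   */
  static void copyToArray(ByteBuffer src, int srcOff, byte[] dst, int dstOff, int len) {
    if (src.hasArray()) {
      // Heap buffer: copy straight from the backing array, honouring arrayOffset().
      System.arraycopy(src.array(), src.arrayOffset() + srcOff, dst, dstOff, len);
    } else {
      // Direct (off-heap) or read-only buffer: work on a duplicate so the caller's
      // position is untouched; get() checks bounds instead of a raw memory copy.
      ByteBuffer dup = src.duplicate();
      dup.position(srcOff);
      dup.get(dst, dstOff, len);
    }
  }
}
```

Because the direct-buffer branch throws on an out-of-range offset or length instead of silently reading adjacent memory, swapping it in would turn a bad offset into an exception rather than corrupted bytes, at some cost in copy speed.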

> Potential WAL corruption due to Unsafe.copyMemory usage when DBB are in place
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-22539
>                 URL: https://issues.apache.org/jira/browse/HBASE-22539
>             Project: HBase
>          Issue Type: Bug
>          Components: rpc, wal
>    Affects Versions: 2.1.1
>            Reporter: Wellington Chevreuil
>            Priority: Blocker
>
> Summary
> We had been chasing a WAL corruption issue reported on one of our customers'
> deployments running release 2.1.1 (CDH 6.1.0). After providing a custom
> modified jar with the extra sanity checks implemented by HBASE-21401 applied
> at some code points, plus additional debugging messages, we believe it is
> related to DirectByteBuffer usage, and to the Unsafe copy from off-heap memory
> to an on-heap array triggered
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java#L1157],
>  such as when writing into a non-ByteBufferWriter type, as done
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/io/ByteBufferWriterOutputStream.java#L84].
> More details in the following comment.
>  
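The write path described in the summary (a ByteBuffer region drained into an OutputStream that is not a ByteBufferWriter, one chunk at a time through an on-heap scratch array) can be sketched as below. This is a simplified, hypothetical illustration, not the branch-2.1 ByteBufferWriterOutputStream code: the class name and the 4096-byte chunk size are assumptions, and the bounds-checked ByteBuffer.get is swapped in where the real code goes through ByteBufferUtils and Unsafe.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

// Hypothetical, simplified illustration of a chunked ByteBuffer -> OutputStream write.
final class ChunkedBufferWriter {
  private static final int TEMP_BUF_LENGTH = 4096; // assumed chunk size
  private final OutputStream os;
  private final byte[] buf = new byte[TEMP_BUF_LENGTH]; // reused scratch array

  ChunkedBufferWriter(OutputStream os) {
    this.os = os;
  }

  /** Writes len bytes of b starting at absolute index off; b's position is not changed. */
  void write(ByteBuffer b, int off, int len) throws IOException {
    ByteBuffer src = b.duplicate();
    int totalCopied = 0;
    while (totalCopied < len) {
      int bytesToCopy = Math.min(len - totalCopied, buf.length);
      // Copy the next chunk into the on-heap scratch array (bounds-checked, no Unsafe).
      src.position(off + totalCopied);
      src.get(buf, 0, bytesToCopy);
      // Hand the on-heap copy to a stream that only accepts byte[].
      os.write(buf, 0, bytesToCopy);
      totalCopied += bytesToCopy;
    }
  }
}
```

Reusing one scratch array across chunks is only safe if os.write has finished with the bytes by the time it returns; that is the same assumption the thread is probing when it asks whether the stream could be changing this.buf.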



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
