[jira] [Commented] (HBASE-26092) JVM core dump in the replication path

Anoop Sam John (Jira) Thu, 05 Aug 2021 01:34:16 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-26092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393718#comment-17393718
 ]


Anoop Sam John commented on HBASE-26092:
----------------------------------------

One open question at this point:
On the ReplicationSink side, we override the default client configs for
retries and op timeout, defaulting them to 4 retries and 10 sec.  So after
10 sec the op will time out, the table.batch() call will return as failed,
end this RPC, and possibly release the BB (ByteBuffer).  On the Netty RPC
client side, NettyRpcConnection#sendRequest0 is executed by another thread,
and it may still be operating on an already delayed next batch call (which
has also timed out by now).  I will first try some hacks in the code to
repro the issue and confirm the theory.  I am still working on this issue.
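
To make the suspected ordering concrete, here is a minimal single-threaded
sketch in Java. It is not HBase or Netty code: RefCountedBuffer,
QueuedCallSketch, and the ArrayDeque standing in for the Netty event loop are
hypothetical names introduced only for illustration. It assumes the ref
counting direction suggested in the issue description and shows a queued
write that holds its own reference, so the buffer is not reusable when the
handler times out and releases its reference first.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical pooled buffer with a reference count; not an HBase or Netty class. */
final class RefCountedBuffer {
    private final ByteBuffer buf;
    private final AtomicInteger refCnt = new AtomicInteger(1); // the owner holds one reference

    RefCountedBuffer(int capacity) {
        this.buf = ByteBuffer.allocate(capacity);
    }

    /** Takes an extra reference; fails if the buffer was already handed back. */
    RefCountedBuffer retain() {
        int c;
        do {
            c = refCnt.get();
            if (c == 0) {
                throw new IllegalStateException("buffer already released");
            }
        } while (!refCnt.compareAndSet(c, c + 1));
        return this;
    }

    /** Drops one reference; returns true when the last one is gone and reuse is safe. */
    boolean release() {
        int c = refCnt.decrementAndGet();
        if (c < 0) {
            throw new IllegalStateException("released more often than retained");
        }
        return c == 0;
    }

    /** Reads only while at least one reference is still held. */
    int readInt(int index) {
        if (refCnt.get() == 0) {
            throw new IllegalStateException("read after release");
        }
        return buf.getInt(index);
    }

    void writeInt(int index, int value) {
        buf.putInt(index, value);
    }
}

final class QueuedCallSketch {
    /**
     * Replays the suspected ordering on one thread: the handler queues a write,
     * times out and releases its reference, and only then does the "event loop" run.
     * Returns the value the delayed write observed.
     */
    static int delayedWriteAfterTimeout() {
        Queue<Runnable> eventLoop = new ArrayDeque<>(); // stand-in for the Netty event loop
        int[] seen = new int[1];

        RefCountedBuffer cell = new RefCountedBuffer(Integer.BYTES);
        cell.writeInt(0, 42);

        // The queued call takes its own reference before the handler can return.
        RefCountedBuffer queuedRef = cell.retain();
        eventLoop.add(() -> {
            try {
                seen[0] = queuedRef.readInt(0);
            } finally {
                queuedRef.release(); // last reference: only now may the pool reuse it
            }
        });

        // Handler side: the op timed out, so the handler drops its reference.
        boolean reusable = cell.release();
        if (reusable) {
            throw new IllegalStateException("buffer must stay alive while a call is queued");
        }

        // The delayed write finally runs on the event loop.
        eventLoop.remove().run();
        return seen[0];
    }
}
```

Without the retain() call, the handler's release() would drop the count to
zero and the delayed read would fail with "read after release", which is the
checked analogue of the crash in ByteBufferKeyValue.write in the stack below.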

> JVM core dump in the replication path
> -------------------------------------
>
>                 Key: HBASE-26092
>                 URL: https://issues.apache.org/jira/browse/HBASE-26092
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.3.5
>            Reporter: Huaxiang Sun
>            Priority: Critical
>
> When replication is turned on, we found the following core dump in the region 
> server. 
> I checked the core dump for replication and have some ideas about the cause. 
> For replication, when an RS receives walEdits from the remote cluster, it 
> needs to send them out to the final RS. In this case NettyRpcConnection is 
> used, and calls are queued while they still refer to a ByteBuffer in the 
> context of the replicationHandler (which is returned to the pool once the 
> handler returns). The core dump happens because the ByteBuffer has been 
> reused. This asynchronous processing needs ref counting.
>  
> Feel free to take it; otherwise, I will try to work on a patch later.
>  
>  
> {code:java}
> Stack: [0x00007fb1bf039000,0x00007fb1bf13a000],  sp=0x00007fb1bf138560,  free space=1021k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> J 28175 C2 org.apache.hadoop.hbase.ByteBufferKeyValue.write(Ljava/io/OutputStream;Z)I (21 bytes) @ 0x00007fdbbbb2663c [0x00007fdbbbb263c0+0x27c]
> J 14912 C2 org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.writeRequest(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Lorg/apache/hadoop/hbase/ipc/Call;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (370 bytes) @ 0x00007fdbbb94b590 [0x00007fdbbb949c00+0x1990]
> J 14911 C2 org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.write(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (30 bytes) @ 0x00007fdbb972d1d4 [0x00007fdbb972d1a0+0x34]
> J 30476 C2 org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(Ljava/lang/Object;ZLorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (149 bytes) @ 0x00007fdbbd4e7084 [0x00007fdbbd4e6900+0x784]
> J 14914 C2 org.apache.hadoop.hbase.ipc.NettyRpcConnection$6$1.run()V (22 bytes) @ 0x00007fdbbb9344ec [0x00007fdbbb934280+0x26c]
> J 23528 C2 org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(J)Z (106 bytes) @ 0x00007fdbbcbb0efc [0x00007fdbbcbb0c40+0x2bc]
> J 15987% C2 org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run()V (461 bytes) @ 0x00007fdbbbaf1580 [0x00007fdbbbaf1360+0x220]
> j  org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run()V+44
> j  org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run()V+11
> j  org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run()V+4
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
