[ https://issues.apache.org/jira/browse/HBASE-28584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Whitney Jackson updated HBASE-28584:
------------------------------------
Description:

I'm observing RS crashes under heavy replication load:

{code:java}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f7546873b69, pid=29890, tid=36828
#
# JRE version: Java(TM) SE Runtime Environment 18.9 (11.0.23+7) (build 11.0.23+7-LTS-222)
# Java VM: Java HotSpot(TM) 64-Bit Server VM 18.9 (11.0.23+7-LTS-222, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 24625 c2 org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V (75 bytes) @ 0x00007f7546873b69 [0x00007f7546873960+0x0000000000000209]
{code}

The heavier load comes when a replication peer has been disabled for several hours, e.g. for patching. When the peer is re-enabled, the replication load stays high until the peer has caught up. The crashes happen on the cluster receiving the replication edits.

I believe this problem started after upgrading from 2.4.x to 2.5.x.

One possibly relevant non-standard config I run with:

{code:xml}
<property>
  <name>hbase.region.store.parallel.put.limit</name>
  <!-- Default: 10 -->
  <value>100</value>
  <description>Added after seeing "failed to accept edits" replication errors
  in the destination region servers indicating this limit was being exceeded
  while trying to process replication edits.</description>
</property>
{code}

I understand from other Jiras that the problem is likely around direct memory usage by Netty. I haven't yet tried switching the Netty allocator to {{unpooled}} or {{heap}}, nor any of the {{io.netty.allocator.*}} options.

{{MaxDirectMemorySize}} is set to 26g.
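For the record, the allocator experiment I have in mind would look something like this in hbase-site.xml. This is untried here and assumes the {{hbase.netty.rpcserver.allocator}} property available in 2.5.x (the exact name and accepted values should be verified against the running version):

{code:xml}
<!-- Untried sketch: move the server-side Netty RPC allocator off pooled direct memory. -->
<property>
  <name>hbase.netty.rpcserver.allocator</name>
  <!-- Reportedly accepts "pooled" (default), "unpooled", or "heap". -->
  <value>heap</value>
</property>
{code}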
Here's the full stack for the relevant thread:

{code:java}
Stack: [0x00007f72e2e5f000,0x00007f72e2f60000], sp=0x00007f72e2f5e450, free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J 24625 c2 org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V (75 bytes) @ 0x00007f7546873b69 [0x00007f7546873960+0x0000000000000209]
J 26253 c2 org.apache.hadoop.hbase.ByteBufferKeyValue.write(Ljava/io/OutputStream;Z)I (21 bytes) @ 0x00007f7545af2d84 [0x00007f7545af2d20+0x0000000000000064]
J 22971 c2 org.apache.hadoop.hbase.codec.KeyValueCodecWithTags$KeyValueEncoder.write(Lorg/apache/hadoop/hbase/Cell;)V (27 bytes) @ 0x00007f754663f700 [0x00007f754663f4c0+0x0000000000000240]
J 25251 c2 org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.write(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (90 bytes) @ 0x00007f7546a53038 [0x00007f7546a50e60+0x00000000000021d8]
J 21182 c2 org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (73 bytes) @ 0x00007f7545f4d90c [0x00007f7545f4d3a0+0x000000000000056c]
J 21181 c2 org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(Ljava/lang/Object;ZLorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (149 bytes) @ 0x00007f7545fd680c [0x00007f7545fd65e0+0x000000000000022c]
J 25389 c2 org.apache.hadoop.hbase.ipc.NettyRpcConnection$$Lambda$247.run()V (16 bytes) @ 0x00007f7546ade660 [0x00007f7546ade140+0x0000000000000520]
J 24098 c2 org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(J)Z (109 bytes) @ 0x00007f754678fbb8 [0x00007f754678f8e0+0x00000000000002d8]
J 27297% c2 org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run()V (603 bytes) @ 0x00007f75466c4d48 [0x00007f75466c4c80+0x00000000000000c8]
j org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run()V+44
j org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run()V+11
j org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run()V+4
J 12278 c1 java.lang.Thread.run()V java.base@11.0.23 (17 bytes) @ 0x00007f753e11f084 [0x00007f753e11ef40+0x0000000000000144]
v ~StubRoutines::call_stub
V [libjvm.so+0x85574a] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+0x27a
V [libjvm.so+0x853d2e] JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*)+0x19e
V [libjvm.so+0x8ffddf] thread_entry(JavaThread*, Thread*)+0x9f
V [libjvm.so+0xdb68d1] JavaThread::thread_main_inner()+0x131
V [libjvm.so+0xdb2c4c] Thread::call_run()+0x13c
V [libjvm.so+0xc1f2e6] thread_native_entry(Thread*)+0xe6
{code}
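For anyone unfamiliar with the crashing frame: below is a simplified, hypothetical sketch of what a {{copyBufferToStream}}-style helper does (the real {{ByteBufferUtils}} implementation differs). The point is that for a direct {{ByteBuffer}}, every {{get()}} dereferences native memory, so if that memory has already been returned to Netty's pooled arena, the access faults in native code (SIGSEGV) instead of raising a Java exception:

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch only; not the actual HBase code.
public class CopySketch {
  static void copyBufferToStream(OutputStream out, ByteBuffer in,
                                 int offset, int length) throws IOException {
    if (in.hasArray()) {
      // Heap buffer: a plain copy through the backing byte[] is always safe.
      out.write(in.array(), in.arrayOffset() + offset, length);
    } else {
      // Direct buffer: each absolute get() reads the buffer's native memory.
      // If the allocator has already freed/reused that memory, this is the
      // kind of access that dies with a SIGSEGV in a compiled frame.
      for (int i = 0; i < length; i++) {
        out.write(in.get(offset + i));
      }
    }
  }

  public static void main(String[] args) throws IOException {
    ByteBuffer direct = ByteBuffer.allocateDirect(8);
    for (int i = 0; i < 8; i++) direct.put(i, (byte) i);
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    copyBufferToStream(bos, direct, 2, 4);
    System.out.println(Arrays.toString(bos.toByteArray())); // prints [2, 3, 4, 5]
  }
}
{code}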
> RS SIGSEGV under heavy replication load
> ---------------------------------------
>
>                 Key: HBASE-28584
>                 URL: https://issues.apache.org/jira/browse/HBASE-28584
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 2.5.6
>         Environment: RHEL 7.9
> JDK 11.0.23
> Hadoop 3.2.4
> HBase 2.5.6
>            Reporter: Whitney Jackson
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)