[jira] [Commented] (HBASE-26062) SIGSEGV in AsyncFSWAL consume

2021-07-21 Thread Anoop Sam John (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17385232#comment-17385232
 ] 

Anoop Sam John commented on HBASE-26062:


[~stack] no ASYNC_WAL durability?  If so we have other issues !!!

> SIGSEGV in AsyncFSWAL consume
> -
>
> Key: HBASE-26062
> URL: https://issues.apache.org/jira/browse/HBASE-26062
> Project: HBase
>  Issue Type: Bug
>Reporter: Michael Stack
>Priority: Major
>
> Seems related to the parent issue. Its happened a few times on one of our 
> clusters here. Below are two examples. Need more detail but perhaps the call 
> has timed out, the buffer has thus been freed, but the late consume on the 
> other side of the ringbuffer doesn't know that and goes ahead (Just 
> speculation).
>  
> {code:java}
> #  SIGSEGV (0xb) at pc=0x7f8b3ef5b77c, pid=37631, tid=0x7f61560ed700
> RAX=0xdf6e is an unknown valueRBX=0x7f8a38d7b6f8 is an 
> oopjava.nio.DirectByteBuffer - klass: 
> 'java/nio/DirectByteBuffer'RCX=0x7f60e2767898 is pointing into 
> metadataRDX=0x0de7 is an unknown valueRSP=0x7f61560ec6f0 is 
> pointing into the stack for thread: 0x7f8b3017b800RBP=[error occurred 
> during error reporting (printing register info), id 0xb]
> Stack: [0x7f6155fed000,0x7f61560ee000],  sp=0x7f61560ec6f0,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 23901 C2 
> java.util.stream.MatchOps$1MatchSink.accept(Ljava/lang/Object;)V (44 bytes) @ 
> 0x7f8b3ef5b77c [0x7f8b3ef5b640+0x13c]J 16165 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7f8b3d67b344 [0x7f8b3d67b2c0+0x84]J 16160 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7f8b3d67bc9c [0x7f8b3d67b900+0x39c]J 17729 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V
>  (10 bytes) @ 0x7f8b3fc39010 [0x7f8b3fc388a0+0x770]J 29991 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 
> bytes) @ 0x7f8b3fd03d90 [0x7f8b3fd039e0+0x3b0]J 20773 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 
> 0x7f8b40283728 [0x7f8b40283480+0x2a8]J 15191 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$76.run()V (8 
> bytes) @ 0x7f8b3ed69ecc [0x7f8b3ed69ea0+0x2c]J 17383% C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x7f8b3d9423f8 [0x7f8b3d942260+0x198]j  
> java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5j  
> java.lang.Thread.run()V+11v  ~StubRoutines::call_stubV  [libjvm.so+0x66b9ba]  
> JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, 
> Thread*)+0xe1aV  [libjvm.so+0x669073]  JavaCalls::call_virtual(JavaValue*, 
> KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263V  
> [libjvm.so+0x669647]  JavaCalls::call_virtual(JavaValue*, Handle, 
> KlassHandle, Symbol*, Symbol*, Thread*)+0x57V  [libjvm.so+0x6aaa4c]  
> thread_entry(JavaThread*, Thread*)+0x6cV  [libjvm.so+0xa224cb]  
> JavaThread::thread_main_inner()+0xdbV  [libjvm.so+0xa22816]  
> JavaThread::run()+0x316V  [libjvm.so+0x8c4202]  java_start(Thread*)+0x102C  
> [libpthread.so.0+0x76ba]  start_thread+0xca {code}
>  
> This one is from a month previous and has a deeper stack... we're trying to 
> read a Cell...
>  
> {code:java}
> Stack: [0x7fa1d5fb8000,0x7fa1d60b9000],  sp=0x7fa1d60b7660,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 30665 C2 
> org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[BII)Z
>  (59 bytes) @ 0x7fcc2d29eeb2 [0x7fcc2d29e7c0+0x6f2]J 25816 C2 
> org.apache.hadoop.hbase.CellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[B)Z
>  (28 bytes) @ 0x7fcc2a0430f8 [0x7fcc2a0430e0+0x18]J 17236 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener$$Lambda$254.test(Ljava/lang/Object;)Z
>  (8 bytes) @ 0x7fcc2b40bc68 [0x7fcc2b40bc20+0x48]J 13735 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7fcc2b7d936c [0x7fcc2b7d92c0+0xac]J 17162 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7fcc29bc05e8 [0x7fcc29bbfe80+0x768]J 16934 C2 
> org.apache.hadoop.hbase.replication.regionserver.Repl

[jira] [Commented] (HBASE-26062) SIGSEGV in AsyncFSWAL consume

2021-07-05 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17374957#comment-17374957
 ] 

Michael Stack commented on HBASE-26062:
---

{quote}...so should not be related to Scan/Lease expiry right?
{quote}
[~anoop.hbase] makes sense... thanks

It must be dbb compares. byte array compares are not going to backed w 
someone elses memory

i think heap usage high here.  i see some 1-3 second gc pausing.  I'm pretty 
sure no aync_wal but let me check (lots of clients...)  Thanks for taking a look

> SIGSEGV in AsyncFSWAL consume
> -
>
> Key: HBASE-26062
> URL: https://issues.apache.org/jira/browse/HBASE-26062
> Project: HBase
>  Issue Type: Bug
>Reporter: Michael Stack
>Priority: Major
>
> Seems related to the parent issue. Its happened a few times on one of our 
> clusters here. Below are two examples. Need more detail but perhaps the call 
> has timed out, the buffer has thus been freed, but the late consume on the 
> other side of the ringbuffer doesn't know that and goes ahead (Just 
> speculation).
>  
> {code:java}
> #  SIGSEGV (0xb) at pc=0x7f8b3ef5b77c, pid=37631, tid=0x7f61560ed700
> RAX=0xdf6e is an unknown valueRBX=0x7f8a38d7b6f8 is an 
> oopjava.nio.DirectByteBuffer - klass: 
> 'java/nio/DirectByteBuffer'RCX=0x7f60e2767898 is pointing into 
> metadataRDX=0x0de7 is an unknown valueRSP=0x7f61560ec6f0 is 
> pointing into the stack for thread: 0x7f8b3017b800RBP=[error occurred 
> during error reporting (printing register info), id 0xb]
> Stack: [0x7f6155fed000,0x7f61560ee000],  sp=0x7f61560ec6f0,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 23901 C2 
> java.util.stream.MatchOps$1MatchSink.accept(Ljava/lang/Object;)V (44 bytes) @ 
> 0x7f8b3ef5b77c [0x7f8b3ef5b640+0x13c]J 16165 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7f8b3d67b344 [0x7f8b3d67b2c0+0x84]J 16160 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7f8b3d67bc9c [0x7f8b3d67b900+0x39c]J 17729 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V
>  (10 bytes) @ 0x7f8b3fc39010 [0x7f8b3fc388a0+0x770]J 29991 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 
> bytes) @ 0x7f8b3fd03d90 [0x7f8b3fd039e0+0x3b0]J 20773 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 
> 0x7f8b40283728 [0x7f8b40283480+0x2a8]J 15191 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$76.run()V (8 
> bytes) @ 0x7f8b3ed69ecc [0x7f8b3ed69ea0+0x2c]J 17383% C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x7f8b3d9423f8 [0x7f8b3d942260+0x198]j  
> java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5j  
> java.lang.Thread.run()V+11v  ~StubRoutines::call_stubV  [libjvm.so+0x66b9ba]  
> JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, 
> Thread*)+0xe1aV  [libjvm.so+0x669073]  JavaCalls::call_virtual(JavaValue*, 
> KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263V  
> [libjvm.so+0x669647]  JavaCalls::call_virtual(JavaValue*, Handle, 
> KlassHandle, Symbol*, Symbol*, Thread*)+0x57V  [libjvm.so+0x6aaa4c]  
> thread_entry(JavaThread*, Thread*)+0x6cV  [libjvm.so+0xa224cb]  
> JavaThread::thread_main_inner()+0xdbV  [libjvm.so+0xa22816]  
> JavaThread::run()+0x316V  [libjvm.so+0x8c4202]  java_start(Thread*)+0x102C  
> [libpthread.so.0+0x76ba]  start_thread+0xca {code}
>  
> This one is from a month previous and has a deeper stack... we're trying to 
> read a Cell...
>  
> {code:java}
> Stack: [0x7fa1d5fb8000,0x7fa1d60b9000],  sp=0x7fa1d60b7660,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 30665 C2 
> org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[BII)Z
>  (59 bytes) @ 0x7fcc2d29eeb2 [0x7fcc2d29e7c0+0x6f2]J 25816 C2 
> org.apache.hadoop.hbase.CellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[B)Z
>  (28 bytes) @ 0x7fcc2a0430f8 [0x7fcc2a0430e0+0x18]J 17236 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener$$Lambda$254.test(Ljava/lang/Object;)Z
>  (8 bytes) @ 0x7fcc2b40bc68 [0x7fcc2b40bc20+0x48]J 13735 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7fcc2b7d93

[jira] [Commented] (HBASE-26062) SIGSEGV in AsyncFSWAL consume

2021-07-04 Thread Anoop Sam John (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17374321#comment-17374321
 ] 

Anoop Sam John commented on HBASE-26062:


The mentioned traces are from write to WAL so should not be related to 
Scan/Lease expiry right?
The trace is till
org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily
After that whether its BB compare or byte[] any idea?
In some cases we have seen issue while doing BB compare when the heap usage is 
very high! Did not go into root of it.  So not fully sure whether this is 
related to pooled DBB corruption or not
BTW whether the writes to WAL is with Durability ASYNC_WAL?  In that case there 
is a possible bug. HBASE-24984.   

> SIGSEGV in AsyncFSWAL consume
> -
>
> Key: HBASE-26062
> URL: https://issues.apache.org/jira/browse/HBASE-26062
> Project: HBase
>  Issue Type: Bug
>Reporter: Michael Stack
>Priority: Major
>
> Seems related to the parent issue. Its happened a few times on one of our 
> clusters here. Below are two examples. Need more detail but perhaps the call 
> has timed out, the buffer has thus been freed, but the late consume on the 
> other side of the ringbuffer doesn't know that and goes ahead (Just 
> speculation).
>  
> {code:java}
> #  SIGSEGV (0xb) at pc=0x7f8b3ef5b77c, pid=37631, tid=0x7f61560ed700
> RAX=0xdf6e is an unknown valueRBX=0x7f8a38d7b6f8 is an 
> oopjava.nio.DirectByteBuffer - klass: 
> 'java/nio/DirectByteBuffer'RCX=0x7f60e2767898 is pointing into 
> metadataRDX=0x0de7 is an unknown valueRSP=0x7f61560ec6f0 is 
> pointing into the stack for thread: 0x7f8b3017b800RBP=[error occurred 
> during error reporting (printing register info), id 0xb]
> Stack: [0x7f6155fed000,0x7f61560ee000],  sp=0x7f61560ec6f0,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 23901 C2 
> java.util.stream.MatchOps$1MatchSink.accept(Ljava/lang/Object;)V (44 bytes) @ 
> 0x7f8b3ef5b77c [0x7f8b3ef5b640+0x13c]J 16165 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7f8b3d67b344 [0x7f8b3d67b2c0+0x84]J 16160 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7f8b3d67bc9c [0x7f8b3d67b900+0x39c]J 17729 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V
>  (10 bytes) @ 0x7f8b3fc39010 [0x7f8b3fc388a0+0x770]J 29991 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 
> bytes) @ 0x7f8b3fd03d90 [0x7f8b3fd039e0+0x3b0]J 20773 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 
> 0x7f8b40283728 [0x7f8b40283480+0x2a8]J 15191 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$76.run()V (8 
> bytes) @ 0x7f8b3ed69ecc [0x7f8b3ed69ea0+0x2c]J 17383% C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x7f8b3d9423f8 [0x7f8b3d942260+0x198]j  
> java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5j  
> java.lang.Thread.run()V+11v  ~StubRoutines::call_stubV  [libjvm.so+0x66b9ba]  
> JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, 
> Thread*)+0xe1aV  [libjvm.so+0x669073]  JavaCalls::call_virtual(JavaValue*, 
> KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263V  
> [libjvm.so+0x669647]  JavaCalls::call_virtual(JavaValue*, Handle, 
> KlassHandle, Symbol*, Symbol*, Thread*)+0x57V  [libjvm.so+0x6aaa4c]  
> thread_entry(JavaThread*, Thread*)+0x6cV  [libjvm.so+0xa224cb]  
> JavaThread::thread_main_inner()+0xdbV  [libjvm.so+0xa22816]  
> JavaThread::run()+0x316V  [libjvm.so+0x8c4202]  java_start(Thread*)+0x102C  
> [libpthread.so.0+0x76ba]  start_thread+0xca {code}
>  
> This one is from a month previous and has a deeper stack... we're trying to 
> read a Cell...
>  
> {code:java}
> Stack: [0x7fa1d5fb8000,0x7fa1d60b9000],  sp=0x7fa1d60b7660,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 30665 C2 
> org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[BII)Z
>  (59 bytes) @ 0x7fcc2d29eeb2 [0x7fcc2d29e7c0+0x6f2]J 25816 C2 
> org.apache.hadoop.hbase.CellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[B)Z
>  (28 bytes) @ 0x7fcc2a0430f8 [0x7fcc2a0430e0+0x18]J 17236 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener$$Lambda$254.test(Ljava/lang/Object;)Z
>  (8 bytes) @ 0x0

[jira] [Commented] (HBASE-26062) SIGSEGV in AsyncFSWAL consume

2021-07-02 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373857#comment-17373857
 ] 

Michael Stack commented on HBASE-26062:
---

Only thing around the crash are these:

2021-07-01 02:31:57,075 INFO [regionserver/ps1366:16020.leaseChecker] 
regionserver.RSRpcServices: Scanner 6368110781038228097 lease expired on region 
X,a@\x00\x1F\xEC"\xC1\xD2\xA8nFrq\xE3P\xAD,1615345092618.8637491069310348cd6667c89af4b16b.

Close of the scanner releases buffer later used by AsyncFSWAL#consume?

> SIGSEGV in AsyncFSWAL consume
> -
>
> Key: HBASE-26062
> URL: https://issues.apache.org/jira/browse/HBASE-26062
> Project: HBase
>  Issue Type: Bug
>Reporter: Michael Stack
>Priority: Major
>
> Seems related to the parent issue. Its happened a few times on one of our 
> clusters here. Below are two examples. Need more detail but perhaps the call 
> has timed out, the buffer has thus been freed, but the late consume on the 
> other side of the ringbuffer doesn't know that and goes ahead (Just 
> speculation).
>  
> {code:java}
> #  SIGSEGV (0xb) at pc=0x7f8b3ef5b77c, pid=37631, tid=0x7f61560ed700
> RAX=0xdf6e is an unknown valueRBX=0x7f8a38d7b6f8 is an 
> oopjava.nio.DirectByteBuffer - klass: 
> 'java/nio/DirectByteBuffer'RCX=0x7f60e2767898 is pointing into 
> metadataRDX=0x0de7 is an unknown valueRSP=0x7f61560ec6f0 is 
> pointing into the stack for thread: 0x7f8b3017b800RBP=[error occurred 
> during error reporting (printing register info), id 0xb]
> Stack: [0x7f6155fed000,0x7f61560ee000],  sp=0x7f61560ec6f0,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 23901 C2 
> java.util.stream.MatchOps$1MatchSink.accept(Ljava/lang/Object;)V (44 bytes) @ 
> 0x7f8b3ef5b77c [0x7f8b3ef5b640+0x13c]J 16165 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7f8b3d67b344 [0x7f8b3d67b2c0+0x84]J 16160 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7f8b3d67bc9c [0x7f8b3d67b900+0x39c]J 17729 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V
>  (10 bytes) @ 0x7f8b3fc39010 [0x7f8b3fc388a0+0x770]J 29991 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 
> bytes) @ 0x7f8b3fd03d90 [0x7f8b3fd039e0+0x3b0]J 20773 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 
> 0x7f8b40283728 [0x7f8b40283480+0x2a8]J 15191 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$76.run()V (8 
> bytes) @ 0x7f8b3ed69ecc [0x7f8b3ed69ea0+0x2c]J 17383% C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x7f8b3d9423f8 [0x7f8b3d942260+0x198]j  
> java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5j  
> java.lang.Thread.run()V+11v  ~StubRoutines::call_stubV  [libjvm.so+0x66b9ba]  
> JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, 
> Thread*)+0xe1aV  [libjvm.so+0x669073]  JavaCalls::call_virtual(JavaValue*, 
> KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263V  
> [libjvm.so+0x669647]  JavaCalls::call_virtual(JavaValue*, Handle, 
> KlassHandle, Symbol*, Symbol*, Thread*)+0x57V  [libjvm.so+0x6aaa4c]  
> thread_entry(JavaThread*, Thread*)+0x6cV  [libjvm.so+0xa224cb]  
> JavaThread::thread_main_inner()+0xdbV  [libjvm.so+0xa22816]  
> JavaThread::run()+0x316V  [libjvm.so+0x8c4202]  java_start(Thread*)+0x102C  
> [libpthread.so.0+0x76ba]  start_thread+0xca {code}
>  
> This one is from a month previous and has a deeper stack... we're trying to 
> read a Cell...
>  
> {code:java}
> Stack: [0x7fa1d5fb8000,0x7fa1d60b9000],  sp=0x7fa1d60b7660,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 30665 C2 
> org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[BII)Z
>  (59 bytes) @ 0x7fcc2d29eeb2 [0x7fcc2d29e7c0+0x6f2]J 25816 C2 
> org.apache.hadoop.hbase.CellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[B)Z
>  (28 bytes) @ 0x7fcc2a0430f8 [0x7fcc2a0430e0+0x18]J 17236 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener$$Lambda$254.test(Ljava/lang/Object;)Z
>  (8 bytes) @ 0x7fcc2b40bc68 [0x7fcc2b40bc20+0x48]J 13735 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7fcc2b7d936c [0x0

[jira] [Commented] (HBASE-26062) SIGSEGV in AsyncFSWAL consume

2021-07-02 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373806#comment-17373806
 ] 

Michael Stack commented on HBASE-26062:
---

Made this an issue. I don't think it related to HBASE-26042. I think it 
something else.

TODO: See if rpc timeouts around time of this crash. If so, try and see if the 
inbound rpc had a trailing cellblock sidecar. If so, perhaps this code suspect 
in ServerRpcConnection:
{code:java}
if (header.hasCellBlockMeta()) {
  buf.position(offset);
  ByteBuff dup = buf.duplicate();
  dup.limit(offset + header.getCellBlockMeta().getLength());
  cellScanner = this.rpcServer.cellBlockBuilder.createCellScannerReusingBuffers(
  this.codec, this.compressionCodec, dup);
} {code}

> SIGSEGV in AsyncFSWAL consume
> -
>
> Key: HBASE-26062
> URL: https://issues.apache.org/jira/browse/HBASE-26062
> Project: HBase
>  Issue Type: Bug
>Reporter: Michael Stack
>Priority: Major
>
> Seems related to the parent issue. Its happened a few times on one of our 
> clusters here. Below are two examples. Need more detail but perhaps the call 
> has timed out, the buffer has thus been freed, but the late consume on the 
> other side of the ringbuffer doesn't know that and goes ahead (Just 
> speculation).
>  
> {code:java}
> #  SIGSEGV (0xb) at pc=0x7f8b3ef5b77c, pid=37631, tid=0x7f61560ed700
> RAX=0xdf6e is an unknown valueRBX=0x7f8a38d7b6f8 is an 
> oopjava.nio.DirectByteBuffer - klass: 
> 'java/nio/DirectByteBuffer'RCX=0x7f60e2767898 is pointing into 
> metadataRDX=0x0de7 is an unknown valueRSP=0x7f61560ec6f0 is 
> pointing into the stack for thread: 0x7f8b3017b800RBP=[error occurred 
> during error reporting (printing register info), id 0xb]
> Stack: [0x7f6155fed000,0x7f61560ee000],  sp=0x7f61560ec6f0,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 23901 C2 
> java.util.stream.MatchOps$1MatchSink.accept(Ljava/lang/Object;)V (44 bytes) @ 
> 0x7f8b3ef5b77c [0x7f8b3ef5b640+0x13c]J 16165 C2 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(Ljava/util/function/Consumer;)Z
>  (79 bytes) @ 0x7f8b3d67b344 [0x7f8b3d67b2c0+0x84]J 16160 C2 
> java.util.stream.MatchOps$MatchOp.evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
>  (7 bytes) @ 0x7f8b3d67bc9c [0x7f8b3d67b900+0x39c]J 17729 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.visitLogEntryBeforeWrite(Lorg/apache/hadoop/hbase/wal/WALKey;Lorg/apache/hadoop/hbase/wal/WALEdit;)V
>  (10 bytes) @ 0x7f8b3fc39010 [0x7f8b3fc388a0+0x770]J 29991 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync()V (261 
> bytes) @ 0x7f8b3fd03d90 [0x7f8b3fd039e0+0x3b0]J 20773 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume()V (474 bytes) @ 
> 0x7f8b40283728 [0x7f8b40283480+0x2a8]J 15191 C2 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL$$Lambda$76.run()V (8 
> bytes) @ 0x7f8b3ed69ecc [0x7f8b3ed69ea0+0x2c]J 17383% C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x7f8b3d9423f8 [0x7f8b3d942260+0x198]j  
> java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5j  
> java.lang.Thread.run()V+11v  ~StubRoutines::call_stubV  [libjvm.so+0x66b9ba]  
> JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, 
> Thread*)+0xe1aV  [libjvm.so+0x669073]  JavaCalls::call_virtual(JavaValue*, 
> KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x263V  
> [libjvm.so+0x669647]  JavaCalls::call_virtual(JavaValue*, Handle, 
> KlassHandle, Symbol*, Symbol*, Thread*)+0x57V  [libjvm.so+0x6aaa4c]  
> thread_entry(JavaThread*, Thread*)+0x6cV  [libjvm.so+0xa224cb]  
> JavaThread::thread_main_inner()+0xdbV  [libjvm.so+0xa22816]  
> JavaThread::run()+0x316V  [libjvm.so+0x8c4202]  java_start(Thread*)+0x102C  
> [libpthread.so.0+0x76ba]  start_thread+0xca {code}
>  
> This one is from a month previous and has a deeper stack... we're trying to 
> read a Cell...
>  
> {code:java}
> Stack: [0x7fa1d5fb8000,0x7fa1d60b9000],  sp=0x7fa1d60b7660,  free 
> space=1021kNative frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
> C=native code)J 30665 C2 
> org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[BII)Z
>  (59 bytes) @ 0x7fcc2d29eeb2 [0x7fcc2d29e7c0+0x6f2]J 25816 C2 
> org.apache.hadoop.hbase.CellUtil.matchingFamily(Lorg/apache/hadoop/hbase/Cell;[B)Z
>  (28 bytes) @ 0x7fcc2a0430f8 [0x7fcc2a0430e0+0x18]J 17236 C2 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener$$Lambda$254.test(L