[jira] [Commented] (CASSANDRA-18762) Repair triggers OOM with direct buffer memory

Manish Khandelwal (Jira) Wed, 14 Feb 2024 22:30:04 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-18762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817590#comment-17817590
 ]


Manish Khandelwal commented on CASSANDRA-18762:
-----------------------------------------------

We are seeing the same issue on a multi-DC setup. With a single DC of 11 
nodes, repairs run fine, but once another DC is added they start to fail 
pretty quickly with the same error as reported in this issue. Running repair 
table by table succeeds most of the time, but keyspace-level repair always 
fails for one of the keyspaces. That keyspace has three tables, all using 
STCS, and one of them has almost no data. We tried setting 
*-XX:MaxDirectMemorySize*, but the result is the same: we still get the 
out-of-memory error. We are on Java 8 and Cassandra 4.0.10. I think this 
should be easy to reproduce with a multi-DC setup.
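
To check whether the *-XX:MaxDirectMemorySize* setting is actually picked up and 
how much direct buffer memory is in use, the JVM's standard management beans can 
be queried. The following is only a minimal sketch, not code from Cassandra: the 
class name DirectMemoryProbe and its method names are made up for illustration, 
and it inspects the JVM it runs in. For a running Cassandra node the same values 
should be readable remotely over JMX from the java.nio:type=BufferPool,name=direct 
MBean.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not part of Cassandra.
public class DirectMemoryProbe {

    /** Bytes currently used by the JVM's "direct" buffer pool, or -1 if the pool is not exposed. */
    public static long directMemoryUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    /** The -XX:MaxDirectMemorySize argument this JVM was started with, or null if it was not passed. */
    public static String maxDirectMemoryFlag() {
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            if (arg.startsWith("-XX:MaxDirectMemorySize")) {
                return arg;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("MaxDirectMemorySize flag: " + maxDirectMemoryFlag());

        long before = directMemoryUsed();
        // Same allocation path as the one at the top of the stack trace in this issue.
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024);
        long after = directMemoryUsed();

        System.out.println("direct pool used before: " + before + " bytes");
        System.out.println("direct pool used after:  " + after + " bytes");
        // Referencing the buffer here keeps it reachable while the pool is sampled.
        System.out.println("buffer capacity:         " + buffer.capacity() + " bytes");
    }
}
```

Sampling directMemoryUsed() periodically during a keyspace-level repair and 
during a table-level repair would show whether the direct pool only grows in 
the failing case.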

> Repair triggers OOM with direct buffer memory
> ---------------------------------------------
>
>                 Key: CASSANDRA-18762
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18762
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Brad Schoening
>            Priority: Normal
>              Labels: OutOfMemoryError
>         Attachments: Cluster-dm-metrics-1.PNG, 
> image-2023-12-06-15-28-05-459.png, image-2023-12-06-15-29-31-491.png, 
> image-2023-12-06-15-58-55-007.png
>
>
> We are seeing repeated failures of nodes with 16GB of heap on a VM with 32GB 
> of physical RAM due to direct memory exhaustion.  This seems to be related to 
> CASSANDRA-15202, which moved Merkle trees off-heap in 4.0.   Using Cassandra 
> 4.0.6 with Java 11.
> {noformat}
> 2023-08-09 04:30:57,470 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e55a3b0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_a from 
> /169.102.200.241:7000
> 2023-08-09 04:30:57,567 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e0d2900-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.93.192.29:7000
> 2023-08-09 04:30:57,568 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e1dcad0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_c from 
> /169.104.171.134:7000
> 2023-08-09 04:30:57,591 [INFO ] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 RepairSession.java:202 - [repair 
> #5e69a0e0-366d-11ee-a644-d91df26add5e] Received merkle tree for table_b from 
> /169.79.232.67:7000
> 2023-08-09 04:30:57,876 [INFO ] [Service Thread] cluster_id=101 
> ip_address=169.0.0.1 GCInspector.java:294 - G1 Old Generation GC in 282ms. 
> Compressed Class Space: 8444560 -> 8372152; G1 Eden Space: 7809794048 -> 0; 
> G1 Old Gen: 1453478400 -> 820942800; G1 Survivor Space: 419430400 -> 0; 
> Metaspace: 80411136 -> 80176528
> 2023-08-09 04:30:58,387 [ERROR] [AntiEntropyStage:1] cluster_id=101 
> ip_address=169.0.0.1 JVMStabilityInspector.java:102 - OutOfMemory error 
> letting the JVM handle the error:
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at org.apache.cassandra.utils.MerkleTree.allocate(MerkleTree.java:742)
> at 
> org.apache.cassandra.utils.MerkleTree.deserializeOffHeap(MerkleTree.java:780)
> at org.apache.cassandra.utils.MerkleTree.deserializeTree(MerkleTree.java:751)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:720)
> at org.apache.cassandra.utils.MerkleTree.deserialize(MerkleTree.java:698)
> at 
> org.apache.cassandra.utils.MerkleTrees$MerkleTreesSerializer.deserialize(MerkleTrees.java:416)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:100)
> at 
> org.apache.cassandra.repair.messages.ValidationResponse$1.deserialize(ValidationResponse.java:84)
> at 
> org.apache.cassandra.net.Message$Serializer.deserializePost40(Message.java:782)
> at org.apache.cassandra.net.Message$Serializer.deserialize(Message.java:642)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.deserialize(InboundMessageHandler.java:364)
> at 
> org.apache.cassandra.net.InboundMessageHandler$LargeMessage.access$1100(InboundMessageHandler.java:317)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessLargeMessage.provideMessage(InboundMessageHandler.java:504)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:429)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834)
> {noformat}
>  
> -XX:+AlwaysPreTouch
> -XX:+CrashOnOutOfMemoryError
> -XX:+ExitOnOutOfMemoryError
> -XX:+HeapDumpOnOutOfMemoryError
> -XX:+ParallelRefProcEnabled
> -XX:+PerfDisableSharedMem
> -XX:+ResizeTLAB
> -XX:+UseG1GC
> -XX:+UseNUMA
> -XX:+UseTLAB
> -XX:+UseThreadPriorities
> -XX:-UseBiasedLocking
> -XX:CompileCommandFile=/opt/nosql/clusters/cassandra-101/conf/hotspot_compiler
> -XX:G1RSetUpdatingPauseTimePercent=5
> -XX:G1ReservePercent=20
> -XX:HeapDumpPath=/opt/nosql/data/cluster_101/cassandra-1691623098-pid2804737.hprof
> -XX:InitiatingHeapOccupancyPercent=70
> -XX:MaxGCPauseMillis=200
> -XX:StringTableSize=60013
> -Xlog:gc*:file=/opt/nosql/clusters/cassandra-101/logs/gc.log:time,uptime:filecount=10,filesize=10485760
> -Xms16G
> -Xmx16G
> -Xss256k
>  
> Our Prometheus metrics show direct buffer memory ramping up until it reaches 
> the maximum, at which point it causes an OOM.  It would appear that direct 
> memory is never released by the JVM until it is exhausted.
>  
> !Cluster-dm-metrics.PNG!
> An Eclipse Memory Analyzer
> Class Histogram:
> ||Class Name||Objects||Shallow Heap||Retained Heap||
> |java.lang.Object[]|445,014|42,478,160|>= 4,603,280,344|
> |io.netty.util.concurrent.FastThreadLocalThread|167|21,376|>= 4,467,294,736|
> Leaks: Problem Suspect 1
> The thread *io.netty.util.concurrent.FastThreadLocalThread @ 0x501dd5930 
> AntiEntropyStage:1* keeps local variables with total size *4,295,042,472 
> (84.00%)* bytes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
