[ 
https://issues.apache.org/jira/browse/CASSANDRA-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Toews updated CASSANDRA-17025:
-----------------------------------
    Description: 
Version: 3.11.11

Cluster: 9 nodes (1 dc, 3 racks) on aws r5.4xlarge nodes (16 vCPU, 128GB mem)

Heap size: 20G

Direct memory buffer: 24G

 

We stood up this cluster a few months ago in order to migrate an old 2.0 
cluster. Since then, after about 7-10 days a node will begin to experience long 
old gc cycles  before logging many  "java.lang.OutOfMemoryError: Direct buffer 
memory" exceptions. When the long gc cycles start, the entire cluster becomes 
unresponsive (our application is unable to make queries to any node).

Restarting cassandra on the failing node resolves the problem, then we have to 
restart every other node in the cluster to prevent them from getting into the 
same state.

 

We have attempted to:
 * Increase -XX:MaxDirectMemorySize
 * Increase -Djdk.nio.maxCachedBufferSize
 * Update cassandra from 3.11.10 to 3.11.11

None of these have resolved the problem.

Since the last failure, we have increased -XX:MaxDirectMemorySize again, and 
are waiting to see if that has any effect.

 

Old gc collections from system.log:
{noformat}
INFO  [Service Thread] 2021-10-04 15:24:04,973 GCInspector.java:285 - G1 Old 
Generation GC in 4447ms.  Compressed Class Space: 6683064 -> 6677952; G1 Eden 
Space: 16777216 -> 0; G1 Old Gen: 5431375384 -> 745163912; G1 Survivor Space: 
419430400 -> 0; Metaspace: 54716768 ->
...
INFO  [Service Thread] 2021-10-04 15:24:06,985 GCInspector.java:285 - G1 Old 
Generation GC in 1901ms.  G1 Old Gen: 745163912 -> 745168360;
...
INFO  [Service Thread] 2021-10-04 15:24:09,306 GCInspector.java:285 - G1 Old 
Generation GC in 2046ms.  G1 Eden Space: 528482304 -> 0; G1 Old Gen: 759785184 
-> 761229864;
...
INFO  [Service Thread] 2021-10-04 15:24:14,749 GCInspector.java:285 - G1 Old 
Generation GC in 5403ms.  G1 Old Gen: 761229864 -> 762168640;
...
INFO  [Service Thread] 2021-10-04 15:24:16,782 GCInspector.java:285 - G1 Old 
Generation GC in 1949ms.  G1 Old Gen: 762168640 -> 762167640;
...
INFO  [Service Thread] 2021-10-04 15:25:09,406 GCInspector.java:285 - G1 Old 
Generation GC in 52302ms.  G1 Eden Space: 8388608 -> 0; G1 Old Gen: 762167640 
-> 762168160;
...
INFO  [Service Thread] 2021-10-04 15:25:15,011 GCInspector.java:285 - G1 Old 
Generation GC in 5522ms.  G1 Eden Space: 192937984 -> 0; G1 Old Gen: 762168160 
-> 770098088;
...
INFO  [Service Thread] 2021-10-04 15:25:31,453 GCInspector.java:285 - G1 Old 
Generation GC in 16310ms.  G1 Eden Space: 201326592 -> 0; G1 Old Gen: 770098088 
-> 769228400;
...
INFO  [Service Thread] 2021-10-04 15:25:33,597 GCInspector.java:285 - G1 Old 
Generation GC in 1984ms.  G1 Eden Space: 352321536 -> 0; G1 Old Gen: 750824952 
-> 751118968;
...
INFO  [Service Thread] 2021-10-04 15:25:50,152 GCInspector.java:285 - G1 Old 
Generation GC in 16411ms.  G1 Eden Space: 8388608 -> 0; G1 Old Gen: 751118968 
-> 751645056;
{noformat}
Example of direct memory oom from system.log
{noformat}
INFO  [ScheduledTasks:1] 2021-10-04 15:25:31,484 MessagingService.java:1246 - 
READ messages were dropped in last 5000 ms: 2 internal and 7 cross node. Mean 
internal dropped latency: 47238 ms and Mean cross-node dropped latency: 45171 ms
INFO  [ScheduledTasks:1] 2021-10-04 15:25:31,484 MessagingService.java:1246 - 
COUNTER_MUTATION messages were dropped in last 5000 ms: 1 internal and 60 cross 
node. Mean internal dropped latency: 7289 ms and Mean cross-node dropped 
latency: 9509 ms
ERROR [CounterMutationStage-464] 2021-10-04 15:25:31,484 
JVMStabilityInspector.java:94 - OutOfMemory error letting the JVM handle the 
error:
java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:695) ~[na:1.8.0_292]
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) 
~[na:1.8.0_292]
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) 
~[na:1.8.0_292]
        at 
org.apache.cassandra.utils.memory.BufferPool.allocate(BufferPool.java:114) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.utils.memory.BufferPool.access$1000(BufferPool.java:50) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.utils.memory.BufferPool$LocalPool.allocate(BufferPool.java:408)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.utils.memory.BufferPool$LocalPool.access$000(BufferPool.java:335)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.utils.memory.BufferPool.takeFromPool(BufferPool.java:126) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at org.apache.cassandra.utils.memory.BufferPool.get(BufferPool.java:98) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at org.apache.cassandra.cache.ChunkCache.load(ChunkCache.java:156) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at org.apache.cassandra.cache.ChunkCache.load(ChunkCache.java:39) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalLoadingCache.lambda$new$0(BoundedLocalCache.java:2949)
 ~[caffeine-2.2.6.jar:na]
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$15(BoundedLocalCache.java:1807)
 ~[caffeine-2.2.6.jar:na]
        at 
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1892) 
~[na:1.8.0_292]
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:1805)
 ~[caffeine-2.2.6.jar:na]
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:1788)
 ~[caffeine-2.2.6.jar:na]
        at 
com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:97)
 ~[caffeine-2.2.6.jar:na]
        at 
com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:66)
 ~[caffeine-2.2.6.jar:na]
        at 
org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:236)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:214)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.io.util.RandomAccessReader.reBufferAt(RandomAccessReader.java:65)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.io.util.RandomAccessReader.seek(RandomAccessReader.java:207)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.io.util.FileHandle.createReader(FileHandle.java:150) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.io.sstable.format.SSTableReader.getFileDataInput(SSTableReader.java:1801)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.columniterator.AbstractSSTableIterator.<init>(AbstractSSTableIterator.java:103)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.columniterator.SSTableIterator.<init>(SSTableIterator.java:49)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.io.sstable.format.big.BigTableReader.iterator(BigTableReader.java:72)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.io.sstable.format.big.BigTableReader.iterator(BigTableReader.java:65)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.initializeIterator(UnfilteredRowIteratorWithLowerBound.java:108)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.maybeInit(LazilyInitializedUnfilteredRowIterator.java:48)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.partitionLevelDeletion(LazilyInitializedUnfilteredRowIterator.java:81)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.partitionLevelDeletion(UnfilteredRowIteratorWithLowerBound.java:170)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDiskInternal(SinglePartitionReadCommand.java:762)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDisk(SinglePartitionReadCommand.java:673)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.CounterMutation.updateWithCurrentValuesFromCFS(CounterMutation.java:258)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.CounterMutation.processModifications(CounterMutation.java:202)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.CounterMutation.applyCounterMutation(CounterMutation.java:123)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.service.StorageProxy$9.runMayThrow(StorageProxy.java:1650) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2771)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.service.StorageProxy$3.apply(StorageProxy.java:160) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:1201) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.service.StorageProxy.applyCounterMutationOnLeader(StorageProxy.java:1627)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.CounterMutationVerbHandler.doVerb(CounterMutationVerbHandler.java:48)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
~[na:1.8.0_292]
        at 
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
 [apache-cassandra-3.11.11.jar:3.11.11]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:113) 
[apache-cassandra-3.11.11.jar:3.11.11]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_292]
{noformat}
Example of full gc from from gc.log
{noformat}
2021-10-04T15:25:46.467+0000: 494369.005: [Full GC (System.gc())  
727M->726M(20480M), 1.8053704 secs]
   [Eden: 8192.0K(12288.0M)->0.0B(12288.0M) Survivors: 0.0B->0.0B Heap: 
727.3M(20480.0M)->726.2M(20480.0M)], [Metaspace: 53437K->53437K(1099776K)]
Heap after GC invocations=14150 (full 57):
 garbage-first heap   total 20971520K, used 743652K [0x00000002c0000000, 
0x00000002c0805000, 0x00000007c0000000)
  region size 8192K, 0 young (0K), 0 survivors (0K)
 Metaspace       used 53437K, capacity 55970K, committed 56528K, reserved 
1099776K
  class space    used 6522K, capacity 6987K, committed 7120K, reserved 1048576K
}
 [Times: user=2.99 sys=0.00, real=1.81 secs]
2021-10-04T15:25:48.273+0000: 494370.811: Total time for which application 
threads were stopped: 1.8099675 seconds, Stopping threads took: 0.0003245 
seconds
{Heap before GC invocations=14150 (full 57):
 garbage-first heap   total 20971520K, used 743652K [0x00000002c0000000, 
0x00000002c0805000, 0x00000007c0000000)
  region size 8192K, 1 young (8192K), 0 survivors (0K)
 Metaspace       used 53437K, capacity 55970K, committed 56528K, reserved 
1099776K
  class space    used 6522K, capacity 6987K, committed 7120K, reserved 1048576K
{noformat}

  was:
Version: 3.11.11

Cluster: 9 nodes (1 dc, 3 racks) on aws r5.4xlarge nodes (16 vCPU, 128GB mem)

Heap size: 20G

Direct memory buffer: 24G

 

We stood up this cluster a few months ago in order to migrate an old 2.1 
cluster. Since then, after about 7-10 days a node will begin to experience long 
old gc cycles  before logging many  "java.lang.OutOfMemoryError: Direct buffer 
memory" exceptions. When the long gc cycles start, the entire cluster becomes 
unresponsive (our application is unable to make queries to any node).

Restarting cassandra on the failing node resolves the problem, then we have to 
restart every other node in the cluster to prevent them from getting into the 
same state.

 

We have attempted to:
 * Increase -XX:MaxDirectMemorySize
 * Increase -Djdk.nio.maxCachedBufferSize
 * Update cassandra from 3.11.10 to 3.11.11

None of these have resolved the problem.

Since the last failure, we have increased -XX:MaxDirectMemorySize again, and 
are waiting to see if that has any effect.

 

Old gc collections from system.log:
{noformat}
INFO  [Service Thread] 2021-10-04 15:24:04,973 GCInspector.java:285 - G1 Old 
Generation GC in 4447ms.  Compressed Class Space: 6683064 -> 6677952; G1 Eden 
Space: 16777216 -> 0; G1 Old Gen: 5431375384 -> 745163912; G1 Survivor Space: 
419430400 -> 0; Metaspace: 54716768 ->
...
INFO  [Service Thread] 2021-10-04 15:24:06,985 GCInspector.java:285 - G1 Old 
Generation GC in 1901ms.  G1 Old Gen: 745163912 -> 745168360;
...
INFO  [Service Thread] 2021-10-04 15:24:09,306 GCInspector.java:285 - G1 Old 
Generation GC in 2046ms.  G1 Eden Space: 528482304 -> 0; G1 Old Gen: 759785184 
-> 761229864;
...
INFO  [Service Thread] 2021-10-04 15:24:14,749 GCInspector.java:285 - G1 Old 
Generation GC in 5403ms.  G1 Old Gen: 761229864 -> 762168640;
...
INFO  [Service Thread] 2021-10-04 15:24:16,782 GCInspector.java:285 - G1 Old 
Generation GC in 1949ms.  G1 Old Gen: 762168640 -> 762167640;
...
INFO  [Service Thread] 2021-10-04 15:25:09,406 GCInspector.java:285 - G1 Old 
Generation GC in 52302ms.  G1 Eden Space: 8388608 -> 0; G1 Old Gen: 762167640 
-> 762168160;
...
INFO  [Service Thread] 2021-10-04 15:25:15,011 GCInspector.java:285 - G1 Old 
Generation GC in 5522ms.  G1 Eden Space: 192937984 -> 0; G1 Old Gen: 762168160 
-> 770098088;
...
INFO  [Service Thread] 2021-10-04 15:25:31,453 GCInspector.java:285 - G1 Old 
Generation GC in 16310ms.  G1 Eden Space: 201326592 -> 0; G1 Old Gen: 770098088 
-> 769228400;
...
INFO  [Service Thread] 2021-10-04 15:25:33,597 GCInspector.java:285 - G1 Old 
Generation GC in 1984ms.  G1 Eden Space: 352321536 -> 0; G1 Old Gen: 750824952 
-> 751118968;
...
INFO  [Service Thread] 2021-10-04 15:25:50,152 GCInspector.java:285 - G1 Old 
Generation GC in 16411ms.  G1 Eden Space: 8388608 -> 0; G1 Old Gen: 751118968 
-> 751645056;
{noformat}
Example of direct memory oom from system.log
{noformat}
INFO  [ScheduledTasks:1] 2021-10-04 15:25:31,484 MessagingService.java:1246 - 
READ messages were dropped in last 5000 ms: 2 internal and 7 cross node. Mean 
internal dropped latency: 47238 ms and Mean cross-node dropped latency: 45171 ms
INFO  [ScheduledTasks:1] 2021-10-04 15:25:31,484 MessagingService.java:1246 - 
COUNTER_MUTATION messages were dropped in last 5000 ms: 1 internal and 60 cross 
node. Mean internal dropped latency: 7289 ms and Mean cross-node dropped 
latency: 9509 ms
ERROR [CounterMutationStage-464] 2021-10-04 15:25:31,484 
JVMStabilityInspector.java:94 - OutOfMemory error letting the JVM handle the 
error:
java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:695) ~[na:1.8.0_292]
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) 
~[na:1.8.0_292]
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) 
~[na:1.8.0_292]
        at 
org.apache.cassandra.utils.memory.BufferPool.allocate(BufferPool.java:114) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.utils.memory.BufferPool.access$1000(BufferPool.java:50) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.utils.memory.BufferPool$LocalPool.allocate(BufferPool.java:408)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.utils.memory.BufferPool$LocalPool.access$000(BufferPool.java:335)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.utils.memory.BufferPool.takeFromPool(BufferPool.java:126) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at org.apache.cassandra.utils.memory.BufferPool.get(BufferPool.java:98) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at org.apache.cassandra.cache.ChunkCache.load(ChunkCache.java:156) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at org.apache.cassandra.cache.ChunkCache.load(ChunkCache.java:39) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalLoadingCache.lambda$new$0(BoundedLocalCache.java:2949)
 ~[caffeine-2.2.6.jar:na]
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$15(BoundedLocalCache.java:1807)
 ~[caffeine-2.2.6.jar:na]
        at 
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1892) 
~[na:1.8.0_292]
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:1805)
 ~[caffeine-2.2.6.jar:na]
        at 
com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:1788)
 ~[caffeine-2.2.6.jar:na]
        at 
com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:97)
 ~[caffeine-2.2.6.jar:na]
        at 
com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:66)
 ~[caffeine-2.2.6.jar:na]
        at 
org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:236)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:214)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.io.util.RandomAccessReader.reBufferAt(RandomAccessReader.java:65)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.io.util.RandomAccessReader.seek(RandomAccessReader.java:207)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.io.util.FileHandle.createReader(FileHandle.java:150) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.io.sstable.format.SSTableReader.getFileDataInput(SSTableReader.java:1801)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.columniterator.AbstractSSTableIterator.<init>(AbstractSSTableIterator.java:103)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.columniterator.SSTableIterator.<init>(SSTableIterator.java:49)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.io.sstable.format.big.BigTableReader.iterator(BigTableReader.java:72)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.io.sstable.format.big.BigTableReader.iterator(BigTableReader.java:65)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.initializeIterator(UnfilteredRowIteratorWithLowerBound.java:108)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.maybeInit(LazilyInitializedUnfilteredRowIterator.java:48)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.partitionLevelDeletion(LazilyInitializedUnfilteredRowIterator.java:81)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.partitionLevelDeletion(UnfilteredRowIteratorWithLowerBound.java:170)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDiskInternal(SinglePartitionReadCommand.java:762)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDisk(SinglePartitionReadCommand.java:673)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.CounterMutation.updateWithCurrentValuesFromCFS(CounterMutation.java:258)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.CounterMutation.processModifications(CounterMutation.java:202)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.CounterMutation.applyCounterMutation(CounterMutation.java:123)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.service.StorageProxy$9.runMayThrow(StorageProxy.java:1650) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2771)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.service.StorageProxy$3.apply(StorageProxy.java:160) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:1201) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.service.StorageProxy.applyCounterMutationOnLeader(StorageProxy.java:1627)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.db.CounterMutationVerbHandler.doVerb(CounterMutationVerbHandler.java:48)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) 
~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
~[na:1.8.0_292]
        at 
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
 ~[apache-cassandra-3.11.11.jar:3.11.11]
        at 
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
 [apache-cassandra-3.11.11.jar:3.11.11]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:113) 
[apache-cassandra-3.11.11.jar:3.11.11]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_292]
{noformat}
Example of full gc from from gc.log
{noformat}
2021-10-04T15:25:46.467+0000: 494369.005: [Full GC (System.gc())  
727M->726M(20480M), 1.8053704 secs]
   [Eden: 8192.0K(12288.0M)->0.0B(12288.0M) Survivors: 0.0B->0.0B Heap: 
727.3M(20480.0M)->726.2M(20480.0M)], [Metaspace: 53437K->53437K(1099776K)]
Heap after GC invocations=14150 (full 57):
 garbage-first heap   total 20971520K, used 743652K [0x00000002c0000000, 
0x00000002c0805000, 0x00000007c0000000)
  region size 8192K, 0 young (0K), 0 survivors (0K)
 Metaspace       used 53437K, capacity 55970K, committed 56528K, reserved 
1099776K
  class space    used 6522K, capacity 6987K, committed 7120K, reserved 1048576K
}
 [Times: user=2.99 sys=0.00, real=1.81 secs]
2021-10-04T15:25:48.273+0000: 494370.811: Total time for which application 
threads were stopped: 1.8099675 seconds, Stopping threads took: 0.0003245 
seconds
{Heap before GC invocations=14150 (full 57):
 garbage-first heap   total 20971520K, used 743652K [0x00000002c0000000, 
0x00000002c0805000, 0x00000007c0000000)
  region size 8192K, 1 young (8192K), 0 survivors (0K)
 Metaspace       used 53437K, capacity 55970K, committed 56528K, reserved 
1099776K
  class space    used 6522K, capacity 6987K, committed 7120K, reserved 1048576K
{noformat}


> Direct memory OOM 
> ------------------
>
>                 Key: CASSANDRA-17025
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17025
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Zach Toews
>            Priority: Normal
>
> Version: 3.11.11
> Cluster: 9 nodes (1 dc, 3 racks) on aws r5.4xlarge nodes (16 vCPU, 128GB mem)
> Heap size: 20G
> Direct memory buffer: 24G
>  
> We stood up this cluster a few months ago in order to migrate an old 2.0 
> cluster. Since then, after about 7-10 days a node will begin to experience 
> long old gc cycles  before logging many  "java.lang.OutOfMemoryError: Direct 
> buffer memory" exceptions. When the long gc cycles start, the entire cluster 
> becomes unresponsive (our application is unable to make queries to any node).
> Restarting cassandra on the failing node resolves the problem, then we have 
> to restart every other node in the cluster to prevent them from getting into 
> the same state.
>  
> We have attempted to:
>  * Increase -XX:MaxDirectMemorySize
>  * Increase -Djdk.nio.maxCachedBufferSize
>  * Update cassandra from 3.11.10 to 3.11.11
> None of these have resolved the problem.
> Since the last failure, we have increased -XX:MaxDirectMemorySize again, and 
> are waiting to see if that has any effect.
>  
> Old gc collections from system.log:
> {noformat}
> INFO  [Service Thread] 2021-10-04 15:24:04,973 GCInspector.java:285 - G1 Old 
> Generation GC in 4447ms.  Compressed Class Space: 6683064 -> 6677952; G1 Eden 
> Space: 16777216 -> 0; G1 Old Gen: 5431375384 -> 745163912; G1 Survivor Space: 
> 419430400 -> 0; Metaspace: 54716768 ->
> ...
> INFO  [Service Thread] 2021-10-04 15:24:06,985 GCInspector.java:285 - G1 Old 
> Generation GC in 1901ms.  G1 Old Gen: 745163912 -> 745168360;
> ...
> INFO  [Service Thread] 2021-10-04 15:24:09,306 GCInspector.java:285 - G1 Old 
> Generation GC in 2046ms.  G1 Eden Space: 528482304 -> 0; G1 Old Gen: 
> 759785184 -> 761229864;
> ...
> INFO  [Service Thread] 2021-10-04 15:24:14,749 GCInspector.java:285 - G1 Old 
> Generation GC in 5403ms.  G1 Old Gen: 761229864 -> 762168640;
> ...
> INFO  [Service Thread] 2021-10-04 15:24:16,782 GCInspector.java:285 - G1 Old 
> Generation GC in 1949ms.  G1 Old Gen: 762168640 -> 762167640;
> ...
> INFO  [Service Thread] 2021-10-04 15:25:09,406 GCInspector.java:285 - G1 Old 
> Generation GC in 52302ms.  G1 Eden Space: 8388608 -> 0; G1 Old Gen: 762167640 
> -> 762168160;
> ...
> INFO  [Service Thread] 2021-10-04 15:25:15,011 GCInspector.java:285 - G1 Old 
> Generation GC in 5522ms.  G1 Eden Space: 192937984 -> 0; G1 Old Gen: 
> 762168160 -> 770098088;
> ...
> INFO  [Service Thread] 2021-10-04 15:25:31,453 GCInspector.java:285 - G1 Old 
> Generation GC in 16310ms.  G1 Eden Space: 201326592 -> 0; G1 Old Gen: 
> 770098088 -> 769228400;
> ...
> INFO  [Service Thread] 2021-10-04 15:25:33,597 GCInspector.java:285 - G1 Old 
> Generation GC in 1984ms.  G1 Eden Space: 352321536 -> 0; G1 Old Gen: 
> 750824952 -> 751118968;
> ...
> INFO  [Service Thread] 2021-10-04 15:25:50,152 GCInspector.java:285 - G1 Old 
> Generation GC in 16411ms.  G1 Eden Space: 8388608 -> 0; G1 Old Gen: 751118968 
> -> 751645056;
> {noformat}
> Example of direct memory oom from system.log
> {noformat}
> INFO  [ScheduledTasks:1] 2021-10-04 15:25:31,484 MessagingService.java:1246 - 
> READ messages were dropped in last 5000 ms: 2 internal and 7 cross node. Mean 
> internal dropped latency: 47238 ms and Mean cross-node dropped latency: 45171 
> ms
> INFO  [ScheduledTasks:1] 2021-10-04 15:25:31,484 MessagingService.java:1246 - 
> COUNTER_MUTATION messages were dropped in last 5000 ms: 1 internal and 60 
> cross node. Mean internal dropped latency: 7289 ms and Mean cross-node 
> dropped latency: 9509 ms
> ERROR [CounterMutationStage-464] 2021-10-04 15:25:31,484 
> JVMStabilityInspector.java:94 - OutOfMemory error letting the JVM handle the 
> error:
> java.lang.OutOfMemoryError: Direct buffer memory
>         at java.nio.Bits.reserveMemory(Bits.java:695) ~[na:1.8.0_292]
>         at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) 
> ~[na:1.8.0_292]
>         at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) 
> ~[na:1.8.0_292]
>         at 
> org.apache.cassandra.utils.memory.BufferPool.allocate(BufferPool.java:114) 
> ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.utils.memory.BufferPool.access$1000(BufferPool.java:50) 
> ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.utils.memory.BufferPool$LocalPool.allocate(BufferPool.java:408)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.utils.memory.BufferPool$LocalPool.access$000(BufferPool.java:335)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.utils.memory.BufferPool.takeFromPool(BufferPool.java:126)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.utils.memory.BufferPool.get(BufferPool.java:98) 
> ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.cache.ChunkCache.load(ChunkCache.java:156) 
> ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.cache.ChunkCache.load(ChunkCache.java:39) 
> ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalLoadingCache.lambda$new$0(BoundedLocalCache.java:2949)
>  ~[caffeine-2.2.6.jar:na]
>         at 
> com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$15(BoundedLocalCache.java:1807)
>  ~[caffeine-2.2.6.jar:na]
>         at 
> java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1892) 
> ~[na:1.8.0_292]
>         at 
> com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:1805)
>  ~[caffeine-2.2.6.jar:na]
>         at 
> com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:1788)
>  ~[caffeine-2.2.6.jar:na]
>         at 
> com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:97)
>  ~[caffeine-2.2.6.jar:na]
>         at 
> com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:66)
>  ~[caffeine-2.2.6.jar:na]
>         at 
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:236)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:214)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.io.util.RandomAccessReader.reBufferAt(RandomAccessReader.java:65)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.io.util.RandomAccessReader.seek(RandomAccessReader.java:207)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.io.util.FileHandle.createReader(FileHandle.java:150) 
> ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.io.sstable.format.SSTableReader.getFileDataInput(SSTableReader.java:1801)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.db.columniterator.AbstractSSTableIterator.<init>(AbstractSSTableIterator.java:103)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.db.columniterator.SSTableIterator.<init>(SSTableIterator.java:49)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.io.sstable.format.big.BigTableReader.iterator(BigTableReader.java:72)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.io.sstable.format.big.BigTableReader.iterator(BigTableReader.java:65)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.initializeIterator(UnfilteredRowIteratorWithLowerBound.java:108)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.maybeInit(LazilyInitializedUnfilteredRowIterator.java:48)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.partitionLevelDeletion(LazilyInitializedUnfilteredRowIterator.java:81)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.partitionLevelDeletion(UnfilteredRowIteratorWithLowerBound.java:170)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDiskInternal(SinglePartitionReadCommand.java:762)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDisk(SinglePartitionReadCommand.java:673)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.db.CounterMutation.updateWithCurrentValuesFromCFS(CounterMutation.java:258)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.db.CounterMutation.processModifications(CounterMutation.java:202)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.db.CounterMutation.applyCounterMutation(CounterMutation.java:123)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.service.StorageProxy$9.runMayThrow(StorageProxy.java:1650)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2771)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.service.StorageProxy$3.apply(StorageProxy.java:160) 
> ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:1201)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.service.StorageProxy.applyCounterMutationOnLeader(StorageProxy.java:1627)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.db.CounterMutationVerbHandler.doVerb(CounterMutationVerbHandler.java:48)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) 
> ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_292]
>         at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
>  ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
>  [apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:113) 
> [apache-cassandra-3.11.11.jar:3.11.11]
>         at java.lang.Thread.run(Thread.java:748) [na:1.8.0_292]
> {noformat}
> Example of full gc from from gc.log
> {noformat}
> 2021-10-04T15:25:46.467+0000: 494369.005: [Full GC (System.gc())  
> 727M->726M(20480M), 1.8053704 secs]
>    [Eden: 8192.0K(12288.0M)->0.0B(12288.0M) Survivors: 0.0B->0.0B Heap: 
> 727.3M(20480.0M)->726.2M(20480.0M)], [Metaspace: 53437K->53437K(1099776K)]
> Heap after GC invocations=14150 (full 57):
>  garbage-first heap   total 20971520K, used 743652K [0x00000002c0000000, 
> 0x00000002c0805000, 0x00000007c0000000)
>   region size 8192K, 0 young (0K), 0 survivors (0K)
>  Metaspace       used 53437K, capacity 55970K, committed 56528K, reserved 
> 1099776K
>   class space    used 6522K, capacity 6987K, committed 7120K, reserved 
> 1048576K
> }
>  [Times: user=2.99 sys=0.00, real=1.81 secs]
> 2021-10-04T15:25:48.273+0000: 494370.811: Total time for which application 
> threads were stopped: 1.8099675 seconds, Stopping threads took: 0.0003245 
> seconds
> {Heap before GC invocations=14150 (full 57):
>  garbage-first heap   total 20971520K, used 743652K [0x00000002c0000000, 
> 0x00000002c0805000, 0x00000007c0000000)
>   region size 8192K, 1 young (8192K), 0 survivors (0K)
>  Metaspace       used 53437K, capacity 55970K, committed 56528K, reserved 
> 1099776K
>   class space    used 6522K, capacity 6987K, committed 7120K, reserved 
> 1048576K
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to