[ https://issues.apache.org/jira/browse/CASSANDRA-14355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992394#comment-16992394 ]
Chris Kistner edited comment on CASSANDRA-14355 at 12/10/19 10:17 AM:
----------------------------------------------------------------------

We have now experienced an issue that might be related to this, although our Cassandra node has not crashed yet. It just had frequent (roughly every 2 minutes) "ConcurrentMarkSweep GC" events of 16+ seconds, e.g.:
{noformat}
WARN [Service Thread] 2019-12-10 08:03:19,969 GCInspector.java:282 - ConcurrentMarkSweep GC in 19129ms. CMS Old Gen: 7547650016 -> 7547650048; Par Eden Space: 671088640 -> 251798544; Par Survivor Space: 83886048 -> 0
WARN [Service Thread] 2019-12-10 08:03:37,565 GCInspector.java:282 - ConcurrentMarkSweep GC in 16379ms. Par Eden Space: 671088640 -> 254509608; Par Survivor Space: 83886032 -> 0
{noformat}
The GC time sometimes dropped back down to 200ms. After we ran "nodetool drain" and then removed the node from the cluster, the GC time stayed below 250ms.

Our setup is:
* 5 nodes in dc1, 5 nodes in dc2
* RF: dc1=5, dc2=5
* CL = LOCAL_QUORUM
* Hosts with 32GB of RAM; Cassandra allocates 8GB to the heap
* Java version: java-1.8.0-openjdk-1.8.0.151-5.b12
* Cassandra Reaper 4.6.1, where we scheduled a repair with 32 segments/node (364 segments in total)

I have attached some screenshots from our ~11GB heap dump, in which io.netty.util.concurrent.FastThreadLocalThread accounts for 6.4GB of the heap:
* Problem Suspect 1: LongGC_Problem-Suspect-1_FastThreadLocalThread.png
* Dominator Tree: LongGC_Dominator-Tree.png
* Histogram: LongGC_Histogram.png

I have also attached the output of "nodetool status": LongGC_nodetool_info.txt

We have not yet tried Cassandra 3.11.5, which reportedly fixes the repair OOME issue (CASSANDRA-14096).

> Memory leak
> -----------
>
> Key: CASSANDRA-14355
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14355
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Core
> Environment: Debian Jessie, OpenJDK 1.8.0_151
> Reporter: Eric Evans
> Priority: Normal
> Fix For: 3.11.x
>
> Attachments: 01_Screenshot from 2018-04-04 14-24-00.png, 02_Screenshot from 2018-04-04 14-28-33.png, 03_Screenshot from 2018-04-04 14-24-50.png, LongGC_Dominator-Tree.png, LongGC_Histogram.png, LongGC_Problem-Suspect-1_FastThreadLocalThread.png, LongGC_nodetool_info.txt
>
>
> We're seeing regular, frequent
> {{OutOfMemoryError}} exceptions. Similar to
> CASSANDRA-13754, an analysis of the heap dumps shows the heap consumed by the
> {{threadLocals}} member of the instances of
> {{io.netty.util.concurrent.FastThreadLocalThread}}.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)