[jira] [Commented] (CASSANDRA-9549) Memory leak in Ref.GlobalState due to pathological ConcurrentLinkedQueue.remove behaviour
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238615#comment-15238615 ] stone commented on CASSANDRA-9549: -- @Benedict Thanks for your answer, I understand now. I opened a ticket, https://issues.apache.org/jira/browse/CASSANDRA-11460; at first I thought it was the same as this one, but I now realize I was mistaken. That ticket has been open for about two weeks with no response yet. Could you take a look?

> Memory leak in Ref.GlobalState due to pathological ConcurrentLinkedQueue.remove behaviour
> ------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-9549
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9549
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 2.1.5. 9 node cluster in EC2 (m1.large nodes, 2 cores 7.5G memory, 800G platter for cassandra data, root partition and commit log are on SSD EBS with sufficient IOPS), 3 nodes/availability zone, 1 replica/zone
> JVM: /usr/java/jdk1.8.0_40/jre/bin/java
> JVM Flags besides CP: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=103 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler -XX:CMSWaitDuration=1 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=1 -XX:+UseCondCardMark -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.rmi.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid
> Kernel: Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Ivar Thorson
>            Assignee: Benedict
>            Priority: Critical
>             Fix For: 2.1.7
>
>         Attachments: c4_system.log, c7fromboot.zip, cassandra.yaml, cpu-load.png, memoryuse.png, ref-java-errors.jpeg, suspect.png, two-loads.png
>
> We have been experiencing a severe memory leak with Cassandra 2.1.5 that, over the period of a couple of days, eventually consumes all of the available JVM heap space, putting the JVM into GC hell where it keeps trying CMS collection but can't free up any heap space. This pattern happens for every node in our cluster and is requiring rolling cassandra restarts just to keep the cluster running. We upgraded the cluster per the Datastax docs from the 2.0 branch a couple of months ago and had been using the data from this cluster for more than a year without problem.
> As the heap fills up with non-GC-able objects, the CPU/OS load average grows along with it. Heap dumps reveal an increasing number of java.util.concurrent.ConcurrentLinkedQueue$Node objects. We took heap dumps over a 2 day period, and watched the number of Node objects go from 4M, to 19M, to 36M, and eventually to about 65M objects before the node stops responding. The screen capture of our heap dump is from the 19M measurement.
> Load on the cluster is minimal. We can see this effect even with only a handful of writes per second (see attachments for Opscenter snapshots during very light loads and heavier loads). Even with only 5 reads a sec we see this behavior.
> Log files show repeated errors at Ref.java:181 and Ref.java:279 and "LEAK DETECTED" messages:
> {code}
> ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Error when closing class org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150
> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d6[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1644]
> {code}
> {code}
> ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) to class org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:/data2/data/ourtablegoeshere-ka-1151 was not released before the reference was garbage collected
> {code}
> This might be related to [CASSANDRA-8723]?
[jira] [Commented] (CASSANDRA-9549) Memory leak in Ref.GlobalState due to pathological ConcurrentLinkedQueue.remove behaviour
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233832#comment-15233832 ] Benedict commented on CASSANDRA-9549: - What is obtuse?

bq. how to resolve?

Move to a version >= fixVersion, i.e. 2.1.7.

bq. why this happen

The [last comment with more than one sentence|https://issues.apache.org/jira/browse/CASSANDRA-9549?focusedCommentId=14586587&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14586587], only six comments back, spells out what happened and why. I realise JIRA noise can be quite an issue in many cases, but in this instance it seems to me that just a modicum of effort was necessary to find the answers you sought.
[jira] [Commented] (CASSANDRA-9549) Memory leak in Ref.GlobalState due to pathological ConcurrentLinkedQueue.remove behaviour
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233823#comment-15233823 ] stone commented on CASSANDRA-9549: -- Could you post a summary after resolving the issue? Why did this happen, and how was it resolved? It's hard for people who hit the same issue to find the answer.
[jira] [Commented] (CASSANDRA-9549) Memory leak in Ref.GlobalState due to pathological ConcurrentLinkedQueue.remove behaviour
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968882#comment-14968882 ] Maxim Podkolzine commented on CASSANDRA-9549: - Is this bug fixed in Cassandra 2.2.0?
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589683#comment-14589683 ] Marcus Eriksson commented on CASSANDRA-9549: - +1
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589537#comment-14589537 ] Benedict commented on CASSANDRA-9549: - I've added a regression test to the branch. Could I get a reviewer please, and can we ship this soon?
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589997#comment-14589997 ] Ivar Thorson commented on CASSANDRA-9549: - We patched our 2.1.6 cluster on Wednesday and let it run for a day to let things accumulate. Looking at CPU activity and heap space for the last day suggests that the memory leak seems to have been fixed by the patch. Awesome work!
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1458#comment-1458 ] Benedict commented on CASSANDRA-9549: - Great, glad to hear it.
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586425#comment-14586425 ] Jeff Jirsa commented on CASSANDRA-9549: --- Throwing a me-too here, copying a summary from IRC (on the topic of 2.1.6 showing weird memory behavior that feels like a leak). The other user was also using DTCS:

11:07 jeffj opened CASSANDRA-9597 last night. dtcs + streaming = lots of sstables that won't compact efficiently and eventually (days after load is stopped) nodes end up ooming or in gc hell.
11:08 jeffj in our case, the PROBLEM is that sstables build up over time due to the weird way dtcs is selecting candidates to compact, but the symptom is very very very long gc pauses and eventual ooms.
11:10 jeffj i would very much believe there's a leak somewhere in 2.1.6. in our case, we saw the same behavior in 2.1.5, so i don't think it's a single minor version regression.
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586460#comment-14586460 ] Robbie Strickland commented on CASSANDRA-9549: -- We also experience this issue on 2.1.5, and we are also running DTCS.
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586335#comment-14586335 ] Ivar Thorson commented on CASSANDRA-9549: - As another data point, we upgraded our servers to 2.1.6 and see the same issue.
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586549#comment-14586549 ] Benedict commented on CASSANDRA-9549: - Sorry for the slow response. This one slipped off my work queue. I've pushed a fix [here|https://github.com/belliottsmith/cassandra/tree/9549]. The problem is that I made erroneous assumptions about the behaviour of CLQ on remove (I've read too many CLQ implementations to keep them all straight, I guess): on remove, it does not unlink the node it has removed; it only sets the item to null. This means we accumulate the CLQ nodes for the whole lifetime of the Ref (in this case an sstable). DTCS obviously exacerbates this by ensuring sstable lifetimes are infinite. This patch simply swaps that to a CLDeque. This has some undesirable properties, so we should probably hasten CASSANDRA-9379, which would have prevented this and will generally improve our management of Ref instances. I've also filed a follow-up ticket, CASSANDRA-9600, which would have mitigated this.
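To make the shape of that swap concrete, here is a minimal sketch. The class, field, and method names below are assumptions for illustration only, not Cassandra's actual Ref.GlobalState code; the point is just that ConcurrentLinkedDeque.remove unlinks the node it deletes, so releasing a reference no longer strands an empty node.

{code}
import java.util.concurrent.ConcurrentLinkedDeque;

// Illustrative sketch only -- names are invented, not Cassandra's actual code.
final class GlobalStateSketch
{
    // before the patch: a ConcurrentLinkedQueue, whose remove() could leave
    // the deleted node linked (with its item nulled) for the queue's lifetime
    private final ConcurrentLinkedDeque<Object> refStates = new ConcurrentLinkedDeque<>();

    void register(Object refState)
    {
        refStates.add(refState); // a new per-Ref state is appended on ref()
    }

    void release(Object refState)
    {
        // ConcurrentLinkedDeque.remove() unlinks the matched node, so the
        // per-sstable collection no longer accumulates dead nodes on release
        refStates.remove(refState);
    }
}
{code}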
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586556#comment-14586556 ] Benedict commented on CASSANDRA-9549: - Actually, scratch that... it does look like CLQ should remove the node. And yet it isn't doing so, if the heap dump is to be believed. I suspect the patched branch will fix the problem, but I will see if I can puzzle out a plausible mechanism by which the nodes are accumulating.
Log files show repeated errors in Ref.java:181 and Ref.java:279 and LEAK detected messages: {code} ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Error when closing class org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150 java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d6[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1644] {code} {code} ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) to class org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:/data2/data/ourtablegoeshere-ka-1151 was not released before the reference was garbage collected {code} This might be related to [CASSANDRA-8723]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586587#comment-14586587 ] Benedict commented on CASSANDRA-9549: - Ahhh. So, there is a pathological case in CLQ.remove. If the item you delete was the last to be inserted, it will not expunge the node. However, it also does not expunge any deleted items en route to the end. So, if you retain the first to be inserted, and you always delete the last, you get an infinitely growing, but completely empty, middle of the CLQ. This is pretty easily avoided, so might be worth an upstream patch to the JDK. However, for now, the patch I uploaded should fix the problem (which I'm more confident of, now there is an explanatory framework), and CASSANDRA-9379 remains the correct follow-up to ensure no pathological list behaviours (e.g. with lots of extant Ref instances).
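[Editor's note] To make the mechanism concrete, here is a small, self-contained sketch of the access pattern described above. It is illustrative only: the unbounded growth occurs on JDKs predating the upstream fix to ConcurrentLinkedQueue.remove (the JDK 8 vintage this ticket was filed against), the reflective node counting assumes OpenJDK's private field names (head, next), and the class name is hypothetical:
{code}
import java.lang.reflect.Field;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ClqRemoveLeakDemo {
    public static void main(String[] args) throws Exception {
        ConcurrentLinkedQueue<Object> queue = new ConcurrentLinkedQueue<>();
        Object first = new Object();
        queue.add(first); // retained: plays the role of the long-lived Ref

        for (int i = 0; i < 100_000; i++) {
            Object last = new Object();
            queue.add(last);    // always insert at the tail...
            queue.remove(last); // ...and always delete the last element inserted
        }

        // Logical size is 1, but on affected JDKs remove() never unlinks the
        // dead middle nodes, so the physical chain keeps growing.
        System.out.println("logical size:   " + queue.size());
        System.out.println("physical nodes: " + countNodes(queue));
    }

    // Diagnostic only: walks the private node chain via reflection. Field names
    // match OpenJDK's implementation; on JDK 9+ this needs
    // --add-opens java.base/java.util.concurrent=ALL-UNNAMED.
    static int countNodes(ConcurrentLinkedQueue<?> queue) throws Exception {
        Field headField = ConcurrentLinkedQueue.class.getDeclaredField("head");
        headField.setAccessible(true);
        Object node = headField.get(queue);
        Field nextField = node.getClass().getDeclaredField("next");
        nextField.setAccessible(true);
        int count = 0;
        while (node != null) {
            count++;
            Object next = nextField.get(node);
            if (next == node) break; // self-linked nodes are off-list
            node = next;
        }
        return count;
    }
}
{code}
On a JDK containing the upstream fix the two numbers stay close; on affected versions the physical node count grows with the iteration count even though the queue is logically almost empty.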
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577420#comment-14577420 ] Ivar Thorson commented on CASSANDRA-9549: - https://drive.google.com/a/whibse.com/file/d/0BxS4YrlxXzqAODNaTHBqY2ZGZlE/view?usp=sharing
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575682#comment-14575682 ] Benedict commented on CASSANDRA-9549: - Wherever is convenient for you to put it.
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574121#comment-14574121 ] Benedict commented on CASSANDRA-9549: - That still seems to be missing the usual startup log messages, and must have been running for some time since the CompactionExecutor and MemtableFlusher pools both have thread ids above 100. It looks like it is already under significant heap pressure at that time. Unfortunately it is very hard to say why, likely even with the complete logs. At this point we really need a heap dump to analyse.
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574905#comment-14574905 ] Joshua McKenzie commented on CASSANDRA-9549: - CASSANDRA-8092 is still open. We have quite a few more swallowed exceptions since last I went through the code-base and fixed them:
{noformat}
Total caught and rethrown as something other than Runtime: 82
Total caught and rethrown as Runtime: 68
Total Swallowed: 40
Total delegated to JVMStabilityInspector: 66
Total 'catch (Throwable ...)' analyzed: 79
Total 'catch (Exception ...)' analyzed: 177
Total catch clauses analyzed: 256
{noformat}
So in this instance, I wouldn't bank on the shutdown hook having been unregistered.
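[Editor's note] As a sketch of the catch-and-delegate pattern those tallies count. The inspectThrowable method here is a hypothetical stand-in for Cassandra's JVMStabilityInspector, whose real logic is considerably more involved; the class name and doWork are likewise illustrative:
{code}
public class CatchDelegationDemo {
    // Hypothetical stand-in for Cassandra's JVMStabilityInspector: decide
    // whether a caught Throwable is survivable before continuing.
    static void inspectThrowable(Throwable t) {
        if (t instanceof OutOfMemoryError) {
            t.printStackTrace();
            Runtime.getRuntime().halt(1); // unrecoverable: stop the VM outright
        }
    }

    static void doWork() { throw new IllegalStateException("boom"); }

    public static void main(String[] args) {
        try {
            doWork();
        } catch (Throwable t) {
            inspectThrowable(t);           // delegate rather than swallow silently
            throw new RuntimeException(t); // "caught and rethrown as Runtime"
        }
    }
}
{code}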
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574811#comment-14574811 ] Ivar Thorson commented on CASSANDRA-9549: - Sorry, I had difficulty figuring out where the log starts because I've been working from a large, concatenated file, and keep mixing UTC and PST time zones. I uploaded c7fromboot.zip, which seems to start from the right place.
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575430#comment-14575430 ] Ivar Thorson commented on CASSANDRA-9549: - I'd be happy to provide a heap dump, but even zipped it's 200MB. FTP? Google drive?
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573595#comment-14573595 ] Ivar Thorson commented on CASSANDRA-9549: - We have tried increasing the JVM heap size slightly, to 3G, but we see the same issues. We cannot increase the heap size much more before reaching an unreasonably large fraction of total system memory (7.5G). We are not doing extensive deletions or overwrites. The log exceeds 20M when compressed; we'll try to cut that down a bit and find the start point.
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573528#comment-14573528 ] Ivar Thorson commented on CASSANDRA-9549: - Log file uploaded. We're running the datastax rpms and restarting with {{service cassandra restart}}.
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573538#comment-14573538 ] Benedict commented on CASSANDRA-9549: - Thanks. This error specifically is related to that change, but the underlying cause is most likely not. With the full log file we can probably glean enough information to suppress this _presentation_ of the problem, but the service would still be shut down while the system is running, and this would eventually lead to other problems.
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573640#comment-14573640 ] Ivar Thorson commented on CASSANDRA-9549: - Uploaded a new log for our c7 node, after spending time finding when the node was last restarted. Let me know if I am still truncating the log at the wrong points.
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573567#comment-14573567 ] Benedict commented on CASSANDRA-9549: - Thanks. Unfortunately that does not seem to be the complete log history. It would help a great deal to have logs from when the node actually started up. I can make an educated guess, though: it looks like the node was OOMing due to normal operational reasons (or perhaps some other issue, we cannot say), and we recently modified behaviour in this scenario to trigger a shutdown of the host. Unfortunately, it seems that the OOM is somehow delaying the shutdown from completing, or perhaps there is some other issue. Certainly the JVM thinks it is shutting down. The strange thing is that the shutdown hook must still have been run, since that is the only way the executor service could be shutdown, only we ask the shutdown hook to be removed in this event. More complete logs would help us. Increasing your heap space may fix the underlying problem. It may be that there is another underlying issue causing your heap to explode. To establish this we would need a heap dump during one of these events. If, however, you make extensive use of CQL row deletions, or CQL collections and perform overwrites of the entire collection, it may be that you are encountering CASSANDRA-9486, in which case a patch is available for that, and it will be fixed in 2.1.6, to be released shortly.
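[Editor's note] For reference, the register/unregister dance described above goes through the plain JDK Runtime API. A minimal sketch, with the hook body standing in for Cassandra's drain-on-exit logic (which is more elaborate) and a hypothetical class name:
{code}
public class ShutdownHookDemo {
    public static void main(String[] args) {
        Thread hook = new Thread(() -> System.out.println("hook: draining services"));
        Runtime.getRuntime().addShutdownHook(hook);

        // An explicit drain unregisters the hook first so the same teardown
        // cannot run twice. removeShutdownHook returns false if the hook was
        // never registered, and throws IllegalStateException once VM shutdown
        // has already begun.
        boolean removed = Runtime.getRuntime().removeShutdownHook(hook);
        System.out.println("unregistered: " + removed);
    }
}
{code}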
[jira] [Commented] (CASSANDRA-9549) Memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573456#comment-14573456 ] Benedict commented on CASSANDRA-9549: - It's possible there is a script in their envrionment running periodically, asking the servers to drain. There are really very few ways for that executor service to be shutdown (assuming it's the executor submitted to inside of the method throwing the REE; it's hard to say with absolute certainty because the stack trace has been compressed due to the frequency of the error generation): the shutdown hook indicating the VM is terminating, or the drain() command. As I said, though: more info, means we can say with greater certainty. That full log history since restart would be a great start. A thread dump would be the natural follow on if that was not sufficiently helpful. Memory leak Key: CASSANDRA-9549 URL: https://issues.apache.org/jira/browse/CASSANDRA-9549 Project: Cassandra Issue Type: Bug Components: Core Environment: Cassandra 2.1.5. 9 node cluster in EC2 (m1.large nodes, 2 cores 7.5G memory, 800G platter for cassandra data, root partition and commit log are on SSD EBS with sufficient IOPS), 3 nodes/availablity zone, 1 replica/zone JVM: /usr/java/jdk1.8.0_40/jre/bin/java JVM Flags besides CP: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=103 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler -XX:CMSWaitDuration=1 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=1 -XX:+UseCondCardMark -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.rmi.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid Kernel: Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux Reporter: Ivar Thorson Priority: Critical Fix For: 2.1.x Attachments: cassandra.yaml, cpu-load.png, memoryuse.png, suspect.png, two-loads.png We have been experiencing a severe memory leak with Cassandra 2.1.5 that, over the period of a couple of days, eventually consumes all of the available JVM heap space, putting the JVM into GC hell where it keeps trying CMS collection but can't free up any heap space. This pattern happens for every node in our cluster and is requiring rolling cassandra restarts just to keep the cluster running. We have upgraded the cluster per Datastax docs from the 2.0 branch a couple of months ago and have been using the data from this cluster for more than a year without problem. As the heap fills up with non-GC-able objects, the CPU/OS load average grows along with it. Heap dumps reveal an increasing number of java.util.concurrent.ConcurrentLinkedQueue$Node objects. We took heap dumps over a 2 day period, and watched the number of Node objects go from 4M, to 19M, to 36M, and eventually about 65M objects before the node stops responding. The screen capture of our heap dump is from the 19M measurement. 
From the issue description: Even with only 5 reads a sec we see this behavior. Log files show repeated errors at Ref.java:181 and Ref.java:279, and LEAK DETECTED messages: {code} ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Error when closing class org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150 java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d6[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1644] {code} {code} ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) to class org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:/data2/data/ourtablegoeshere-ka-1151 was not released before the reference was garbage collected {code} This might be related to [CASSANDRA-8723]?
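As background on the LEAK DETECTED line: Cassandra's Ref machinery tracks each reference with a phantom reference, and a dedicated Reference-Reaper thread reports any reference whose referent was garbage collected before release() was called. The following is a simplified, illustrative sketch of that general pattern (not Cassandra's actual implementation; all names here are made up), to show why an unreleased reference surfaces on a Reference-Reaper thread:
{code}
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a stripped-down analogue of the phantom-reference
// pattern behind the Reference-Reaper thread.
class LeakDetector
{
    private static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
    // Strong references to the trackers themselves, so they survive until reaped.
    private static final Set<Tracked> LIVE = ConcurrentHashMap.newKeySet();

    static final class Tracked extends PhantomReference<Object>
    {
        volatile boolean released;

        Tracked(Object referent)
        {
            super(referent, QUEUE);
            LIVE.add(this);
        }

        // Callers must invoke this before dropping the referent.
        void release()
        {
            released = true;
            LIVE.remove(this);
            clear();
        }
    }

    static
    {
        Thread reaper = new Thread(() ->
        {
            while (true)
            {
                try
                {
                    // A tracker is enqueued once its referent has been GC'd.
                    Tracked t = (Tracked) QUEUE.remove();
                    if (!t.released)
                        System.err.println("LEAK DETECTED: reference was not released before GC");
                    LIVE.remove(t);
                }
                catch (InterruptedException e)
                {
                    return;
                }
            }
        }, "Reference-Reaper");
        reaper.setDaemon(true);
        reaper.start();
    }
}
{code}
In the real implementation the bookkeeping equivalent to LIVE sits in Ref.GlobalState and is backed by a ConcurrentLinkedQueue; per the ticket title, remove() on that queue can leave unlinked Node objects behind, which is consistent with the tens of millions of ConcurrentLinkedQueue$Node objects observed in the heap dumps.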
[jira] [Commented] (CASSANDRA-9549) Memory leak in Ref.GlobalState due to pathological ConcurrentLinkedQueue.remove behaviour
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573492#comment-14573492 ] Ivar Thorson commented on CASSANDRA-9549: - Our sysadmin has been doing a drain just before restarting, but it is not periodic. The only periodic crontab command is a weekly repair of each node, done in a rolling fashion. We looked for a correlation between it and this memory leak problem and found none. Is there something else that would cause this drain-like behavior?
[jira] [Commented] (CASSANDRA-9549) Memory leak in Ref.GlobalState due to pathological ConcurrentLinkedQueue.remove behaviour
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573400#comment-14573400 ] Benedict commented on CASSANDRA-9549: - Looks like you've called drain(), but the server is still up and trying to do work... A full system log (back to node startup) could help, but this situation should be pretty atypical. Restarting the node should be enough to correct it.
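A minimal, self-contained illustration of why a drained-but-still-working server produces exactly the logged error (a sketch using the plain JDK ScheduledThreadPoolExecutor rather than Cassandra's DebuggableScheduledThreadPoolExecutor wrapper; the class and task here are made up): once the executor has been shut down, any task submitted afterwards is rejected outright.
{code}
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectedAfterShutdown
{
    public static void main(String[] args)
    {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);

        // Simulate what drain() (or a VM shutdown hook) does to the service.
        executor.shutdown();

        try
        {
            // Any task scheduled after shutdown is rejected, matching the
            // RejectedExecutionException signature in the attached logs.
            executor.schedule(() -> System.out.println("tidy"), 1, TimeUnit.SECONDS);
        }
        catch (RejectedExecutionException e)
        {
            // e.g. "Task ... rejected from ...[Terminated, pool size = 0, ...]"
            System.err.println(e);
        }
    }
}
{code}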
[jira] [Commented] (CASSANDRA-9549) Memory leak in Ref.GlobalState due to pathological ConcurrentLinkedQueue.remove behaviour
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573404#comment-14573404 ] Philip Thompson commented on CASSANDRA-9549: - The original description says it's happening for every node in the cluster, and that they've all been restarted.
[jira] [Commented] (CASSANDRA-9549) Memory leak in Ref.GlobalState due to pathological ConcurrentLinkedQueue.remove behaviour
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573506#comment-14573506 ] Benedict commented on CASSANDRA-9549: - Without the log file there is very little more I can tell you. The only two places the ES is explicitly shut down are:
# a drain; and
# the VM executing its shutdown hooks
The only two places a drain occurs are:
# via NodeTool drain
# receipt of a gossip remove node message (which should, by my understanding, only be triggered by a NodeTool remove command)
It's possible something else is awry, but we have very little information to work with. Is it possible you are running an embedded Cassandra, so that the Cassandra instance restarts without the JVM restarting? Or is it possible you are draining more nodes than you intend to during the restart process?
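To make the second shutdown path above concrete: a JVM shutdown hook is just a thread registered with the runtime that executes when the VM begins terminating. A hypothetical sketch (names invented) of how such a hook leaves an executor in the Terminated state that then rejects the tidier tasks:
{code}
import java.util.concurrent.ScheduledThreadPoolExecutor;

public class ShutdownHookPath
{
    static final ScheduledThreadPoolExecutor EXECUTOR = new ScheduledThreadPoolExecutor(1);

    public static void main(String[] args)
    {
        // When the VM starts terminating it runs its shutdown hooks; a hook
        // like this moves the executor to Terminated, after which any late
        // submission fails with RejectedExecutionException.
        Runtime.getRuntime().addShutdownHook(new Thread(EXECUTOR::shutdownNow));
    }
}
{code}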
[jira] [Commented] (CASSANDRA-9549) Memory leak in Ref.GlobalState due to pathological ConcurrentLinkedQueue.remove behaviour
[ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573513#comment-14573513 ] Ivar Thorson commented on CASSANDRA-9549: - I'll look at getting the log and thread dump. Is this related to the changes for [CASSANDRA-8707]?
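For the thread dump, running jstack against the Cassandra PID (or sending the process SIGQUIT with kill -3, which prints the dump to the logs) is the usual route. The same information is available through the ThreadMXBean API that jstack relies on; a small illustrative sketch, which dumps the threads of the JVM it runs in:
{code}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDump
{
    public static void main(String[] args)
    {
        // Dumps all threads of this JVM, including lock/monitor ownership --
        // essentially what jstack prints (note that ThreadInfo.toString()
        // truncates deep stack traces).
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true))
            System.out.print(info);
    }
}
{code}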