[jira] [Comment Edited] (CASSANDRA-9549) Memory leak

2015-06-15 Thread Ivar Thorson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586335#comment-14586335
 ] 

Ivar Thorson edited comment on CASSANDRA-9549 at 6/15/15 7:57 PM:
--

As another data point, we upgraded our servers to 2.1.6 and see the same issue.


was (Author: ivar.thorson):
As another data point, we upgraded our servers to 5.1.6 and see the same issue.

> Memory leak 
> 
>
> Key: CASSANDRA-9549
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9549
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: Cassandra 2.1.5. 9 node cluster in EC2 (m1.large nodes, 
> 2 cores 7.5G memory, 800G platter for cassandra data, root partition and 
> commit log are on SSD EBS with sufficient IOPS), 3 nodes/availability zone, 1 
> replica/zone
> JVM: /usr/java/jdk1.8.0_40/jre/bin/java 
> JVM Flags besides CP: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar 
> -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities 
> -XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn200M 
> -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=103 
> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled 
> -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 
> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly 
> -XX:+UseTLAB -XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler 
> -XX:CMSWaitDuration=1 -XX:+CMSParallelInitialMarkEnabled 
> -XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=1 -XX:+UseCondCardMark 
> -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 
> -Dcom.sun.management.jmxremote.rmi.port=7199 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra 
> -Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid 
> Kernel: Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
> Reporter: Ivar Thorson
> Priority: Critical
> Fix For: 2.1.x
>
> Attachments: c4_system.log, c7fromboot.zip, cassandra.yaml, 
> cpu-load.png, memoryuse.png, ref-java-errors.jpeg, suspect.png, two-loads.png
>
>
> We have been experiencing a severe memory leak with Cassandra 2.1.5 that, 
> over the period of a couple of days, eventually consumes all of the available 
> JVM heap space, putting the JVM into GC hell where it keeps trying CMS 
> collection but can't free up any heap space. This pattern happens for every 
> node in our cluster and requires rolling Cassandra restarts just to keep 
> the cluster running. We upgraded the cluster from the 2.0 branch per the 
> DataStax docs a couple of months ago and had been using the data from this 
> cluster for more than a year without problems.
> As the heap fills up with non-GC-able objects, the CPU/OS load average grows 
> along with it. Heap dumps reveal an increasing number of 
> java.util.concurrent.ConcurrentLinkedQueue$Node objects. We took heap dumps 
> over a 2 day period, and watched the number of Node objects go from 4M, to 
> 19M, to 36M, and eventually about 65M objects before the node stops 
> responding. The screen capture of our heap dump is from the 19M measurement.
> Load on the cluster is minimal. We can see this effect even with only a 
> handful of writes per second (see the attached OpsCenter snapshots during 
> very light and heavier loads); even with only 5 reads per second we see this 
> behavior.
> Log files show repeated errors at Ref.java:181 and Ref.java:279 and "LEAK 
> DETECTED" messages:
> {code}
> ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Error 
> when closing class 
> org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31 
> rejected from 
> org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d6[Terminated,
>  pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1644]
> {code}
> {code}
> ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK 
> DETECTED: a reference 
> (org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) to class 
> org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:/data2/data/ourtablegoeshere-ka-1151
>  was not released before the reference was garbage collected
> {code}
> This might be related to [CASSANDRA-8723]?
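
For reference, the heap dumps described above can also be captured from inside 
the JVM via the HotSpot diagnostic MXBean instead of an external jmap call, 
which makes it easy to script the periodic snapshots used to track the 
ConcurrentLinkedQueue$Node counts. This is only a minimal sketch of that 
standard JDK API; the class name and output path are illustrative:

{code}
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;

// Sketch: write a timestamped .hprof snapshot so instance counts (e.g. of
// ConcurrentLinkedQueue$Node) can be compared across runs in a tool like MAT.
public class HeapDumper {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // 'true' restricts the dump to live (reachable) objects, which is what
        // matters when looking for strongly referenced leaked nodes.
        diag.dumpHeap("/tmp/heap-" + System.currentTimeMillis() + ".hprof", true);
    }
}
{code}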



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-9549) Memory leak

2015-06-04 Thread Ivar Thorson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573595#comment-14573595
 ] 

Ivar Thorson edited comment on CASSANDRA-9549 at 6/4/15 10:58 PM:
--

We have tried increasing the JVM heap size slightly to 3G, but we see the same 
issues. We cannot increase the heap size much more before reaching an 
unreasonably large fraction of total system memory (7.5G). We are not doing 
extensive deletions or overwrites. 

The log exceeds 20M when compressed; I will try to cut that down a bit and find 
the start point.
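
To see whether a larger heap merely delays the climb, heap occupancy and CMS 
activity can be logged over time with the standard java.lang.management API. 
A minimal standalone sketch (the class name and one-minute interval are 
illustrative, not anything Cassandra ships):

{code}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Sketch: print heap usage plus per-collector counts/times once a minute so
// the slow climb toward the heap ceiling can be correlated with system.log.
public class HeapWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            StringBuilder line = new StringBuilder(String.format(
                    "heap used=%dMB max=%dMB", heap.getUsed() >> 20, heap.getMax() >> 20));
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // With -XX:+UseParNewGC -XX:+UseConcMarkSweepGC these report
                // as "ParNew" and "ConcurrentMarkSweep".
                line.append(String.format(" | %s count=%d time=%dms",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
            }
            System.out.println(line);
            Thread.sleep(60_000L);
        }
    }
}
{code}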


was (Author: ivar.thorson):
We have tried increasing JVM heap size slightly to 3G, but we see the same 
issues. We cannot increase the heap size much more before reaching an 
unreasonably large fraction of total system memory (7.5G). We are not doing 
extensive deletions or overwrites. 

The log exceeds 20M when compressed, try to cut that down a bit and find the 
start point.


[jira] [Comment Edited] (CASSANDRA-9549) Memory leak

2015-06-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573567#comment-14573567
 ] 

Benedict edited comment on CASSANDRA-9549 at 6/4/15 9:14 PM:
-

Thanks. Unfortunately that does not seem to be the complete log history. It 
would help a great deal to have logs from when the node actually started up.

I can make an educated guess, though: it looks like the node was OOMing due to 
normal operational reasons (or perhaps some other issue, we cannot say), and we 
recently modified behaviour in this scenario to trigger a shutdown of the host. 
Unfortunately, it seems that the OOM is somehow delaying the shutdown from 
completing, or perhaps there is some other issue. Certainly the JVM thinks it 
is shutting down.

The strange thing is that the shutdown hook must still have been run, since 
that is the only way the executor service could have been shut down, yet we 
ask for the shutdown hook to be removed in this event. [~JoshuaMcKenzie], any 
ideas?
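
For context on the rejection itself, the CompactionExecutor error quoted in the 
description is the plain JDK behaviour of a ScheduledThreadPoolExecutor that has 
already been shut down, for example from a shutdown hook. A minimal standalone 
sketch of that behaviour (not Cassandra's actual shutdown path; the class and 
thread names are made up):

{code}
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ShutdownHookDemo {
    private static final ScheduledThreadPoolExecutor EXECUTOR =
            new ScheduledThreadPoolExecutor(1);

    public static void main(String[] args) {
        // Register a hook that tears the executor down when the JVM shuts down.
        Thread hook = new Thread(EXECUTOR::shutdownNow, "demo-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);

        // Removing the hook again (as described above) is only possible before
        // JVM shutdown has actually begun.
        boolean removed = Runtime.getRuntime().removeShutdownHook(hook);
        System.out.println("hook removed before shutdown: " + removed);

        // Simulate the hook having run anyway: once the executor is shut down,
        // any further schedule() call is rejected, matching the quoted error.
        EXECUTOR.shutdownNow();
        try {
            EXECUTOR.schedule(() -> System.out.println("never runs"), 1, TimeUnit.SECONDS);
        } catch (RejectedExecutionException e) {
            System.out.println("rejected after termination: " + e);
        }
    }
}
{code}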

More complete logs would help us.

Increasing your heap space may fix the underlying problem, but it may also be 
that another underlying issue is causing your heap to explode; to establish 
that we would need a heap dump taken during one of these events. If, however, 
you make extensive use of CQL row deletions, or of CQL collections with 
overwrites of the entire collection, you may be encountering CASSANDRA-9486, 
for which a patch is available and which will be fixed in 2.1.6, to be 
released shortly.


was (Author: benedict):
Thanks. Unfortunately that does not seem to be the complete log history. It 
would help a great deal to have logs from when the node actually started up.

I can make an educated guess, though: it looks like the node was OOMing due to 
normal operational reasons (or perhaps some other issue, we cannot say), and we 
recently modified behaviour in this scenario to trigger a shutdown of the host. 
Unfortunately, it seems that the OOM is somehow delaying the shutdown from 
completing, or perhaps there is some other issue. Certainly the JVM thinks it 
is shutting down.

The strange thing is that the shutdown hook must still have been run, since 
that is the only way the executor service could be shutdown, only we ask the 
shutdown hook to be removed in this event.

More complete logs would help us.

Increasing your heap space may fix the underlying problem. It may be that there 
is another underlying issue causing your heap to explode. To establish this we 
would need a heap dump during one of these events. If, however, you make 
extensive use of CQL row deletions, or CQL collections and perform overwrites 
of the entire collection, it may be that you are encountering CASSANDRA-9486, 
in which case a patch is available for that, and will be fixed in 2.1.6 to be 
released shortly.
