[ 
https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586549#comment-14586549
 ] 

Benedict commented on CASSANDRA-9549:
-------------------------------------

Sorry for the slow response. This one slipped off my work queue. I've pushed a 
fix [here|https://github.com/belliottsmith/cassandra/tree/9549]. The problem is 
that I made erroneous assumptions about the behaviour of CLQ on remove (I've 
read too many CLQ implementations to keep them all straight, I guess). The 
problem is that on remove, it does not unlink the node it has removed from, it 
only sets the item to null. This means we accumulate the CLQ nodes for the 
whole lifetime of the Ref (in this case an sstable). DTCS obviously exacerbates 
this by ensuring sstable lifetimes are infinite.

This patch simply swaps that to a CLDeque. This has some undesirable 
properties, so we should probably hasten CASSANDRA-9379. This would have 
prevented this, and will generally improve our management of Ref instances.

I've also filed a follow up ticket, CASSANDRA-9600, which would have mitigated 
this.


> Memory leak 
> ------------
>
>                 Key: CASSANDRA-9549
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9549
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cassandra 2.1.5. 9 node cluster in EC2 (m1.large nodes, 
> 2 cores 7.5G memory, 800G platter for cassandra data, root partition and 
> commit log are on SSD EBS with sufficient IOPS), 3 nodes/availablity zone, 1 
> replica/zone
> JVM: /usr/java/jdk1.8.0_40/jre/bin/java 
> JVM Flags besides CP: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar 
> -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities 
> -XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn200M 
> -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 
> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled 
> -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 
> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly 
> -XX:+UseTLAB -XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler 
> -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled 
> -XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=10000 -XX:+UseCondCardMark 
> -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 
> -Dcom.sun.management.jmxremote.rmi.port=7199 
> -Dcom.sun.management.jmxremote.ssl=false 
> -Dcom.sun.management.jmxremote.authenticate=false 
> -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra 
> -Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid 
> Kernel: Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Ivar Thorson
>            Priority: Critical
>             Fix For: 2.1.x
>
>         Attachments: c4_system.log, c7fromboot.zip, cassandra.yaml, 
> cpu-load.png, memoryuse.png, ref-java-errors.jpeg, suspect.png, two-loads.png
>
>
> We have been experiencing a severe memory leak with Cassandra 2.1.5 that, 
> over the period of a couple of days, eventually consumes all of the available 
> JVM heap space, putting the JVM into GC hell where it keeps trying CMS 
> collection but can't free up any heap space. This pattern happens for every 
> node in our cluster and is requiring rolling cassandra restarts just to keep 
> the cluster running. We have upgraded the cluster per Datastax docs from the 
> 2.0 branch a couple of months ago and have been using the data from this 
> cluster for more than a year without problem.
> As the heap fills up with non-GC-able objects, the CPU/OS load average grows 
> along with it. Heap dumps reveal an increasing number of 
> java.util.concurrent.ConcurrentLinkedQueue$Node objects. We took heap dumps 
> over a 2 day period, and watched the number of Node objects go from 4M, to 
> 19M, to 36M, and eventually about 65M objects before the node stops 
> responding. The screen capture of our heap dump is from the 19M measurement.
> Load on the cluster is minimal. We can see this effect even with only a 
> handful of writes per second. (See attachments for Opscenter snapshots during 
> very light loads and heavier loads). Even with only 5 reads a sec we see this 
> behavior.
> Log files show repeated errors in Ref.java:181 and Ref.java:279 and "LEAK 
> detected" messages:
> {code}
> ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Error 
> when closing class 
> org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31 
> rejected from 
> org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d6[Terminated,
>  pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1644]
> {code}
> {code}
> ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK 
> DETECTED: a reference 
> (org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) to class 
> org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:/data2/data/ourtablegoeshere-ka-1151
>  was not released before the reference was garbage collected
> {code}
> This might be related to [CASSANDRA-8723]?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to