Cheng Ren created CASSANDRA-10150:
-------------------------------------

             Summary: Cassandra read latency potentially caused by memory leak
                 Key: CASSANDRA-10150
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10150
             Project: Cassandra
          Issue Type: Bug
          Components: Core
         Environment: cassandra 2.0.12
            Reporter: Cheng Ren


  We are currently migrating to a new Cassandra cluster that is multi-region 
on EC2.  Our previous cluster was also on EC2, but only in the east region.  In 
addition we have upgraded from Cassandra 2.0.4 to 2.0.12 and from Ubuntu 12 to 
Ubuntu 14.

  We are investigating a Cassandra latency problem on our new cluster.  The 
symptom is that over a long period of time (12-16 hours) the TP90-95 read 
latency degrades to the point of being well above our SLAs.  During normal 
operation our TP95 for a 50-key lookup is 75ms; when fully degraded, we are 
facing 300ms TP95 latencies.  A rolling restart resolves the problem.

We are noticing a strong correlation between Old Gen heap usage (and how much 
of it gets freed) and the high latencies.  We are running with a max heap size 
of 12GB and a max new-gen size of 2GB.
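
For reference, these are the sizes normally controlled by conf/cassandra-env.sh; assuming the stock script, the relevant settings look roughly like this (the values are ours, the variable names are the standard ones used by that script):

{noformat}
# conf/cassandra-env.sh -- explicit heap sizing
# (if left unset, the script computes defaults from system memory)
MAX_HEAP_SIZE="12G"
HEAP_NEWSIZE="2G"
{noformat}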

Below is a chart of heap usage over a 24-hour period.  Right below it is a 
chart of TP95 latencies (under a mixed workload of 50-key and single-key 
lookups), and the third image shows CMS Old Gen memory usage:
Overall heap usage over 24 hrs: (image)
TP95 latencies over 24 hours: (image)
Old Gen memory usage over 24 hours: (image)

 You can see from this that the old-gen region of the heap is what is using up 
the majority of the heap space.  We cannot figure out why that memory is not 
being collected during a full GC.  For reference, on our old Cassandra cluster 
a full GC clears up the majority of the heap space.  See the image below from 
an old production node operating normally:

(image)
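
If it helps, the same pattern should be visible live without waiting on the heap charts, using standard JDK tooling (a sketch; <cassandra-pid> is a placeholder for the Cassandra process id):

{noformat}
# sample heap occupancy (%) and GC counters every 10 seconds
jstat -gcutil <cassandra-pid> 10000
{noformat}

On a degraded node we would expect the O (old-gen occupancy) column to stay high even as the FGC (full GC count) column increments, whereas a healthy node drops back down after each full GC.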

From the heap dump file we found that most of the memory is consumed by 
unreachable objects.  With further analysis we were able to see that those 
objects are RMIConnectionImpl$CombinedClassLoader$ClassLoaderWrapper (holding 
4GB of memory) and java.security.ProtectionDomain (holding 2GB).  The only 
place we know of Cassandra using RMI is JMX, but does anyone have any clue 
where else these objects are used, and why they take up so much memory?
It would also be great if someone could offer further debugging tips on the 
latency or GC issue.
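
For anyone wanting to look at the same data, a dump like the one above can be captured with stock JDK tools along these lines (a sketch rather than a record of exactly what we ran; note that the :live variants of these commands force a full GC first and would discard the unreachable objects in question):

{noformat}
# quick class histogram, including unreachable objects (no :live option)
jmap -histo <cassandra-pid> | head -40

# full binary heap dump for offline analysis in a tool such as Eclipse MAT
jmap -dump:format=b,file=/tmp/cassandra-heap.hprof <cassandra-pid>
{noformat}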




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
