I am having some reliability problems in my Cassandra cluster which I am
almost certain are due to GC. I was about to start delving into the guts of
the problem by turning on GC logging, but I have never done any serious Java
GC tuning before (time to learn, I guess). As a first step, however, I was
hoping to gain some insight into the GC settings shipped with Cassandra 0.7.
I realize it's a pretty complicated problem, but I was specifically interested
in knowing about:

-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75

Why are these set the way they are? What specifically was used to determine
these settings? Was it purely experimental, or was there a specific
undesirable behavior that adding these settings corrected? From my various
web wanderings, I read the survivor ratio and tenuring threshold settings as
"Cassandra creates mostly long-lived objects, with objects being promoted
very quickly from the young generation to the old generation". Furthermore,
the CMSInitiatingOccupancyFraction of 75 (from a JVM default of 68) means
"start GC in the old generation later", presumably to allow Cassandra to use
more of the old generation heap without needlessly trying to free up used
space (?). Please correct me if I am misinterpreting these settings.
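
To make my reading of CMSInitiatingOccupancyFraction concrete, here is a
small Python sketch of the arithmetic as I understand it: the fraction is a
percentage of old generation occupancy at which a CMS cycle is started. The
10 GB old generation size is a made-up figure for illustration, not my
actual setting, and the function name is my own.

```python
def cms_trigger_bytes(old_gen_bytes, occupancy_fraction):
    """Old-generation occupancy (in bytes) at which a CMS cycle would start.

    Assumes the fraction is a whole-number percentage of the old generation.
    """
    return old_gen_bytes * occupancy_fraction // 100

GB = 1024 ** 3
old_gen = 10 * GB  # hypothetical old generation size, for illustration only

for fraction in (68, 75):
    trigger = cms_trigger_bytes(old_gen, fraction)
    print(f"fraction={fraction}: CMS starts at ~{trigger / GB:.2f} GB used")
```

If this is right, the higher value simply moves the trigger point closer to a
full old generation, leaving less headroom while the concurrent cycle runs.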

One of the issues I have been having is extreme node instability when
running a major compaction. After 20-30 seconds of operation, the node
spends 30+ seconds in (what I believe to be) GC. Now I have tried halving
all memtable thresholds to reduce overall heap memory usage but that has not
seemed to help with the instability. After one of these blips, I often see
log entries as follows:

 INFO [ScheduledTasks:1] 2011-01-17 10:41:21,961 GCInspector.java (line 133) GC for ParNew: 215 ms, 45084168 reclaimed leaving 11068700368 used; max is 12783583232
 INFO [ScheduledTasks:1] 2011-01-17 10:41:28,033 GCInspector.java (line 133) GC for ParNew: 234 ms, 40401120 reclaimed leaving 12144504848 used; max is 12783583232
 INFO [ScheduledTasks:1] 2011-01-17 10:42:15,911 GCInspector.java (line 133) GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving 9224048472 used; max is 12783583232
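
To put those numbers in perspective, here is a rough Python sketch that
pulls the fields out of the GCInspector messages above and reports how full
the heap was after each collection. The regular expression is just my guess
at the message layout based on these three entries, and the function name is
my own.

```python
import re

# Message part of the three GCInspector entries above, one per line.
LOG = """\
GC for ParNew: 215 ms, 45084168 reclaimed leaving 11068700368 used; max is 12783583232
GC for ParNew: 234 ms, 40401120 reclaimed leaving 12144504848 used; max is 12783583232
GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving 9224048472 used; max is 12783583232
"""

PATTERN = re.compile(
    r"GC for (\w+): (\d+) ms, (\d+) reclaimed leaving (\d+) used; max is (\d+)"
)

def parse_gc_lines(text):
    """Return (collector, duration_ms, reclaimed, used, max_heap) tuples."""
    rows = []
    for m in PATTERN.finditer(text):
        collector = m.group(1)
        duration_ms, reclaimed, used, max_heap = (int(g) for g in m.groups()[1:])
        rows.append((collector, duration_ms, reclaimed, used, max_heap))
    return rows

for collector, duration_ms, reclaimed, used, max_heap in parse_gc_lines(LOG):
    pct = 100.0 * used / max_heap
    print(f"{collector}: {duration_ms} ms, heap {pct:.1f}% used after collection")
```

By this arithmetic the heap was still roughly 87% and 95% used after the two
ParNew collections, and about 72% used after the ConcurrentMarkSweep.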

Given that the 3 GB of garbage collected via ConcurrentMarkSweep was
generated in < 30 seconds, one of the first things I was going to try was
increasing the SurvivorRatio (to 16) and the MaxTenuringThreshold (to 5) to
try to keep more objects in the young generation, where they would be
cleaned up faster. As a more general approach to solving my problem, I was
also going to reduce the CMSInitiatingOccupancyFraction to 65. Does this
seem reasonable? Obviously, the best answer is to just try it, but I hesitate
to start playing with settings when I have only the vaguest notions of what
they do and little concept of why they are there in the first place.
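
Before changing anything, I tried sketching the SurvivorRatio arithmetic in
Python. This assumes the HotSpot definition as I understand it from the
documentation: the ratio is eden to ONE survivor space, so the young
generation is divided into (ratio + 2) parts. The 800 MB young generation is
a made-up figure for illustration, and the function name is my own.

```python
def young_gen_layout(young_bytes, survivor_ratio):
    """Split a young generation into eden and two equal survivor spaces.

    Assumes SurvivorRatio = eden / one survivor space, so that
    young = eden + 2 * survivor = (ratio + 2) * survivor.
    """
    survivor = young_bytes // (survivor_ratio + 2)
    eden = young_bytes - 2 * survivor
    return eden, survivor

MB = 1024 * 1024
young = 800 * MB  # hypothetical young generation size, for illustration only

for ratio in (8, 16):
    eden, survivor = young_gen_layout(young, ratio)
    print(f"SurvivorRatio={ratio}: eden={eden // MB} MB, "
          f"each survivor={survivor // MB} MB")
```

If my reading is right, going from 8 to 16 would make each survivor space
smaller rather than larger, which may work against keeping objects in the
young generation for more collections. Corrections welcome.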

Thanks for any help
