Hi,
Below I try to give a full picture of the problem I'm facing.
This is a 12-node cluster running on EC2 with m2.xlarge instances (17G
RAM, 2 CPUs).
Cassandra version is 1.0.8.
The cluster normally handles between 1500 and 3000 reads per second
(depending on the time of day) and between 800 and 1700 writes per
second, according to OpsCenter.
RF=3, and no row caches are used.
Memory-relevant configs from cassandra.yaml:
flush_largest_memtables_at: 0.85
reduce_cache_sizes_at: 0.90
reduce_cache_capacity_to: 0.75
commitlog_total_space_in_mb: 4096
Relevant JVM options used are:
-Xms8000M -Xmx8000M -Xmn400M
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly
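If it helps with diagnosis, I can also turn on verbose GC logging on
one of the nodes and post the output here; as far as I remember the
relevant HotSpot flags are along these lines (quoting them from memory,
and the log path is just an example):
-Xloggc:/var/log/cassandra/gc.log
-XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution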
Now what happens is that with these settings, after a cassandra process
restart, GC works fine at the beginning and the used heap looks like a
saw with perfect teeth. Eventually the teeth start to shrink until they
are barely noticeable, and then cassandra starts to spend a lot of CPU
time doing GC. Such a cycle takes about 2 weeks, and then I need to
restart the cassandra process to restore performance.
During all this time there are no memory-related messages in cassandra's
system.log, except an occasional "GC for ParNew" message reporting a
little above 200ms.
Things I've already done to try to reduce this eventual heap pressure:
1) raising bloom_filter_fp_chance (i.e. accepting a higher false
positive rate), which reduced the total size of all Filter.db files on
a node from ~700MB to ~280MB (see the cli sketch after this list)
2) reducing key cache sizes, and dropping the key cache entirely for
CFs which do not have many reads
3) increasing the heap size from 7000M to 8000M
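For reference, the schema changes in 1) and 2) were made per CF via
cassandra-cli, roughly like the following (CF names and values are just
placeholders, and I'm quoting the attribute names from memory, so they
may be slightly off):

update column family ReadHeavyCF with bloom_filter_fp_chance = 0.1;
update column family ColdCF with keys_cached = 0;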
None of these has really helped; only the increase from 7000M to 8000M
made a difference, stretching the cycle until excessive GC kicks in
from ~9 days to ~14 days.
I've tried to graph over time the amount of data that is supposed to
be in the heap vs. the actual heap size, by summing up all CFs' bloom
filter sizes + all CFs' key cache capacities multiplied by the average
key size + all CFs' reported memtable data sizes (I've overestimated
the data size a bit on purpose, to be on the safe side).
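In case the exact arithmetic matters, the estimate is computed along
these lines (sketched here as a small Python script over nodetool
cfstats output; the cfstats labels are written from memory for 1.0.x
and the average key size is a hand-picked overestimate, so treat it as
illustrative rather than my exact tooling):

import sys

AVG_KEY_SIZE = 64  # bytes per cached key; hand-picked overestimate

bloom = memtables = key_cache_entries = 0

for line in sys.stdin:
    # cfstats lines look like "<indent>Label: value", so split on the colon
    label, _, value = line.strip().partition(':')
    value = value.strip().split(' ')[0] if value.strip() else ''
    if not value.replace('.', '', 1).isdigit():
        continue
    if label == 'Bloom Filter Space Used':      # bytes
        bloom += int(float(value))
    elif label == 'Memtable Data Size':         # bytes
        memtables += int(float(value))
    elif label == 'Key cache capacity':         # number of keys
        key_cache_entries += int(float(value))

total = bloom + memtables + key_cache_entries * AVG_KEY_SIZE
print('estimated heap-resident data: %.0f MB' % (total / 1048576.0))

(run as something like: nodetool -h localhost cfstats | python
heap_estimate.py, once per node, and graph the result next to the JMX
heap usage)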
Here is a link to a graph showing the last 2 days of metrics for a
node which could no longer GC effectively and whose cassandra process
was then restarted:
http://awesomescreenshot.com/0401w5y534
You can clearly see that before and after the restart the amount of
data that is supposed to be in the heap is pretty much the same, which
makes me think that what I really need is GC tuning.
Also, I suppose this is not due to the total number of keys each node
holds, which is between 200 and 300 million (summing all CF key
estimates on a node).
The nodes have data sizes between 45G and 75G, in proportion to their
millions of keys, and all of them start to suffer heavy GC load after
about 14 days.
Also, the excessive GC and heap usage are not affected by the load,
which varies with the time of day (see the read/write rates at the
beginning of this mail).
So again, based on this, I assume the problem is not due to a large
number of keys or too much load on the cluster, but rather to a pure
GC misconfiguration issue.
Things I remember trying for GC tuning:
1) Changing -XX:MaxTenuringThreshold=1 to values like 8 - did not help.
2) Adding -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
-XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=10
-XX:ParallelGCThreads=2 -XX:ParallelCMSThreads=1
- this actually made things worse.
3) Adding -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=8 - did not help.
Also, since it takes about 2 weeks to verify that a GC setting change
did not help, trying all the possibilities is a painfully slow process :)
I'd highly appreciate any help or hints on GC tuning.
tnx
Alex