Hi,

Below I try to give a full picture of the problem I'm facing.

This is a 12-node cluster running on EC2 with m2.xlarge instances (17 GB RAM, 2 CPUs).
Cassandra version is 1.0.8.
The cluster normally handles between 1,500 and 3,000 reads per second and between 800 and 1,700 writes per second (depending on the time of day), according to OpsCenter.
RF=3; no row caches are used.

Memory-relevant configs from cassandra.yaml:
flush_largest_memtables_at: 0.85
reduce_cache_sizes_at: 0.90
reduce_cache_capacity_to: 0.75
commitlog_total_space_in_mb: 4096

The relevant JVM options used are:
-Xms8000M -Xmx8000M -Xmn400M
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly

Here is what happens with these settings: after a Cassandra process restart, GC works fine at first, and the used-heap graph looks like a saw with perfect teeth. Eventually the teeth start to shrink until they become barely noticeable, and then Cassandra starts spending a lot of CPU time doing GC. Such a cycle takes about 2 weeks, after which I need to restart the Cassandra process to restore performance. During all this time there are no memory-related messages in Cassandra's system.log, except an occasional "GC for ParNew" message reporting a little above 200ms.
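(In case anyone wants to see the same picture outside of the graphs: old-generation occupancy can be watched with jstat, the standard JDK tool; the 5-second interval below is an arbitrary choice and <cassandra-pid> is a placeholder.)

# prints survivor/eden/old/perm occupancy percentages plus young- and full-GC
# counts and accumulated times, sampled every 5 seconds
jstat -gcutil <cassandra-pid> 5000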

Things I've already done to try to reduce this eventual heap pressure:
1) Adjusting bloom_filter_fp_chance so that the bloom filters shrink, which reduced the total size of all Filter.db files on a node from ~700MB to ~280MB (sketch of the commands below).
2) Reducing key cache sizes, and dropping the key caches entirely for CFs that do not get many reads.
3) Increasing the heap size from 7000M to 8000M.
None of these really helped; only the increase from 7000M to 8000M lengthened the cycle until excessive GC from ~9 days to ~14 days.
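(For reference on (1) and (2): both are per-column-family schema changes. A rough sketch of how they look in cassandra-cli - "MyCF" and the values are placeholders, and I'm writing the attribute names from memory, so please double-check them against 1.0.x before using:)

update column family MyCF with bloom_filter_fp_chance = 0.1;
update column family MyCF with keys_cached = 0;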

I've tried to graph, over time, the amount of data that is supposed to be in the heap vs. the actual heap size, by summing up all CFs' bloom filter sizes + all CFs' key cache capacities multiplied by the average key size + all CFs' reported memtable data sizes (I deliberately overestimated the data size a bit, to be on the safe side). Here is a link to a graph showing the last 2 days of metrics for a node that could no longer GC effectively and whose Cassandra process was then restarted:
http://awesomescreenshot.com/0401w5y534
You can clearly see that before and after the restart, the amount of data that is supposed to be in the heap is pretty much the same,
which makes me think that what I really need is GC tuning.
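(For completeness: the per-CF inputs to that estimate can be pulled from nodetool cfstats. The label text in the grep is how I recall it from 1.0.x output, so it may need adjusting:)

# bloom filter bytes, key cache capacity (entries - multiply by an average key
# size), and memtable data size, per column family
nodetool -h localhost cfstats | grep -E 'Bloom Filter Space Used|Key cache capacity|Memtable Data Size'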

I also don't think this is due to the total number of keys each node holds, which is between 200 and 300 million (summing the key estimates of all CFs on a node). The nodes have data sizes between 45G and 75G, roughly in proportion to their number of keys, and all of them start having heavy GC load after about 14 days. The excessive GC and heap usage are also not affected by load, which varies with the time of day (see the read/write rates at the beginning of the mail). So based on this, I assume the problem is not a large number of keys or too much load on the cluster, but a pure GC misconfiguration issue.

Things I remember trying for GC tuning:
1) Changing -XX:MaxTenuringThreshold from 1 to values like 8 - did not help.
2) Adding -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=10 -XX:ParallelGCThreads=2 -XX:ParallelCMSThreads=1
    - this actually made things worse.
3) Adding -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=8 - did not help.

Also, since it takes about 2 weeks to verify that a GC setting change did not help, trying all the possibilities is a painfully slow process :)
I'd highly appreciate any help and hints on GC tuning.
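(One more note in case it helps whoever replies: I can enable full GC logging through JVM_OPTS in cassandra-env.sh with the standard HotSpot options below - the log path is just an example - so that -XX:+PrintTenuringDistribution shows whether the MaxTenuringThreshold changes actually have an effect.)

JVM_OPTS="$JVM_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution -Xloggc:/var/log/cassandra/gc.log"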

tnx
Alex





