Hello Team,

I have a cluster of 17 nodes in production (8 and 9 nodes across 2 DCs).

Cassandra version: 2.0.11
Client: connecting over Thrift on port 9160
JDK version: 1.8.0_66
GC used: G1GC (16 GB heap)
Other GC settings:
  MaxGCPauseMillis=200
  ParallelGCThreads=32
  ConcGCThreads=10
  InitiatingHeapOccupancyPercent=50
CPU cores per node: 40
Memory size: 185 GB
Reads/sec: 300 per node
Writes/sec: 300 per node
Compaction strategy: SizeTieredCompactionStrategy
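For reference, the settings above would look roughly like the fragment below in cassandra-env.sh (the file name and the extra diagnostic flags are my assumptions, not something already in place on these nodes; all flags shown exist in JDK 8):

```shell
# Assumed location: conf/cassandra-env.sh (adjust for your packaging).
# Heap and G1 settings as described above.
JVM_OPTS="$JVM_OPTS -Xms16G -Xmx16G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=200"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=32"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=10"
JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=50"

# Suggested additions (logging only) so the next long stall is attributable:
# record every stop-the-world pause, and per-safepoint timing breakdowns.
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"
# Often recommended when time-to-safepoint is long on busy disks: stops the
# JVM writing its perf counters to a memory-mapped file that can block on I/O.
JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"
```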
Identified issues in the cluster:

1. Disk space usage across all nodes is at 80%. We are currently working on adding more storage to each node.
2. Two tables keep accumulating large numbers of tombstones. On one of them, read requests scanned 120 tombstone cells in the last 5 minutes against only 4 live cells. We also see tombstone warnings and error messages about queries being aborted.

Current issues seen:

1. We keep seeing GC pauses of a few minutes at random across nodes in the cluster; pauses of 120 seconds, and even 770 seconds, have been observed.
2. This stalls the affected nodes, and clients see a direct impact.
3. The pauses do not occur during any G1GC phase. The GC log prints "Time to stop threads took 770 seconds", so it is not the garbage collector doing work: bringing the threads to a safepoint is what takes so long.
4. This issue surfaced recently, after we moved every node in the cluster from an 8 GB heap with CMS to a 16 GB heap with G1GC.

Kindly help with the above issue. I am not able to tell whether the GC is mistuned or whether this is something else.

Thanks,
Rajsekhar Mallick
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
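P.S. To quantify how often "Stopping threads took" exceeds a given threshold, a minimal sketch like the one below can be run against the GC log (it assumes the `-XX:+PrintGCApplicationStoppedTime` line format; the log path and the 1-second threshold are illustrative assumptions, not values from this cluster):

```shell
#!/bin/sh
# Print every time-to-safepoint value (in seconds) above 1.0 from a GC log
# produced with -XX:+PrintGCApplicationStoppedTime.
extract_long_stops() {
  awk -F'Stopping threads took: ' '
    NF > 1 {
      split($2, a, " ")                 # a[1] = seconds spent reaching the safepoint
      if (a[1] + 0 > 1.0) print a[1]
    }' "$1"
}

# Demo against an inline sample log (the real path might be something
# like /var/log/cassandra/gc.log):
cat > /tmp/gc_sample.log <<'EOF'
Total time for which application threads were stopped: 0.0100 seconds, Stopping threads took: 0.0002 seconds
Total time for which application threads were stopped: 120.5000 seconds, Stopping threads took: 119.9 seconds
EOF
extract_long_stops /tmp/gc_sample.log   # prints 119.9
```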