Hello Team,

I have a production cluster of 17 nodes (8 and 9 nodes across 2 DCs).
Cassandra version: 2.0.11
Clients connect using Thrift over port 9160
JDK version: 1.8.0_66
GC used: G1GC (16 GB heap)
Other GC settings:
-XX:MaxGCPauseMillis=200
-XX:ParallelGCThreads=32
-XX:ConcGCThreads=10
-XX:InitiatingHeapOccupancyPercent=50
Number of CPU cores per node: 40
Memory size: 185 GB
Reads/sec: 300 per node
Writes/sec: 300 per node
Compaction strategy used: SizeTieredCompactionStrategy
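In case it helps, here is how I understand these flags look in our cassandra-env.sh (a reconstructed sketch from the settings above; the exact lines in our file may differ slightly):

```sh
# GC settings as applied via cassandra-env.sh (reconstructed sketch)
JVM_OPTS="$JVM_OPTS -Xms16G -Xmx16G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=200"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=32"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=10"
JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=50"
```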

Identified issues in the cluster:
1. Disk space usage across all nodes in the cluster is at 80%. We are currently 
working on adding more storage to each node.
2. There are 2 tables for which we keep seeing a large number of tombstones. 
For one of these tables, read requests scanned 120 tombstone cells in the last 
5 minutes compared to only 4 live cells. Tombstone warnings and error messages 
about queries being aborted are also seen.
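For context, the warnings and aborts are governed by the tombstone thresholds in cassandra.yaml; as far as I know we are running with the stock 2.0 defaults (values below are those defaults, not custom tuning on our side):

```yaml
# cassandra.yaml tombstone thresholds (Cassandra 2.0 defaults)
tombstone_warn_threshold: 1000       # log a warning when a query scans this many tombstones
tombstone_failure_threshold: 100000  # abort the query beyond this many tombstones
```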

Current issues seen:
1. We keep seeing GC pauses of a few minutes at random across nodes in the 
cluster; pauses of 120 seconds, and even 770 seconds, have been observed.
2. This causes nodes to stall, and clients see a direct impact.
3. The GC pauses we see do not occur during any of the G1GC phases. The GC log 
message prints “Time to stop threads took 770 seconds”. So it is not the 
garbage collector doing any work; rather, bringing the threads to a safepoint 
is what is taking so long.
4. This issue surfaced recently, after we changed from an 8 GB heap (CMS) to a 
16 GB heap (G1GC) across all nodes in the cluster.
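To narrow this down, I understand the following standard HotSpot (JDK 8) flags can be enabled to log time-to-safepoint details; this is a sketch of what I plan to add, not something already applied on our cluster:

```sh
# Extra JVM flags to diagnose long time-to-safepoint pauses (JDK 8)
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"   # total stopped time, including safepoint sync
JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics"        # per-safepoint breakdown (spin/block/sync/vmop)
JVM_OPTS="$JVM_OPTS -XX:PrintSafepointStatisticsCount=1"  # print statistics for every safepoint
```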

Kindly help with the above issue. I am unable to tell whether the GC is wrongly 
tuned or whether something else is going on.

Thanks,
Rajsekhar Mallick


