With a 5s collection, the problem is almost certainly GC. GC pressure can be caused by a number of things, including normal read/write loads, but ALSO compaction calculation (pre-2.1.9 / #9882) and very large partitions (trying to load a very large partition with something like row cache in 2.0 and earlier, or issuing a full row read where the row is larger than you expect).
You can try to tune the GC behavior, but the underlying problem may be something like a bad data model (which Samuel suggested), and no amount of GC tuning is going to fix trying to do bad things with very big rows.

From: Roman Tkachenko
Reply-To: "user@cassandra.apache.org"
Date: Thursday, September 10, 2015 at 10:54 AM
To: "user@cassandra.apache.org"
Subject: Re: High CPU usage on some of nodes

Thanks for the responses, guys. I also suspected GC, and I guess it could be it: during the spikes the logs are filled with messages like "GC for ConcurrentMarkSweep: 5908 ms for 1 collections, 1986282520 used; max is 8375238656", often right before messages about dropped queries. The other, unaffected nodes only have "GC for ParNew: 230 ms for 1 collections, 4418571760 used; max is 8375238656" type of messages. Is my best shot to play with JVM settings and try to tune garbage collection, then?

On Thu, Sep 10, 2015 at 6:52 AM, Samuel CARRIERE <samuel.carri...@urssaf.fr> wrote:

Hi Roman,
If it affects only a subset of nodes and it's always the same ones, it could be a "problem" with your data model: maybe some (too) wide rows on these nodes. If one of your rows is too wide, deserializing the column index of that row can take a lot of resources (disk, RAM, and CPU). If you are using leveled compaction strategy and you see abnormally big sstables on those nodes, it could be a clue.
Regards,
Samuel

Robert Wille <rwi...@fold3.com> wrote on 10/09/2015 15:27:41:

> From: Robert Wille <rwi...@fold3.com>
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: 10/09/2015 15:30
> Subject: Re: High CPU usage on some of nodes
>
> It sounds like it's probably GC. Grep for GC in system.log to verify.
> If it is GC, there are a myriad of issues that could cause it, but
> at least you've narrowed it down.
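Robert's "grep for GC in system.log" tip can be made more quantitative by parsing the pause durations out of the matching lines. A minimal sketch in Python, assuming the log-line format shown in Roman's message above; the 1000 ms threshold is an arbitrary choice for this example, not a Cassandra default:

```python
import re

# Matches GC log lines of the form quoted in this thread, e.g.:
#   GC for ConcurrentMarkSweep: 5908 ms for 1 collections, 1986282520 used; max is 8375238656
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms for (?P<n>\d+) collections, "
    r"(?P<used>\d+) used; max is (?P<max>\d+)"
)

def long_pauses(lines, threshold_ms=1000):
    """Return (collector, pause_ms, heap_used_fraction) for every GC line
    whose pause is at least threshold_ms."""
    out = []
    for line in lines:
        m = GC_LINE.search(line)
        if m and int(m.group("ms")) >= threshold_ms:
            frac = int(m.group("used")) / int(m.group("max"))
            out.append((m.group("collector"), int(m.group("ms")), round(frac, 2)))
    return out

# The two sample lines from Roman's message:
log = [
    "GC for ConcurrentMarkSweep: 5908 ms for 1 collections, 1986282520 used; max is 8375238656",
    "GC for ParNew: 230 ms for 1 collections, 4418571760 used; max is 8375238656",
]
print(long_pauses(log))
# [('ConcurrentMarkSweep', 5908, 0.24)]
```

Feeding it the output of `grep "GC for" system.log` on each node would show at a glance which nodes are taking multi-second CMS pauses and which only see short ParNew collections.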
> > On Sep 9, 2015, at 11:05 PM, Roman Tkachenko <ro...@mailgunhq.com> wrote:
> >
> > Hey guys,
> >
> > We've been having issues in the past couple of days with CPU usage
> > / load average suddenly skyrocketing on some nodes of the cluster,
> > affecting performance significantly, so the majority of requests start
> > timing out. It can go on for several hours, with CPU spiking through
> > the roof then coming back down to normal, and so on. Weirdly, it
> > affects only a subset of nodes, and it's always the same ones. The
> > boxes Cassandra is running on are pretty beefy, 24 cores, and these
> > CPU spikes go up to >1000%.
> >
> > What is the best way to debug this kind of issue and find out
> > what Cassandra is doing during spikes like this? It doesn't seem to be
> > compaction related, as sometimes during these spikes "nodetool
> > compactionstats" says no compactions are running.
> >
> > Thanks!
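On Roman's original question ("what is Cassandra doing during spikes like this?"), one standard JVM-level approach is to combine `top -H -p <pid>` (per-thread CPU, decimal thread ids) with `jstack <pid>` (thread dumps, where the native id appears as `nid=0x<hex>`). The join between the two is just a decimal-to-hex conversion; a minimal sketch, where the jstack excerpt is invented for illustration (thread names and ids are made up, though the header format matches real jstack output):

```python
def tid_to_nid(tid):
    """Convert a decimal Linux thread id (as shown by `top -H`) to the
    nid=0x<hex> token that jstack prints in its thread header lines."""
    return "nid=0x%x" % tid

def match_hot_threads(hot_tids, jstack_output):
    """Map each high-CPU thread id to its jstack header line, if present."""
    found = {}
    for tid in hot_tids:
        token = tid_to_nid(tid)
        for line in jstack_output.splitlines():
            if token in line:
                found[tid] = line.strip()
    return found

# Hypothetical jstack excerpt -- names and addresses are illustrative only.
jstack_output = '''
"CompactionExecutor:1" daemon prio=10 tid=0x00007f1a2c001000 nid=0x3039 runnable
"SharedPool-Worker-1" daemon prio=10 tid=0x00007f1a2c002000 nid=0x303a waiting
'''

# Suppose `top -H` showed thread 12345 burning CPU (0x3039 in hex):
print(match_hot_threads([12345], jstack_output))
```

If the hot threads turn out to be GC threads, that corroborates the GC theory; if they are read or compaction threads, that points back at the data model.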