Are you running repairs? You may try:

- increase concurrent_compactors to 8 (the max in 2.1.x)
- increase compaction_throughput_mb_per_sec to more than 16 MB/s (48 may be a good start)
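For example, something like this (a sketch, adjust the numbers to your
hardware; as far as I know, in 2.1.x the throughput can be changed live with
nodetool, while concurrent_compactors is a cassandra.yaml setting that needs
a restart):

$ nodetool setcompactionthroughput 48

# cassandra.yaml, applied with a rolling restart
concurrent_compactors: 8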
What kind of data are you storing in these tables? Time series?

2016-03-21 23:37 GMT+01:00 Gianluca Borello <gianl...@sysdig.com>:
> Thank you for your reply, definitely appreciate the tip on the compressed
> size.
>
> I understand your point, in fact whenever we bootstrap a new node we see a
> huge number of pending compactions (in the order of thousands), and they
> usually decrease steadily until they reach 0 in just a few hours. With this
> node, however, we are way beyond that point, it has been 3 days since the
> number of pending compactions started fluctuating around ~150 without any
> sign of going down (I can see from OpsCenter it's almost a straight line
> starting a few hours after the bootstrap). In particular, to reply to your
> point:
>
> - The number of sstables for this CF on this node is around 250, which is in
> the same range as all the other nodes in the cluster (I counted the number
> on each one of them, and every node is in the 200-400 range)
>
> - This theory doesn't seem to explain why, when doing "nodetool drain", the
> compactions completely stop after a few minutes and I get something such as:
>
> $ nodetool compactionstats
> pending tasks: 128
>
> So no compactions are being executed (since there is no more write
> activity), but the pending number is still high.
>
> Thanks again
>
>
> On Mon, Mar 21, 2016 at 3:19 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
> wrote:
>>
>> > We added a bunch of new nodes to a cluster (2.1.13) and everything went
>> > fine, except for the number of pending compactions that is staying quite
>> > high on a subset of the new nodes. Over the past 3 days, the pending
>> > compactions have never been less than ~130 on such nodes, with peaks of
>> > ~200.
>>
>> When you bootstrap with vnodes, you end up with thousands (or tens of
>> thousands) of sstables – with 256 vnodes (the default) * 20 sstables per
>> node, your resulting node will have 5k sstables. It takes quite a while for
>> compaction to chew through that. If you added a bunch of nodes in sequence,
>> you'd have 5k on the first node, then potentially 10k on the next, and the
>> count could keep increasing as you start streaming from nodes that have way
>> too many sstables. This is one of the reasons that many people who have to
>> grow their clusters frequently try not to use vnodes.
>>
>> From your other email:
>>
>> > Also related to this point, now I'm seeing something even more odd: some
>> > compactions are way bigger than the size of the column family itself, such
>> > as:
>>
>> The size reported by compactionstats is the uncompressed size – if you're
>> using compression, it's perfectly reasonable for 30G of data to show up as
>> 118G of data during compaction.
>>
>> - Jeff
>>
>> From: Gianluca Borello
>> Reply-To: "user@cassandra.apache.org"
>> Date: Monday, March 21, 2016 at 12:50 PM
>> To: "user@cassandra.apache.org"
>> Subject: Pending compactions not going down on some nodes of the cluster
>>
>> Hi,
>>
>> We added a bunch of new nodes to a cluster (2.1.13) and everything went
>> fine, except for the number of pending compactions that is staying quite
>> high on a subset of the new nodes. Over the past 3 days, the pending
>> compactions have never been less than ~130 on such nodes, with peaks of
>> ~200. On the other nodes, they correctly fluctuate between 0 and ~20, which
>> has been our norm for a long time.
>>
>> We are quite paranoid about pending compactions because in the past such a
>> high number caused a lot of data being brought into memory during some
>> reads, and that triggered a chain reaction of full GCs that brought down our
>> cluster, so we try to monitor them closely.
>>
>> Some data points that should let the situation speak for itself:
>>
>> - We use LCS for all our column families
>>
>> - The cluster is operating absolutely fine and seems healthy, and every
>> node is handling pretty much the same load in terms of reads and writes.
>> Also, the nodes with higher pending compactions don't seem to be performing
>> any worse than the others
>>
>> - The pending compactions don't go down even when setting the compaction
>> throughput to unlimited for a very long time
>>
>> - This is the typical output of compactionstats and tpstats:
>>
>> $ nodetool compactionstats
>> pending tasks: 137
>>    compaction type   keyspace            table     completed         total    unit   progress
>>         Compaction     draios   message_data60    6111208394    6939536890   bytes     88.06%
>>         Compaction     draios    message_data1   26473390790   37243294809   bytes     71.08%
>> Active compaction remaining time : n/a
>>
>> $ nodetool tpstats
>> Pool Name                    Active   Pending   Completed   Blocked   All time blocked
>> CounterMutationStage              0         0           0         0                  0
>> ReadStage                         1         0   111766844         0                  0
>> RequestResponseStage              0         0   244259493         0                  0
>> MutationStage                     0         0   163268653         0                  0
>> ReadRepairStage                   0         0     8933323         0                  0
>> GossipStage                       0         0      363003         0                  0
>> CacheCleanupExecutor              0         0           0         0                  0
>> AntiEntropyStage                  0         0           0         0                  0
>> MigrationStage                    0         0           2         0                  0
>> Sampler                           0         0           0         0                  0
>> ValidationExecutor                0         0           0         0                  0
>> CommitLogArchiver                 0         0           0         0                  0
>> MiscStage                         0         0           0         0                  0
>> MemtableFlushWriter               0         0       32644         0                  0
>> MemtableReclaimMemory             0         0       32644         0                  0
>> PendingRangeCalculator            0         0         527         0                  0
>> MemtablePostFlush                 0         0       36565         0                  0
>> CompactionExecutor                2        70      108621         0                  0
>> InternalResponseStage             0         0           0         0                  0
>> HintedHandoff                     0         0          10         0                  0
>> Native-Transport-Requests         6         0   188996929         0              79122
>>
>> Message type           Dropped
>> RANGE_SLICE                  0
>> READ_REPAIR                  0
>> PAGED_RANGE                  0
>> BINARY                       0
>> READ                         0
>> MUTATION                     0
>> _TRACE                       0
>> REQUEST_RESPONSE             0
>> COUNTER_MUTATION             0
>>
>> - If I do a nodetool drain on such nodes, and then wait for a while, the
>> number of pending compactions stays high even if there are no compactions
>> being executed anymore and the node is completely idle:
>>
>> $ nodetool compactionstats
>> pending tasks: 128
>>
>> - It's also interesting to notice how the compaction in the previous
>> example is trying to compact ~37 GB, which is essentially the whole size of
>> the column family message_data1 as reported by cfstats:
>>
>> $ nodetool cfstats -H draios.message_data1
>> Keyspace: draios
>>     Read Count: 208168
>>     Read Latency: 2.4791508685292647 ms.
>>     Write Count: 502529
>>     Write Latency: 0.20701542000561163 ms.
>>     Pending Flushes: 0
>>         Table: message_data1
>>         SSTable count: 261
>>         SSTables in each level: [43/4, 92/10, 125/100, 0, 0, 0, 0, 0, 0]
>>         Space used (live): 36.98 GB
>>         Space used (total): 36.98 GB
>>         Space used by snapshots (total): 0 bytes
>>         Off heap memory used (total): 36.21 MB
>>         SSTable Compression Ratio: 0.15461126176169512
>>         Number of keys (estimate): 101025
>>         Memtable cell count: 229344
>>         Memtable data size: 82.4 MB
>>         Memtable off heap memory used: 0 bytes
>>         Memtable switch count: 83
>>         Local read count: 208225
>>         Local read latency: 2.479 ms
>>         Local write count: 502581
>>         Local write latency: 0.208 ms
>>         Pending flushes: 0
>>         Bloom filter false positives: 11497
>>         Bloom filter false ratio: 0.04307
>>         Bloom filter space used: 94.97 KB
>>         Bloom filter off heap memory used: 92.93 KB
>>         Index summary off heap memory used: 57.88 KB
>>         Compression metadata off heap memory used: 36.06 MB
>>         Compacted partition minimum bytes: 447 bytes
>>         Compacted partition maximum bytes: 34.48 MB
>>         Compacted partition mean bytes: 1.51 MB
>>         Average live cells per slice (last five minutes): 26.269698643294515
>>         Maximum live cells per slice (last five minutes): 100.0
>>         Average tombstones per slice (last five minutes): 0.0
>>         Maximum tombstones per slice (last five minutes): 0.0
>>
>> - There are no warnings or errors in the log, even after a clean restart
>>
>> - Restarting the node doesn't seem to have any effect on the number of
>> pending compactions
>>
>> Any help would be much appreciated.
>>
>> Thank you for reading
>
-- 
Close the World, Open the Net
http://www.linux-wizard.net