Hi,

We run a 48-node cluster that stores counts in wide rows. Each node uses
roughly 1 TB of a 2 TB EBS gp2 volume for its data directory, and the
tables use LeveledCompactionStrategy. We have been trying to bootstrap new
nodes that use a RAID 0 configuration over two 1 TB EBS volumes to raise
the AWS I/O throughput cap from 160 MB/s to 250 MB/s. Every time a node
finishes streaming it is hit with a large backlog of compactions. CPU load
on the new node spikes extremely high, CPU load on all the other nodes in
the cluster drops unreasonably low, and our app's write latency to this
cluster averages 10 seconds or more. We've already tried throttling
compaction throughput down to 1 MB/s, and we've always had
concurrent_compactors set to 2, but the disk is still saturated. In every
case we have had to shut down the Cassandra process on the new node to
restore acceptable operation.

We're currently upgrading all of our clients to the 3.11.0 version of the
DataStax Python driver, which will let us blacklist the next newly
bootstrapped node in the clients. The hope is that if it doesn't receive
writes from clients, the rest of the cluster can serve them adequately (as
is the case whenever we shut down the bootstrapping node) while the new
node finishes its compactions.
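
Roughly what we have in mind is the sketch below, using the driver's
HostFilterPolicy to keep the bootstrapping node out of the query plan. The
IP addresses, data center name, and keyspace are placeholders, and we
haven't put this into production yet:

    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, HostFilterPolicy

    # Placeholder address for the node that is currently bootstrapping.
    BLACKLISTED_HOSTS = {'10.0.0.100'}

    # Wrap the normal load balancing policy so the bootstrapping node is
    # never chosen as a coordinator for client requests.
    load_balancing_policy = HostFilterPolicy(
        child_policy=DCAwareRoundRobinPolicy(local_dc='us-east'),
        predicate=lambda host: host.address not in BLACKLISTED_HOSTS,
    )

    cluster = Cluster(
        contact_points=['10.0.0.1', '10.0.0.2'],  # placeholder seeds
        load_balancing_policy=load_balancing_policy,
    )
    session = cluster.connect('counts')  # placeholder keyspace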

We were also interested in hearing if anyone has had much luck using the
sstableofflinerelevel tool, and if this is a reasonable approach for our
issue.

One of my colleagues found a post describing a similar issue where the
bloom filters had an extremely high false positive ratio. I didn't check
the bloom filter stats during any of these bootstrap attempts, but with
that many pending compactions it seems likely we would observe the same
thing.

Would appreciate any guidance anyone can offer.

Thanks,
Paul
