The tpstats you posted show that the node is dropping reads and writes,
which means the disk can't keep up with the load: the disk is your
bottleneck. If you haven't already, place the data and commitlog
directories on separate disks so they aren't competing for the same IO
bandwidth. It's OK to have them on the same disk/volume if you have
NVMe SSDs, since those are much harder to saturate.
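
For reference, the split is just the two directory settings in
cassandra.yaml pointed at different devices; something like the snippet
below, where the mount points are only placeholders for whatever
volumes you actually have:

    # cassandra.yaml -- point these at separate physical devices/mounts
    data_file_directories:
        - /mnt/data/cassandra/data          # SSTables on their own disk
    commitlog_directory: /mnt/commitlog/cassandra/commitlog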

The challenge with monitoring is that it typically only samples disk
stats every 5 minutes (for example). Your app traffic is bursty, so
stats averaged over a long period are misleading; the only thing that
matters is what the disk IO looks like at the time you hit peak load.
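
If you want to catch it in the act, run something like iostat at a
one-second interval during a peak window and watch %util and the await
columns for the devices backing your data and commitlog volumes:

    # extended per-device stats, sampled every second
    iostat -x 1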

The dropped reads and mutations tell you the node is overloaded. Provided
your nodes are configured correctly, the only way out of this situation is
to correctly size your cluster and add more nodes -- your cluster needs to
be sized for peak loads, not average throughput. Cheers!
