Hi Erick,

Please allow me to disagree on this. A node dropping reads and writes doesn't always mean the disk is the bottleneck. I have seen the same behaviour when a node had excessive STW GCs and a lot of timeouts, and I have also seen writes get dropped because the size of a mutation exceeded half of the commit log segment size. I'd like to keep an open mind until the diagnosis is supported by evidence, so we don't end up wasting time (and money) trying to fix an issue that doesn't exist in the first place.
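For reference, that mutation-size limit is half of commitlog_segment_size_in_mb from cassandra.yaml. A minimal sketch, assuming the default value of 32 MB (substitute whatever your own config says):

```python
# Sketch: Cassandra rejects any single mutation larger than half the
# commit log segment size. 32 is the default commitlog_segment_size_in_mb;
# this is an assumption -- check the value in your own cassandra.yaml.
commitlog_segment_size_in_mb = 32
max_mutation_size_bytes = commitlog_segment_size_in_mb * 1024 * 1024 // 2
print(max_mutation_size_bytes)  # 16777216, i.e. 16 MB
```

If your writes batch many rows into one mutation, it's worth checking whether they can approach that threshold.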


Cheers,

Bowen

On 05/03/2021 23:09, Erick Ramirez wrote:
The tpstats you posted show that the node is dropping reads and writes, which means your disk can't keep up with the load -- the disk is the bottleneck. If you haven't already, place the data and commitlog directories on separate disks so they're not competing for the same IO bandwidth. Note that it's OK to have them on the same disk/volume if you have NVMe SSDs, since those are much harder to saturate.
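For example, the relevant cassandra.yaml settings look like this (the mount points below are placeholders; use whatever devices you actually have):

```yaml
# cassandra.yaml -- example only; the /mnt/... paths are hypothetical.
# Point commitlog_directory at its own physical device so commitlog
# writes don't compete with data-file IO.
data_file_directories:
    - /mnt/data/cassandra/data
commitlog_directory: /mnt/commitlog/cassandra/commitlog
```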

The challenge with monitoring is that it typically only samples disk stats every 5 minutes (for example). But your app traffic is bursty in nature, so stats averaged over a long window are misleading, because the only thing that matters is what the disk IO is at the time you hit peak load.
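To illustrate with made-up numbers (the per-second utilisation samples below are hypothetical, not from your node), a short burst of disk saturation all but disappears in a 5-minute average:

```python
# Hypothetical disk %util samples, one per second over a 5-minute window:
# the disk is pegged at 100% for 10 seconds and near-idle the rest.
samples = [10] * 290 + [100] * 10

average = sum(samples) / len(samples)
print(average)       # 13.0 -- the 5-minute average looks perfectly healthy
print(max(samples))  # 100  -- but the peak saturated the disk
```

This is why sampling at 1-second resolution during peak traffic (e.g. with iostat) tells you far more than a coarse monitoring dashboard.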

The dropped reads and mutations tell you the node is overloaded. Provided your nodes are configured correctly, the only way out of this situation is to size your cluster properly and add more nodes -- your cluster needs to be sized for peak loads, not average throughput. Cheers!
