I had a 5-node cluster, then increased it to 6, then 7, then 8, and then went back to 7. I installed 3.11.6 back when num_tokens defaulted to 256, so as far as I understand it should scale out to new nodes with nicely balanced ownership (at the cost of longer repairs), but I get this status:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address            Load         Tokens  Owns
UN  node1              1.08 TiB     256     46.4%
UN  node2              1.06 TiB     256     45.8%
UN  node3              1.02 TiB     256     45.1%
UN  node4              1.01 TiB     256     46.6%
UN  node5              994.92 GiB   256     44.0%
UN  node7              1.04 TiB     256     38.1%
UN  node8              882.03 GiB   256     33.9%

(I renamed the nodes and sorted them by the order in which they joined the cluster; note that node6 was decommissioned and later replaced by node8)
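
(In case it matters: passing an explicit keyspace to "nodetool status" makes the Owns column report effective ownership, i.e. including replicas, for that keyspace's replication settings; the keyspace name below is just a placeholder.)

% nodetool status my_keyspace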

This is a Prometheus+Grafana graph of a new table being populated (the table was created when the cluster was already stable with node8):

https://i.imgur.com/CLDLENU.png

I don't understand why node7 (in blue) and node8 (in red) hold much less data than the others, as consistently reported both by the "Owns" column and by the graph.
PS: the purple line at the top is the disaster recovery node in a remote location; it is a single node rather than a cluster, so it is expected to carry much more load than the others.

I tried summing all the token ranges from "nodetool ring" and they are quite well balanced (as expected with 256 vnodes per node, I guess):

% nodetool ring | awk '/^=/ { prev = -1 } /^[0-9]/ { ip = $1; pos = $8; if (prev != -1) host[ip] += pos - prev; prev= pos; } END { tot = 0; for (ip in host) if (ip != "nodeDR") tot += host[ip]; for (ip in host) print host[ip] / tot, ip; }'
0.992797 nodeDR
0.146039 node1
0.148853 node2
0.139175 node3
0.135932 node4
0.140542 node5
0.143875 node7
0.145583 node8
(yes, I know it has a slight bias because it doesn't handle the first range of each section correctly, but that's less than 0.8%; a variant that accounts for the wrap-around range is sketched below)
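
For completeness, here is a sketch of the same idea that credits the wrap-around range (from the last token back around to the first) to the node owning the first token. It assumes Murmur3Partitioner (tokens in [-2^63, 2^63-1]) and the same "nodetool ring" field layout as the one-liner above; awk doubles lose a little precision at that scale, but not enough to matter for percentages.

nodetool ring | awk '
  function closedc() {                # credit the wrap-around range of the section
    if (first != "") host[firstip] += (first - prev) + 2^64;
    first = "";
  }
  /^=/     { closedc() }              # a new datacenter section starts
  /^[0-9]/ {
    ip = $1; pos = $8;
    if (first == "") { first = pos; firstip = ip }
    else             { host[ip] += pos - prev }
    prev = pos;
  }
  END {
    closedc();
    for (ip in host) if (ip != "nodeDR") tot += host[ip];
    for (ip in host) print host[ip] / tot, ip;
  }'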

It's true that node8, being newer, probably has less "extra data", but after adding it and waiting for Reaper to repair all tables, I ran "nodetool cleanup" on all the other nodes, so that shouldn't be it.
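
(For the record, that cleanup amounted to something like the following; the placeholder host names and the serial ssh loop are just for illustration.)

for h in node1 node2 node3 node4 node5 node7; do
  ssh "$h" nodetool cleanup    # drop data the node no longer owns
done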

Also, the tables that account for 99.9% of the used space (including the one in the graph above) have millions of rows and include a timeuuid in the partition key, so they should distribute evenly across all token ranges.
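
If it helps narrow things down, the per-node footprint of those big tables can be compared with something like this (the keyspace and table names are placeholders):

for h in node1 node2 node3 node4 node5 node7 node8; do
  echo "== $h"
  ssh "$h" nodetool tablestats my_ks.my_table | \
    grep -E 'Space used \(live\)|Number of partitions'
done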

Is there any other reason for the load imbalance that I haven't thought of?
Is there a way to force things back to a balanced state?

--
Lapo Luchini
l...@lapo.it

