I had a 5-node cluster, then increased it to 6, then 7, then 8, and then went back to 7. I installed 3.11.6 back when num_tokens defaulted to 256, so as far as I understand it should scale out to new nodes with nicely balanced ownership (at the cost of longer repairs), but I get this status:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address            Load         Tokens  Owns
UN  node1              1.08 TiB     256     46.4%
UN  node2              1.06 TiB     256     45.8%
UN  node3              1.02 TiB     256     45.1%
UN  node4              1.01 TiB     256     46.6%
UN  node5              994.92 GiB   256     44.0%
UN  node7              1.04 TiB     256     38.1%
UN  node8              882.03 GiB   256     33.9%

(I renamed the nodes and sorted them by the order in which they joined the cluster; note that node6 was decommissioned and later replaced by node8)
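
(In case it matters: passing an explicit keyspace to "nodetool status" makes the Owns column report effective ownership, i.e. including replicas, for that keyspace's replication settings; the keyspace name below is just a placeholder.)

% nodetool status my_keyspace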

This is a Prometheus+Grafana graph of a new table being populated (the table was created when the cluster was already stable with node8):

https://i.imgur.com/CLDLENU.png

I don't understand why node7 (in blue) and node8 (in red) hold much less data than the others, as consistently reported both by the "Owns" column and by the graph.
PS: the purple line at the top is the disaster recovery node in a remote location; it is a single node rather than a cluster, so it is expected to carry much more load than the others.

I tried summing all the token ranges from "nodetool ring" and they are quite well balanced (as expected with 256 vnodes per node, I guess):

% nodetool ring | awk '/^=/ { prev = -1 } /^[0-9]/ { ip = $1; pos = $8; if (prev != -1) host[ip] += pos - prev; prev= pos; } END { tot = 0; for (ip in host) if (ip != "nodeDR") tot += host[ip]; for (ip in host) print host[ip] / tot, ip; }'
0.992797 nodeDR
0.146039 node1
0.148853 node2
0.139175 node3
0.135932 node4
0.140542 node5
0.143875 node7
0.145583 node8
(yes, I know it has a slight bias because it doesn't handle the first range of each section correctly, but that's less than 0.8%; a variant that accounts for the wrap-around range is sketched below)
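
For completeness, here is a sketch of the same idea that credits the wrap-around range (from the last token back around to the first) to the node owning the first token. It assumes Murmur3Partitioner (tokens in [-2^63, 2^63-1]) and the same "nodetool ring" field layout as the one-liner above; awk doubles lose a little precision at that scale, but not enough to matter for percentages.

nodetool ring | awk '
  function closedc() {                # credit the wrap-around range of the section
    if (first != "") host[firstip] += (first - prev) + 2^64;
    first = "";
  }
  /^=/     { closedc() }              # a new datacenter section starts
  /^[0-9]/ {
    ip = $1; pos = $8;
    if (first == "") { first = pos; firstip = ip }
    else             { host[ip] += pos - prev }
    prev = pos;
  }
  END {
    closedc();
    for (ip in host) if (ip != "nodeDR") tot += host[ip];
    for (ip in host) print host[ip] / tot, ip;
  }'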

It's true that node8, being newer, probably has less "extra data", but after adding it and waiting for Reaper to repair all tables, I ran "nodetool cleanup" on all the other nodes, so that shouldn't be it.
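
(For the record, that cleanup amounted to something like the following; the placeholder host names and the serial ssh loop are just for illustration.)

for h in node1 node2 node3 node4 node5 node7; do
  ssh "$h" nodetool cleanup    # drop data the node no longer owns
done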

Also, the tables that account for 99.9% of the used space (including the one in the graph above) have millions of rows and include a timeuuid in the partition key, so they should distribute evenly across all token ranges.
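
If it helps narrow things down, the per-node footprint of those big tables can be compared with something like this (the keyspace and table names are placeholders):

for h in node1 node2 node3 node4 node5 node7 node8; do
  echo "== $h"
  ssh "$h" nodetool tablestats my_ks.my_table | \
    grep -E 'Space used \(live\)|Number of partitions'
done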

Is there any other reason for the load imbalance that I haven't thought of?
Is there a way to force things back to a balanced state?

--
Lapo Luchini
l...@lapo.it

