Hi all!

We run a 1.5 PB cluster with 12 hosts and 192 OSDs (a mix of NVMe and HDD) and need 
to improve our failure domain by altering the CRUSH rules and moving the racks into 
pods, which would imply a lot of data movement.

I wonder what the preferred order of operations would be when making such changes 
to the CRUSH map and pools. Would data movement be minimized by moving all racks 
into pods at once and changing the pool replication rules in the same step, or is 
it better to first move the racks into pods one by one and then change the pools' 
failure domain from rack to pod? Either way, I guess it's good practice to set 
'norebalance' before moving the buckets and unset it to start the actual data 
movement?
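For reference, this is roughly what I had in mind for the flag handling, plus a way 
to estimate the movement offline before touching anything. It's only a sketch, 
untested on our cluster, and the file names and rule id are just placeholders:

    # Dry-run: decompile the current CRUSH map, edit a copy offline,
    # and compare the test mappings to get a feel for how much would move
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # (edit crushmap.txt: add pod buckets, move racks, adjust the rule)
    crushtool -c crushmap.txt -o crushmap.new
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings > before.txt
    crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-mappings > after.txt
    diff before.txt after.txt | grep -c '^>'    # rough count of changed mappings

    # When actually doing it: pause movement, make all changes, then release
    ceph balancer off
    ceph osd set norebalance
    ceph osd set nobackfill
    # ... bucket moves and rule/pool changes here ...
    ceph osd unset nobackfill
    ceph osd unset norebalance
    ceph balancer on

If I understand it correctly, with norebalance/nobackfill set the affected PGs should 
just show up as remapped/misplaced in 'ceph -s' after the changes, so we could at 
least see how much data would move before letting backfill start.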

Right now we have the following setup:

root -> rack2 -> ups1 + node51 + node57 + switch21
root -> rack3 -> ups2 + node52 + node58 + switch22
root -> rack4 -> ups3 + node53 + node59 + switch23
root -> rack5 -> ups4 + node54 + node60 (shares switch21 with rack2)
root -> rack6 -> ups5 + node55 + node61 (shares switch22 with rack3)
root -> rack7 -> ups6 + node56 + node62 (shares switch23 with rack4)

Note that racks 5-7 are connected to the same ToR switches as racks 2-4. The cluster 
and frontend networks are in separate VXLANs, connected with dual 40GbE. The failure 
domain for our 3x replicated pools is currently rack, and after adding hosts 57-62 
we realized that if one of the switches reboots or fails, PGs whose replicas are 
located only on the four hosts behind that switch will become unavailable and force 
the pools offline. I guess the better approach would instead be to organize the 
racks into pods like this:

root -> pod1 -> rack2 -> ups1 + node51 + node57 (switch21)
root -> pod1 -> rack5 -> ups4 + node54 + node60 (switch21)
root -> pod2 -> rack3 -> ups2 + node52 + node58 (switch22)
root -> pod2 -> rack6 -> ups5 + node55 + node61 (switch22)
root -> pod3 -> rack4 -> ups3 + node53 + node59 (switch23)
root -> pod3 -> rack7 -> ups6 + node56 + node62 (switch23)
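As far as I can tell, 'pod' is already one of the default CRUSH bucket types, so I 
imagine the restructuring itself would look something like the sketch below. It 
assumes our root bucket is named 'default' and uses a placeholder pool name:

    # Create the pod buckets and hang them off the root
    ceph osd crush add-bucket pod1 pod
    ceph osd crush add-bucket pod2 pod
    ceph osd crush add-bucket pod3 pod
    ceph osd crush move pod1 root=default
    ceph osd crush move pod2 root=default
    ceph osd crush move pod3 root=default

    # Move the racks (and the hosts under them) into their pods
    ceph osd crush move rack2 pod=pod1
    ceph osd crush move rack5 pod=pod1
    ceph osd crush move rack3 pod=pod2
    ceph osd crush move rack6 pod=pod2
    ceph osd crush move rack4 pod=pod3
    ceph osd crush move rack7 pod=pod3

    # New replicated rule with pod as the failure domain, then repoint the pools
    ceph osd crush rule create-replicated replicated_pod default pod
    ceph osd pool set <poolname> crush_rule replicated_pod

but I'm unsure whether it matters if the rule switch happens before or after the 
rack moves.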

The reason for this arrangement is that we plan to place the pods in different 
buildings in the future. We're running Nautilus 14.2.16 and are about to upgrade 
to Octopus. Should we upgrade to Octopus before making the CRUSH changes?

Any thoughts or insight on how to achieve this with minimal data movement and 
risk of cluster downtime would be welcome!


--thomas

--
Thomas Hukkelberg
tho...@hovedkvarteret.no