This is anecdotal, but I have never seen a single cluster that big. You might want to ask about professional support, which should give you a better perspective at that kind of scale.
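As a rough sanity check on the rolling-reboot test described in the quoted message below (32 of 88 nodes, one every 30 s, about 120 s per boot), here is a minimal Python sketch of the quorum arithmetic. It assumes one vote per node, a simple majority quorum, and a fixed 120 s downtime per node; the function names are made up for illustration. It says nothing about corosync membership traffic, which is where the reported problems actually show up.

```python
def quorum_votes(total_nodes: int) -> int:
    """Votes needed for a strict majority, assuming one vote per node."""
    return total_nodes // 2 + 1


def max_nodes_down(reboots: int, interval_s: float, downtime_s: float) -> int:
    """Peak number of nodes offline at the same time during a rolling reboot.

    Assumption: node i goes down at i * interval_s and is back exactly
    downtime_s later (counted as up at the instant it returns).
    """
    starts = [i * interval_s for i in range(reboots)]
    peak = 0
    # The peak can only be reached at a moment when another node goes down.
    for t in starts:
        down = sum(1 for s in starts if s <= t < s + downtime_s)
        peak = max(peak, down)
    return peak


TOTAL_NODES = 88
needed = quorum_votes(TOTAL_NODES)
peak_down = max_nodes_down(reboots=32, interval_s=30, downtime_s=120)
print(f"quorum needs {needed} votes, peak offline {peak_down}, "
      f"remaining {TOTAL_NODES - peak_down}")
```

Under those assumptions at most 4 nodes are offline at once, leaving 84 against a quorum of 45, which matches the statement in the quoted message that quorum is not lost during the test.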
On Thu, Jun 24, 2021 at 10:30 AM Eneko Lacunza via pve-user <[email protected]> wrote:

> ---------- Forwarded message ----------
> From: Eneko Lacunza <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Bcc:
> Date: Thu, 24 Jun 2021 16:30:31 +0200
> Subject: BIG cluster questions
> Hi all,
>
> We're currently helping a customer to configure a virtualization cluster
> with 88 servers for VDI.
>
> Right now we're testing the feasibility of building just one Proxmox
> cluster of 88 nodes. A 4-node cluster has been configured too for
> comparing both (same server and networking/racks).
>
> Nodes have 2 NICs 2x25Gbps each. Currently there are two LACP bonds
> configured (one for each NIC); one for storage (NFS v4.2) and the other
> for the rest (VMs, cluster).
>
> Cluster has two rings, one on each bond.
>
> - With clusters at rest (no significant number of VMs running), we see
> quite a different corosync/knet latency average on our 88 node cluster
> (~300-400) and our 4-node cluster (<100).
>
>
> For 88-node cluster:
>
> - Creating some VMs (let's say 16), one each 30s, works well.
> - Destroying some VMs (let's say 16), one each 30s, outputs error
> messages (storage cfs lock related) and fails removing some of the VMs.
>
> - Rebooting 32 nodes, one each 30 seconds (boot for a node is about
> 120s) so that no quorum is lost, creates a cluster traffic "flood". Some
> of the rebooted nodes don't rejoin the cluster, and WUI shows all nodes
> in cluster quorum with a grey ?, instead of green OK. In this situation
> corosync latency in some nodes can skyrocket to 10s or 100s times the
> values before the reboots. Access to pmxcfs is very slow and we have
> been able to fix the issue only by rebooting all nodes.
>
> - We have tried changing the transport of knet in a ring from UDP to
> SCTP as reported here:
>
> https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/page-2
> that gives better latencies for corosync, but the reboot issue continues.
>
> We don't know whether both issues are related or not.
>
> Could LACP bonds be the issue?
>
> https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_network_configuration
> "
> If your switch support the LACP (IEEE 802.3ad) protocol then we
> recommend using the corresponding bonding mode (802.3ad). Otherwise you
> should generally use the active-backup mode.
> If you intend to run your cluster network on the bonding interfaces,
> then you have to use active-passive mode on the bonding interfaces,
> other modes are unsupported.
> "
> As per the second line, we understand that running cluster networking over a
> LACP bond is not supported (just to confirm our interpretation)? We're
> in the process of reconfiguring nodes/switches to test without a bond,
> to see if that gives us a stable cluster (will report on this). Do you
> think this could be the issue?
>
>
> Now for more general questions; do you think an 88-node Proxmox VE
> cluster is feasible?
>
> Those 88 nodes will host about 14,000 VMs. Will HA manager be able to
> manage them, or are they too many? (HA for those VMs doesn't seem to be
> a requirement right now).
>
>
> Thanks a lot
> Eneko
>
>
> Eneko Lacunza
>
> CTO | Zuzendari teknikoa
>
> Binovo IT Human Project
>
> 943 569 206
>
> [email protected]
>
> binovo.es
>
> Astigarragako Bidea, 2 - 2 izda.
> Oficina 10-11, 20180 Oiartzun
>
> youtube <https://www.youtube.com/user/CANALBINOVO/>
> linkedin <https://www.linkedin.com/company/37269706/>

_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
