Re: [PVE-User] BIG cluster questions

Eneko Lacunza via pve-user Fri, 25 Jun 2021 01:06:47 -0700

--- Begin Message ---
Hi,
We have tested without bonding, same issues.

On 24/6/21 at 16:30, Eneko Lacunza wrote:
Hi all,
We're currently helping a customer to configure a virtualization cluster with 88 servers for VDI.
Right now we're testing the feasibility of building just one Proxmox cluster of 88 nodes. A 4-node cluster has also been configured for comparison (same servers and networking/racks).
Nodes have 2 NICs, 2x25Gbps each. Currently there are two LACP bonds configured (one for each NIC); one for storage (NFS v4.2) and the other for the rest (VMs, cluster).
Cluster has two rings, one on each bond.
- With clusters at rest (no significant number of VMs running), we see quite a different corosync/knet latency average on our 88-node cluster (~300-400) and our 4-node cluster (<100).
For the 88-node cluster:

- Creating some VMs (let's say 16), one every 30s, works well.
- Destroying some VMs (let's say 16), one every 30s, outputs error messages (storage cfs lock related) and fails to remove some of the VMs.
- Rebooting 32 nodes, one every 30 seconds (boot for a node takes about 120s) so that quorum is not lost, creates a cluster traffic "flood". Some of the rebooted nodes don't rejoin the cluster, and the WUI shows all nodes in cluster quorum with a grey ? instead of a green OK. In this situation corosync latency on some nodes can skyrocket to tens or hundreds of times the values before the reboots. Access to pmxcfs is very slow and we have been able to fix the issue only by rebooting all nodes.
- We have tried changing the knet transport in one ring from UDP to SCTP as reported here:
https://forum.proxmox.com/threads/proxmox-6-2-corosync-3-rare-and-spontaneous-disruptive-udp-5405-storm-flood.75871/page-2
that gives better latencies for corosync, but the reboot issue continues.

We don't know whether both issues are related or not.
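
As a rough sanity check on the rolling-reboot test above, the sketch below estimates how many nodes are down at once and compares that with the votes needed for quorum. It assumes one vote per node and a simple majority (floor(n/2) + 1), and it treats the 120s boot time as the whole outage window; shutdown and corosync rejoin time are not part of the figures above and would raise the count. The function names are illustrative only.

```python
import math

def votes_needed(total_nodes: int) -> int:
    """Simple-majority quorum, assuming one vote per node."""
    return total_nodes // 2 + 1

def max_nodes_down(interval_s: float, downtime_s: float) -> int:
    """Upper bound on nodes down at the same time when one node is
    rebooted every interval_s seconds and each stays down downtime_s."""
    return math.ceil(downtime_s / interval_s)

TOTAL_NODES = 88        # cluster size from the test
REBOOT_INTERVAL_S = 30  # one reboot every 30 seconds
BOOT_TIME_S = 120       # assumption: boot time is the whole outage window

quorum = votes_needed(TOTAL_NODES)
down = max_nodes_down(REBOOT_INTERVAL_S, BOOT_TIME_S)
online = TOTAL_NODES - down

print(f"votes needed for quorum: {quorum}")
print(f"nodes down at once (upper bound): {down}")
print(f"nodes still online: {online}, quorum kept: {online >= quorum}")
```

Under these assumptions only a handful of nodes are offline at any moment, so the test stays far above the quorum threshold; the problems described look like a membership/traffic issue rather than a vote-count issue.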

Could LACP bonds be the issue?
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_network_configuration
"
If your switch support the LACP (IEEE 802.3ad) protocol then we recommend using the corresponding bonding mode (802.3ad). Otherwise you should generally use the active-backup mode.
If you intend to run your cluster network on the bonding interfaces, then you have to use active-passive mode on the bonding interfaces, other modes are unsupported.
"
As per the second line, we understand that running cluster networking over a LACP bond is not supported (just to confirm our interpretation)?
We're in the process of reconfiguring nodes/switches to test without a bond, to see if that gives us a stable cluster (will report on this). Do you think this could be the issue?
Now for more general questions: do you think an 88-node Proxmox VE cluster is feasible?
Those 88 nodes will host about 14,000 VMs. Will the HA manager be able to manage them, or are they too many? (HA for those VMs doesn't seem to be a requirement right now.)
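
To put the VM count in perspective, here is a small back-of-the-envelope calculation of the average VM density per node. It assumes the VMs are spread evenly across nodes, which the figures above do not state; the helper name is hypothetical.

```python
TOTAL_VMS = 14_000   # approximate VM count for the whole cluster
TOTAL_NODES = 88     # planned cluster size

def vms_per_node(total_vms: int, nodes_online: int) -> float:
    """Average VMs per node, assuming an even spread (an assumption)."""
    return total_vms / nodes_online

# Average density with all nodes up, and with a few nodes offline.
for offline in (0, 4, 8):
    online = TOTAL_NODES - offline
    print(f"{online} nodes online: ~{vms_per_node(TOTAL_VMS, online):.0f} VMs per node")
```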
Thanks a lot
Eneko
Eneko Lacunza
CTO | Zuzendari teknikoa
Binovo IT Human Project

Tel. 943 569 206
[email protected]
binovo.es
Astigarragako Bidea, 2 - 2 izda. Oficina 10-11, 20180 Oiartzun

youtube <https://www.youtube.com/user/CANALBINOVO/>
linkedin <https://www.linkedin.com/company/37269706/>
--- End Message ---

_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
