--- Begin Message ---
Hi all,

We have split the big 88-node cluster into 6 clusters of 15 nodes each (there were some spare servers); things seem much better now :)

Sadly, we are seeing some issues when the VDI management system (USD Enterprise) performs mass destruction and creation of VMs (on the order of hundreds or even thousands). A fraction of the clone operations fail with the following message:

"Error: clone failed. Failed to change directory to '/mnt/pve/vdi-prod1/images/103': No such file or directory at /usr/share/perl5/PVE/Storage/Plugin.pm line 708."

This happens when the destroy for that VMID ran only a few seconds earlier (5s-14s, for example). When another clone tries to use that VMID later (as little as 54s after the destruction), it works fine.
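
As a possible workaround on the orchestration side, here is a minimal sketch that holds a destroyed VMID back for a cooldown period before it can be reused. Everything in it is hypothetical: the class name, the 60s default (chosen only because reuse after 54s worked for us; the real safe value is unknown), and the idea that the VDI system lets us control VMID selection at all.

```python
import time


class VmidCooldown:
    """Hold recently destroyed VMIDs back until a cooldown period has passed.

    Hypothetical helper: neither PVE nor the VDI system provides this class.
    """

    def __init__(self, cooldown_s=60.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s   # assumed value, not a measured safe limit
        self.clock = clock             # injectable so the logic can be tested
        self._destroyed_at = {}        # vmid -> clock value at destruction

    def mark_destroyed(self, vmid):
        """Record that `vmid` has just been destroyed."""
        self._destroyed_at[vmid] = self.clock()

    def is_reusable(self, vmid):
        """True if `vmid` was never destroyed or its cooldown has expired."""
        destroyed = self._destroyed_at.get(vmid)
        if destroyed is None:
            return True
        if self.clock() - destroyed >= self.cooldown_s:
            del self._destroyed_at[vmid]
            return True
        return False

    def pick(self, candidates):
        """Return the first reusable VMID from `candidates`, or None."""
        for vmid in candidates:
            if self.is_reusable(vmid):
                return vmid
        return None
```

The caller would invoke mark_destroyed() right after each destroy and pick() before each clone, so a VMID like 103 is skipped in favour of another free one until the cooldown is over.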

PVE version is 6.4, installed from the ISO (details below), and storage is NFS 4.2 with pNFS, served by two pairs of NetApp servers in HA.

It looks like a race condition, where the node doing the clone is late to see that the destroy has removed the storage directory (?).

I have checked "qemu-server.git/PVE/QemuServer.pm:sub destroy_vm" and I see that the storage disks are freed first and the VM config is removed after that, which seems correct. Could it be that the NFS servers are a bit "late" propagating the directory removal to the client nodes?
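
To test that hypothesis, one could measure how long a given node keeps seeing a directory after another node has removed it. The following is only a sketch under assumptions: the function name and defaults are made up, and it relies on os.path.isdir(), whose answer may come from the NFS client's attribute cache (which is exactly the delay we want to observe).

```python
import os
import time


def seconds_until_gone(path, timeout_s=120.0, interval_s=0.5,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until this node no longer sees `path` as a directory.

    Returns the elapsed seconds, or None if the directory is still
    visible once `timeout_s` has passed. Hypothetical diagnostic helper.
    """
    start = clock()
    while True:
        if not os.path.isdir(path):
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Started on the cloning node just before the destroy runs on another node, with a path such as /mnt/pve/vdi-prod1/images/103, the returned value would give a rough idea of how stale that node's view of the storage is.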

Any ideas?

Thanks

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/



--- End Message ---
_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
