Hi all!

Since about a week ago, when we added new nodes to the cluster and upgraded everything to the latest Proxmox 6.4 (with the ultimate goal of upgrading all the nodes to 7.1, though not in the immediate future), *one* of the VMs stopped backing up. The backup job was stuck, and once we manually terminated it the VM stayed frozen; only a hard poweroff/poweron brought the VM back.

In the logs we see a lot of entries like the following:

VM 0000 qmp command failed - VM 0000 qmp command 'query-proxmox-support' failed - unable to connect to VM 0000 qmp socket - timeout after 31 retries

I searched for it and found multiple threads on the forum, so in some form it is a known issue, but I'm curious what the trigger was and what we could do to work around the problem (apart from upgrading to PVE 7.1, which we will do, but not this week).
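If it helps with debugging, next time the VM hangs I can at least check whether QEMU still answers over QMP at all, with something along these lines (1043 is the VMID taken from the disk names below; the socat probe is just my own quick sanity check, not an official procedure):

  # ask qemu-server for the VM status, which goes through QMP
  qm status 1043 --verbose

  # poke the QMP socket directly; a healthy VM sends a JSON greeting back
  echo '{"execute":"qmp_capabilities"}' | socat - UNIX-CONNECT:/var/run/qemu-server/1043.qmp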

Can you give me some advice?

To summarize the work we did last week (which is when the backup stopped working):

- Did a full upgrade on all the cluster nodes and rebooted them.
- Upgraded Ceph from Nautilus to Octopus (see the rough sketch after this list).
- Installed new Ceph OSDs on the new nodes (8 out of 16).
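For completeness, the Ceph upgrade followed the usual per-node sequence from the Nautilus-to-Octopus guide; roughly this (simplified from memory, so take the exact order with a grain of salt):

  ceph osd set noout                      # before touching the first node

  # then, on each node in turn:
  apt update && apt full-upgrade          # pulls in the Octopus packages
  systemctl restart ceph-mon.target       # monitors first
  systemctl restart ceph-mgr.target
  systemctl restart ceph-osd.target       # then the OSDs on that node

  # once every daemon runs Octopus:
  ceph osd require-osd-release octopus
  ceph osd unset noout

  # the new OSDs on the new nodes were created with:
  pveceph osd create /dev/sdX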

The problematic VM was running (back when it wasn't problematic) on a node which (at that moment) wasn't part of the Ceph cluster (but the storage was, and still is, always Ceph). We migrated it to a different node but had the same issues. The VM has 12 RBD disks (a lot more than the cluster average) and all the disks are backed up to an NFS share.
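The backup itself is a normal vzdump job to an NFS storage; run by hand it would be roughly the following (the storage name 'nfs-backup' is just a placeholder, not our real one):

  vzdump 1043 --storage nfs-backup --mode snapshot --compress zstd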

Because the problem is *only* on that particular VM, I could split it into 2 VMs and rearrange the number of disks (to be more in line with the cluster average), or I could rush the upgrade to 7.1 (hoping that the problem only affects PVE 6.4...).

Here is the VM config:

agent: 1
bootdisk: virtio0
cores: 4
ide2: none,media=cdrom
memory: 4096
name: problematic-vm
net0: virtio=A2:69:F4:8C:38:22,bridge=vmbr0,tag=000
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=8bd477be-69ac-4b51-9c5a-a149f96da521
sockets: 1
virtio0: rbd_vm:vm-1043-disk-0,size=8G
virtio1: rbd_vm:vm-1043-disk-1,size=100G
virtio10: rbd_vm:vm-1043-disk-10,size=30G
virtio11: rbd_vm:vm-1043-disk-11,size=100G
virtio12: rbd_vm:vm-1043-disk-12,size=200G
virtio2: rbd_vm:vm-1043-disk-2,size=100G
virtio3: rbd_vm:vm-1043-disk-3,size=20G
virtio4: rbd_vm:vm-1043-disk-4,size=20G
virtio5: rbd_vm:vm-1043-disk-5,size=30G
virtio6: rbd_vm:vm-1043-disk-6,size=100G
virtio7: rbd_vm:vm-1043-disk-7,size=200G
virtio8: rbd_vm:vm-1043-disk-8,size=20G
virtio9: rbd_vm:vm-1043-disk-9,size=20G

The VM is a CentOS 7 NFS server.

The Ceph cluster health is OK:

  cluster:
    id:     645e8181-8424-41c4-9bc9-7e37b740e9a9
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum node-01,node-02,node-03,node-05,node-07 (age 8d)
    mgr: node-01(active, since 8d), standbys: node-03, node-02, node-07, node-05
    osd: 120 osds: 120 up (since 6d), 120 in (since 6d)

  task status:

  data:
    pools:   3 pools, 1057 pgs
    objects: 4.65M objects, 17 TiB
    usage:   67 TiB used, 139 TiB / 207 TiB avail
    pgs:     1056 active+clean
             1    active+clean+scrubbing+deep


All of the nodes have the same PVE version:

proxmox-ve: 6.4-1 (running kernel: 5.4.157-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-11
pve-kernel-helper: 6.4-11
pve-kernel-5.4.157-1-pve: 5.4.157-1
pve-kernel-5.4.140-1-pve: 5.4.140-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 15.2.15-pve1~bpo10
ceph-fuse: 15.2.15-pve1~bpo10
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.6-pve1~bpo10+1

I can provide more information if necessary.

Cheers
Iztok

