Hello,
I recently migrated our customer's cluster to newer hardware (CentOS 8 Stream, 4 hypervisor nodes, 3 hosts serving GlusterFS from 5x 6 TB SSDs as JBOD, replica 3). About a month after the switch we started seeing frequent VM lock-ups that require a host reboot to clear. Affected VMs cannot be powered down from the oVirt UI, and even when oVirt does manage to power one down, it cannot be booted again because oVirt reports the OS disk as being in use. Once I reboot the host, the VMs can be started again and everything works fine.
In the vdsm log I see the following error:
2023-05-11 19:33:12,339+0200 ERROR (qgapoller/1) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f553aa3e470>> operation failed (periodic:187)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 185, in __call__
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 476, in _poller
    vm_id, self._qga_call_get_vcpus(vm_obj))
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 797, in _qga_call_get_vcpus
    if 'online' in vcpus:
TypeError: argument of type 'NoneType' is not iterable
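As far as I can tell, this TypeError is vdsm tripping over an empty guest-agent reply once the VM has already hung, rather than the cause itself. A minimal illustration of the failure mode and the obvious guard (my sketch, not the actual vdsm code):

    # Minimal illustration of the failure mode seen in the traceback
    # (sketch only, not the vdsm source).
    vcpus = None  # what the guest-get-vcpus call yields once the agent is wedged

    try:
        'online' in vcpus
    except TypeError as e:
        print(e)  # argument of type 'NoneType' is not iterable -- as in the log

    # Defensive form: skip the VM instead of failing the poller iteration.
    if vcpus is not None and 'online' in vcpus:
        print('agent reported vCPU state')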
/var/log/messages reports:
May 11 19:35:15 kernel: task:CPU 7/KVM state:D stack: 0 pid: 7065 ppid: 1 flags: 0x80000182
May 11 19:35:15 kernel: Call Trace:
May 11 19:35:15 kernel: __schedule+0x2d1/0x870
May 11 19:35:15 kernel: schedule+0x55/0xf0
May 11 19:35:15 kernel: schedule_preempt_disabled+0xa/0x10
May 11 19:35:15 kernel: rwsem_down_read_slowpath+0x26e/0x3f0
May 11 19:35:15 kernel: down_read+0x95/0xa0
May 11 19:35:15 kernel: get_user_pages_unlocked+0x66/0x2a0
May 11 19:35:15 kernel: hva_to_pfn+0xf5/0x430 [kvm]
May 11 19:35:15 kernel: kvm_faultin_pfn+0x95/0x2e0 [kvm]
May 11 19:35:15 kernel: ? select_task_rq_fair+0x355/0x990
May 11 19:35:15 kernel: ? sched_clock+0x5/0x10
May 11 19:35:15 kernel: ? sched_clock_cpu+0xc/0xb0
May 11 19:35:15 kernel: direct_page_fault+0x3b4/0x860 [kvm]
May 11 19:35:15 kernel: kvm_mmu_page_fault+0x114/0x680 [kvm]
May 11 19:35:15 kernel: ? vmx_vmexit+0x9f/0x70d [kvm_intel]
May 11 19:35:15 kernel: ? vmx_vmexit+0xae/0x70d [kvm_intel]
May 11 19:35:15 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
May 11 19:35:15 kernel: vmx_handle_exit+0x177/0x770 [kvm_intel]
May 11 19:35:15 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
May 11 19:35:15 kernel: vcpu_enter_guest+0xafd/0x18e0 [kvm]
May 11 19:35:15 kernel: ? hrtimer_try_to_cancel+0x7b/0x100
May 11 19:35:15 kernel: kvm_arch_vcpu_ioctl_run+0x112/0x600 [kvm]
May 11 19:35:15 kernel: kvm_vcpu_ioctl+0x2c9/0x640 [kvm]
May 11 19:35:15 kernel: ? pollwake+0x74/0xa0
May 11 19:35:15 kernel: ? wake_up_q+0x70/0x70
May 11 19:35:15 kernel: ? __wake_up_common+0x7a/0x190
May 11 19:35:15 kernel: do_vfs_ioctl+0xa4/0x690
May 11 19:35:15 kernel: ksys_ioctl+0x64/0xa0
May 11 19:35:15 kernel: __x64_sys_ioctl+0x16/0x20
May 11 19:35:15 kernel: do_syscall_64+0x5b/0x1b0
May 11 19:35:15 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
May 11 19:35:15 kernel: RIP: 0033:0x7faf1a1387cb
May 11 19:35:15 kernel: Code: Unable to access opcode bytes at RIP 0x7faf1a1387a1.
May 11 19:35:15 kernel: RSP: 002b:00007fa6f5ffa6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 11 19:35:15 kernel: RAX: ffffffffffffffda RBX: 000055be52e7bcf0 RCX: 00007faf1a1387cb
May 11 19:35:15 kernel: RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000027
May 11 19:35:15 kernel: RBP: 0000000000000000 R08: 000055be5158c6a8 R09: 00000007d9e95a00
May 11 19:35:15 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
May 11 19:35:15 kernel: R13: 000055be515bcfc0 R14: 00007fffec958800 R15: 00007faf1d6c6000
May 11 19:35:15 kernel: INFO: task worker:714626 blocked for more than 120 seconds.
May 11 19:35:15 kernel: Not tainted 4.18.0-489.el8.x86_64 #1
May 11 19:35:15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 11 19:35:15 kernel: task:worker state:D stack: 0 pid:714626 ppid: 1 flags:0x00000180
May 11 19:35:15 kernel: Call Trace:
May 11 19:35:15 kernel: __schedule+0x2d1/0x870
May 11 19:35:15 kernel: schedule+0x55/0xf0
May 11 19:35:15 kernel: schedule_preempt_disabled+0xa/0x10
May 11 19:35:15 kernel: rwsem_down_read_slowpath+0x26e/0x3f0
May 11 19:35:15 kernel: down_read+0x95/0xa0
May 11 19:35:15 kernel: do_madvise.part.30+0x2c3/0xa40
May 11 19:35:15 kernel: ? syscall_trace_enter+0x1ff/0x2d0
May 11 19:35:15 kernel: ? __x64_sys_madvise+0x26/0x30
May 11 19:35:15 kernel: __x64_sys_madvise+0x26/0x30
May 11 19:35:15 kernel: do_syscall_64+0x5b/0x1b0
May 11 19:35:15 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
May 11 19:35:15 kernel: RIP: 0033:0x7faf1a138a4b
May 11 19:35:15 kernel: Code: Unable to access opcode bytes at RIP 0x7faf1a138a21.
May 11 19:35:15 kernel: RSP: 002b:00007faf151ea7f8 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
May 11 19:35:15 kernel: RAX: ffffffffffffffda RBX: 00007faf149eb000 RCX: 00007faf1a138a4b
May 11 19:35:15 kernel: RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007faf149eb000
May 11 19:35:15 kernel: RBP: 0000000000000000 R08: 00000007faf080ba R09: 00000000ffffffff
May 11 19:35:15 kernel: R10: 00007faf151ea760 R11: 0000000000000206 R12: 00007faf15aec48e
May 11 19:35:15 kernel: R13: 00007faf15aec48f R14: 00007faf151eb700 R15: 00007faf151ea8c0
May 11 19:35:15 kernel: INFO: task worker:714628 blocked for more than 120 seconds.
May 11 19:35:15 kernel: Not tainted 4.18.0-489.el8.x86_64 #1
May 11 19:35:15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
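Both call traces end in rwsem_down_read_slowpath (one via get_user_pages_unlocked, one via do_madvise), so the tasks appear to be waiting on the same mmap semaphore of the qemu process. Since ps itself hangs in this state (it reads /proc/<pid>/cmdline, which likely needs that same semaphore), the following rough sketch is how I still enumerate the stuck tasks; running 'echo w > /proc/sysrq-trigger' afterwards dumps their kernel stacks to dmesg:

    import os

    # List tasks in uninterruptible sleep (state D), including qemu's
    # per-vCPU threads under /proc/<tgid>/task/. Reads only stat and
    # comm, which do not take the mmap semaphore that cmdline (and
    # therefore ps) blocks on. Sketch only; run as root on an affected
    # host while a VM is locked up.
    for tgid in filter(str.isdigit, os.listdir('/proc')):
        task_dir = '/proc/%s/task' % tgid
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open('%s/%s/stat' % (task_dir, tid)) as f:
                    stat = f.read()
                # the state flag is the first field after the parenthesised comm
                state = stat.rsplit(')', 1)[1].split()[0]
                if state == 'D':
                    with open('%s/%s/comm' % (task_dir, tid)) as f:
                        print(tid, f.read().strip())
            except OSError:
                continue  # thread exited mid-scan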
Installed VDSM packages:
vdsm-api-4.50.3.4-1.el8.noarch
vdsm-network-4.50.3.4-1.el8.x86_64
vdsm-yajsonrpc-4.50.3.4-1.el8.noarch
vdsm-http-4.50.3.4-1.el8.noarch
vdsm-client-4.50.3.4-1.el8.noarch
vdsm-4.50.3.4-1.el8.x86_64
vdsm-gluster-4.50.3.4-1.el8.x86_64
vdsm-python-4.50.3.4-1.el8.noarch
vdsm-jsonrpc-4.50.3.4-1.el8.noarch
vdsm-common-4.50.3.4-1.el8.noarch
Installed libvirt packages:
libvirt-client-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-nodedev-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-logical-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-network-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-qemu-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-scsi-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-core-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-config-network-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-iscsi-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-rbd-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-libs-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-config-nwfilter-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-secret-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-disk-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-mpath-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-gluster-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
python3-libvirt-8.0.0-2.module_el8.7.0+1218+f626c2ff.x86_64
libvirt-daemon-driver-nwfilter-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-lock-sanlock-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-interface-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-iscsi-direct-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-kvm-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
While a VM is locked up it does not respond on the network, and I cannot use the VNC console (or any other console) to check what is happening from the VM's perspective. The host cannot even list its running processes. There are plenty of resources left; each host runs about 30-35 VMs.
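In case it helps anyone narrow this down: a quick guest-agent ping through python3-libvirt (already on the hosts, see the package list above) should show whether the agent inside a stuck VM still answers at all. A rough sketch; 'myvm' is a placeholder name, and it assumes a local root connection the way virsh uses it (adjust if vdsm's libvirt auth setup rejects the open call):

    import libvirt
    import libvirt_qemu

    # Ping the qemu-guest-agent of one VM through libvirt (sketch).
    conn = libvirt.open('qemu:///system')
    try:
        dom = conn.lookupByName('myvm')  # placeholder VM name
        # 5-second timeout: a timeout here points at the guest/agent side,
        # an immediate {"return": {}} points back at the vdsm poller.
        print(libvirt_qemu.qemuAgentCommand(dom, '{"execute": "guest-ping"}', 5, 0))
    except libvirt.libvirtError as e:
        print('agent unreachable:', e)
    finally:
        conn.close()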
At first I suspected GlusterFS (I use Gluster on other clusters and it usually works fine), so we migrated all VMs back to the old NFS storage, but the problem came back today on two hosts. I do not see these issues on another cluster that runs Rocky Linux 8.6 with hyperconverged GlusterFS, so as a last resort I will migrate the hosts from CentOS 8 Stream to Rocky 8.
Has anyone observed similar issues with oVirt hosts on CentOS 8 Stream? Any help is welcome, as I'm running out of ideas.