Hello,
I recently migrated our customer's cluster to newer hardware (CentOS 8 Stream, 4 hypervisor nodes, 3 hosts serving GlusterFS from 5x 6 TB SSDs as JBOD, replica 3). About a month after the switch we started seeing frequent VM lock-ups that require a host reboot to clear. Affected VMs cannot be powered down from the oVirt UI, and even when oVirt does manage to power one down, it cannot be booted again because oVirt reports the OS disk as being in use. Once I reboot the host, the VMs can be started again and everything works fine.
In the vdsm log I see the following error:
2023-05-11 19:33:12,339+0200 ERROR (qgapoller/1) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f553aa3e470>> operation failed (periodic:187)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 185, in __call__
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 476, in _poller
    vm_id, self._qga_call_get_vcpus(vm_obj))
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 797, in _qga_call_get_vcpus
    if 'online' in vcpus:
TypeError: argument of type 'NoneType' is not iterable
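As far as I can tell, this TypeError is vdsm tripping over an empty guest-agent reply once the VM has already hung, rather than the cause itself. A minimal illustration of the failure mode and the obvious guard (my sketch, not the actual vdsm code):

    # Minimal illustration of the failure mode seen in the traceback
    # (sketch only, not the vdsm source).
    vcpus = None  # what the guest-get-vcpus call yields once the agent is wedged

    try:
        'online' in vcpus
    except TypeError as e:
        print(e)  # argument of type 'NoneType' is not iterable -- as in the log

    # Defensive form: skip the VM instead of failing the poller iteration.
    if vcpus is not None and 'online' in vcpus:
        print('agent reported vCPU state')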
/var/log/messages reports:
May 11 19:35:15 kernel: task:CPU 7/KVM state:D stack: 0 pid: 7065 ppid: 1 flags: 0x80000182
May 11 19:35:15 kernel: Call Trace:
May 11 19:35:15 kernel: __schedule+0x2d1/0x870
May 11 19:35:15 kernel: schedule+0x55/0xf0
May 11 19:35:15 kernel: schedule_preempt_disabled+0xa/0x10
May 11 19:35:15 kernel: rwsem_down_read_slowpath+0x26e/0x3f0
May 11 19:35:15 kernel: down_read+0x95/0xa0
May 11 19:35:15 kernel: get_user_pages_unlocked+0x66/0x2a0
May 11 19:35:15 kernel: hva_to_pfn+0xf5/0x430 [kvm]
May 11 19:35:15 kernel: kvm_faultin_pfn+0x95/0x2e0 [kvm]
May 11 19:35:15 kernel: ? select_task_rq_fair+0x355/0x990
May 11 19:35:15 kernel: ? sched_clock+0x5/0x10
May 11 19:35:15 kernel: ? sched_clock_cpu+0xc/0xb0
May 11 19:35:15 kernel: direct_page_fault+0x3b4/0x860 [kvm]
May 11 19:35:15 kernel: kvm_mmu_page_fault+0x114/0x680 [kvm]
May 11 19:35:15 kernel: ? vmx_vmexit+0x9f/0x70d [kvm_intel]
May 11 19:35:15 kernel: ? vmx_vmexit+0xae/0x70d [kvm_intel]
May 11 19:35:15 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
May 11 19:35:15 kernel: vmx_handle_exit+0x177/0x770 [kvm_intel]
May 11 19:35:15 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
May 11 19:35:15 kernel: vcpu_enter_guest+0xafd/0x18e0 [kvm]
May 11 19:35:15 kernel: ? hrtimer_try_to_cancel+0x7b/0x100
May 11 19:35:15 kernel: kvm_arch_vcpu_ioctl_run+0x112/0x600 [kvm]
May 11 19:35:15 kernel: kvm_vcpu_ioctl+0x2c9/0x640 [kvm]
May 11 19:35:15 kernel: ? pollwake+0x74/0xa0
May 11 19:35:15 kernel: ? wake_up_q+0x70/0x70
May 11 19:35:15 kernel: ? __wake_up_common+0x7a/0x190
May 11 19:35:15 kernel: do_vfs_ioctl+0xa4/0x690
May 11 19:35:15 kernel: ksys_ioctl+0x64/0xa0
May 11 19:35:15 kernel: __x64_sys_ioctl+0x16/0x20
May 11 19:35:15 kernel: do_syscall_64+0x5b/0x1b0
May 11 19:35:15 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
May 11 19:35:15 kernel: RIP: 0033:0x7faf1a1387cb
May 11 19:35:15 kernel: Code: Unable to access opcode bytes at RIP 0x7faf1a1387a1.
May 11 19:35:15 kernel: RSP: 002b:00007fa6f5ffa6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 11 19:35:15 kernel: RAX: ffffffffffffffda RBX: 000055be52e7bcf0 RCX: 00007faf1a1387cb
May 11 19:35:15 kernel: RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000027
May 11 19:35:15 kernel: RBP: 0000000000000000 R08: 000055be5158c6a8 R09: 00000007d9e95a00
May 11 19:35:15 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
May 11 19:35:15 kernel: R13: 000055be515bcfc0 R14: 00007fffec958800 R15: 00007faf1d6c6000
May 11 19:35:15 kernel: INFO: task worker:714626 blocked for more than 120 seconds.
May 11 19:35:15 kernel: Not tainted 4.18.0-489.el8.x86_64 #1
May 11 19:35:15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 11 19:35:15 kernel: task:worker state:D stack: 0 pid:714626 ppid: 1 flags:0x00000180
May 11 19:35:15 kernel: Call Trace:
May 11 19:35:15 kernel: __schedule+0x2d1/0x870
May 11 19:35:15 kernel: schedule+0x55/0xf0
May 11 19:35:15 kernel: schedule_preempt_disabled+0xa/0x10
May 11 19:35:15 kernel: rwsem_down_read_slowpath+0x26e/0x3f0
May 11 19:35:15 kernel: down_read+0x95/0xa0
May 11 19:35:15 kernel: do_madvise.part.30+0x2c3/0xa40
May 11 19:35:15 kernel: ? syscall_trace_enter+0x1ff/0x2d0
May 11 19:35:15 kernel: ? __x64_sys_madvise+0x26/0x30
May 11 19:35:15 kernel: __x64_sys_madvise+0x26/0x30
May 11 19:35:15 kernel: do_syscall_64+0x5b/0x1b0
May 11 19:35:15 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
May 11 19:35:15 kernel: RIP: 0033:0x7faf1a138a4b
May 11 19:35:15 kernel: Code: Unable to access opcode bytes at RIP 0x7faf1a138a21.
May 11 19:35:15 kernel: RSP: 002b:00007faf151ea7f8 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
May 11 19:35:15 kernel: RAX: ffffffffffffffda RBX: 00007faf149eb000 RCX: 00007faf1a138a4b
May 11 19:35:15 kernel: RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007faf149eb000
May 11 19:35:15 kernel: RBP: 0000000000000000 R08: 00000007faf080ba R09: 00000000ffffffff
May 11 19:35:15 kernel: R10: 00007faf151ea760 R11: 0000000000000206 R12: 00007faf15aec48e
May 11 19:35:15 kernel: R13: 00007faf15aec48f R14: 00007faf151eb700 R15: 00007faf151ea8c0
May 11 19:35:15 kernel: INFO: task worker:714628 blocked for more than 120 seconds.
May 11 19:35:15 kernel: Not tainted 4.18.0-489.el8.x86_64 #1
May 11 19:35:15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
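Both call traces end in rwsem_down_read_slowpath (one via get_user_pages_unlocked, one via do_madvise), so the tasks appear to be waiting on the same mmap semaphore of the qemu process. Since ps itself hangs in this state (it reads /proc/<pid>/cmdline, which likely needs that same semaphore), the following rough sketch is how I still enumerate the stuck tasks; running 'echo w > /proc/sysrq-trigger' afterwards dumps their kernel stacks to dmesg:

    import os

    # List tasks in uninterruptible sleep (state D), including qemu's
    # per-vCPU threads under /proc/<tgid>/task/. Reads only stat and
    # comm, which do not take the mmap semaphore that cmdline (and
    # therefore ps) blocks on. Sketch only; run as root on an affected
    # host while a VM is locked up.
    for tgid in filter(str.isdigit, os.listdir('/proc')):
        task_dir = '/proc/%s/task' % tgid
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open('%s/%s/stat' % (task_dir, tid)) as f:
                    stat = f.read()
                # the state flag is the first field after the parenthesised comm
                state = stat.rsplit(')', 1)[1].split()[0]
                if state == 'D':
                    with open('%s/%s/comm' % (task_dir, tid)) as f:
                        print(tid, f.read().strip())
            except OSError:
                continue  # thread exited mid-scan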
Installed VDSM packages:
vdsm-api-4.50.3.4-1.el8.noarch
vdsm-network-4.50.3.4-1.el8.x86_64
vdsm-yajsonrpc-4.50.3.4-1.el8.noarch
vdsm-http-4.50.3.4-1.el8.noarch
vdsm-client-4.50.3.4-1.el8.noarch
vdsm-4.50.3.4-1.el8.x86_64
vdsm-gluster-4.50.3.4-1.el8.x86_64
vdsm-python-4.50.3.4-1.el8.noarch
vdsm-jsonrpc-4.50.3.4-1.el8.noarch
vdsm-common-4.50.3.4-1.el8.noarch
Installed libvirt packages:
libvirt-client-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-nodedev-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-logical-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-network-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-qemu-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-scsi-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-core-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-config-network-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-iscsi-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-rbd-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-libs-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-config-nwfilter-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-secret-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-disk-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-mpath-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-gluster-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
python3-libvirt-8.0.0-2.module_el8.7.0+1218+f626c2ff.x86_64
libvirt-daemon-driver-nwfilter-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-lock-sanlock-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-interface-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-driver-storage-iscsi-direct-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
libvirt-daemon-kvm-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
While a VM is locked up it does not respond on the network, and I cannot use the VNC console (or any other console) to check what is happening from the VM's perspective. The host cannot even list its running processes. There are plenty of resources left; each host runs about 30-35 VMs.
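In case it helps anyone narrow this down: a quick guest-agent ping through python3-libvirt (already on the hosts, see the package list above) should show whether the agent inside a stuck VM still answers at all. A rough sketch; 'myvm' is a placeholder name, and it assumes a local root connection the way virsh uses it (adjust if vdsm's libvirt auth setup rejects the open call):

    import libvirt
    import libvirt_qemu

    # Ping the qemu-guest-agent of one VM through libvirt (sketch).
    conn = libvirt.open('qemu:///system')
    try:
        dom = conn.lookupByName('myvm')  # placeholder VM name
        # 5-second timeout: a timeout here points at the guest/agent side,
        # an immediate {"return": {}} points back at the vdsm poller.
        print(libvirt_qemu.qemuAgentCommand(dom, '{"execute": "guest-ping"}', 5, 0))
    except libvirt.libvirtError as e:
        print('agent unreachable:', e)
    finally:
        conn.close()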
At first I suspected GlusterFS (I use Gluster on other clusters and it usually works fine), so we migrated all VMs back to the old NFS storage, but the problem came back today on two hosts. I do not see these issues on another cluster that runs Rocky Linux 8.6 with hyperconverged GlusterFS, so as a last resort I will migrate the hosts from CentOS 8 Stream to Rocky 8.
Has anyone observed similar issues with oVirt hosts on CentOS 8 Stream? Any help is welcome, as I'm running out of ideas.