> >Super stable environment for many years through software and hardware
> >upgrades, few issues to speak of. Then, without warning, one of my
> >hypervisors in a 3-node group crashed with a memory DIMM error; cluster
> >HA took over and restarted the VMs on the other two nodes in the group
> >as expected. The problem materialized quickly: the VMs started
> >rebooting repeatedly, with a lot of network issues and notices of
> >pending migration. I could not pin down exactly what the root cause
> >was.

> This sounds like it wanted to balance the load. Do you have CRS active
> and/or static load scheduling?

The CRS option is set to basic, not dynamic.
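For reference, as far as I understand it, the scheduler mode is set in
/etc/pve/datacenter.cfg, so mine effectively amounts to the line below
(I believe ha=basic is also the default when no crs line is present at
all):

    crs: ha=basic

Static load scheduling would instead be something like:

    crs: ha=static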
> >Notably, these particular VMs all have multiple network interfaces.
> >After several hours of not being able to get the current VMs stable, I
> >tried spinning up new VMs, to no avail; the reboots persisted on the
> >new VMs. This seemed to affect only the VMs that had been on the
> >hypervisor that failed; all other VMs across the cluster were fine.
> >
> >I have not installed any third-party monitoring software. I found a
> >few posts in the forum about that, but it was not my issue.
> >
> >In an act of desperation, I performed a dist-upgrade and this solved
> >the issue straight away.
> >
> >Kernel Version: Linux 6.8.12-4-pve (2024-11-06T15:04Z)
> >Manager Version: pve-manager/8.3.0/c1689ccb1065a83b

> The upgrade likely restarted the pve-ha-lrm service, which could break
> the migration cycle.
>
> The systemd logs should give you a clue as to what was happening; the
> HA stack logs its actions on the given node.

I don't see anything particular in the lrm logs, just the VMs starting
over and over. Here are the relevant syslog entries, from the end of one
reboot cycle to the beginning of the next startup:

2024-11-21T18:36:59.023578-06:00 vvepve13 qmeventd[3838]: Starting cleanup for 13101
2024-11-21T18:36:59.105435-06:00 vvepve13 qmeventd[3838]: Finished cleanup for 13101
2024-11-21T18:37:30.758618-06:00 vvepve13 pve-ha-lrm[1608]: successfully acquired lock 'ha_agent_vvepve13_lock'
2024-11-21T18:37:30.758861-06:00 vvepve13 pve-ha-lrm[1608]: watchdog active
2024-11-21T18:37:30.758977-06:00 vvepve13 pve-ha-lrm[1608]: status change wait_for_agent_lock => active
2024-11-21T18:37:30.789271-06:00 vvepve13 pve-ha-lrm[4337]: starting service vm:13101
2024-11-21T18:37:30.808204-06:00 vvepve13 pve-ha-lrm[4338]: start VM 13101: UPID:vvepve13:000010F2:00007AEA:673FD24A:qmstart:13101:root@pam:
2024-11-21T18:37:30.808383-06:00 vvepve13 pve-ha-lrm[4337]: <root@pam> starting task UPID:vvepve13:000010F2:00007AEA:673FD24A:qmstart:13101:root@pam:
2024-11-21T18:37:31.112154-06:00 vvepve13 systemd[1]: Started 13101.scope.
2024-11-21T18:37:32.802414-06:00 vvepve13 kernel: [ 316.379944] tap13101i0: entered promiscuous mode
2024-11-21T18:37:32.846352-06:00 vvepve13 kernel: [ 316.423935] vmbr0: port 10(tap13101i0) entered blocking state
2024-11-21T18:37:32.846372-06:00 vvepve13 kernel: [ 316.423946] vmbr0: port 10(tap13101i0) entered disabled state
2024-11-21T18:37:32.846375-06:00 vvepve13 kernel: [ 316.423990] tap13101i0: entered allmulticast mode
2024-11-21T18:37:32.847377-06:00 vvepve13 kernel: [ 316.424825] vmbr0: port 10(tap13101i0) entered blocking state
2024-11-21T18:37:32.847391-06:00 vvepve13 kernel: [ 316.424832] vmbr0: port 10(tap13101i0) entered forwarding state
2024-11-21T18:37:34.594397-06:00 vvepve13 kernel: [ 318.172029] tap13101i1: entered promiscuous mode
2024-11-21T18:37:34.640376-06:00 vvepve13 kernel: [ 318.217302] vmbr0: port 11(tap13101i1) entered blocking state
2024-11-21T18:37:34.640393-06:00 vvepve13 kernel: [ 318.217310] vmbr0: port 11(tap13101i1) entered disabled state
2024-11-21T18:37:34.640396-06:00 vvepve13 kernel: [ 318.217341] tap13101i1: entered allmulticast mode
2024-11-21T18:37:34.640398-06:00 vvepve13 kernel: [ 318.218073] vmbr0: port 11(tap13101i1) entered blocking state
2024-11-21T18:37:34.640400-06:00 vvepve13 kernel: [ 318.218077] vmbr0: port 11(tap13101i1) entered forwarding state
2024-11-21T18:37:35.819630-06:00 vvepve13 pve-ha-lrm[4337]: Task 'UPID:vvepve13:000010F2:00007AEA:673FD24A:qmstart:13101:root@pam:' still active, waiting
2024-11-21T18:37:36.249349-06:00 vvepve13 kernel: [ 319.827024] tap13101i2: entered promiscuous mode
2024-11-21T18:37:36.291346-06:00 vvepve13 kernel: [ 319.868406] vmbr0: port 12(tap13101i2) entered blocking state
2024-11-21T18:37:36.291365-06:00 vvepve13 kernel: [ 319.868417] vmbr0: port 12(tap13101i2) entered disabled state
2024-11-21T18:37:36.291367-06:00 vvepve13 kernel: [ 319.868443] tap13101i2: entered allmulticast mode
2024-11-21T18:37:36.291368-06:00 vvepve13 kernel: [ 319.869185] vmbr0: port 12(tap13101i2) entered blocking state
2024-11-21T18:37:36.291369-06:00 vvepve13 kernel: [ 319.869191] vmbr0: port 12(tap13101i2) entered forwarding state
2024-11-21T18:37:37.997394-06:00 vvepve13 kernel: [ 321.575034] tap13101i3: entered promiscuous mode
2024-11-21T18:37:38.040384-06:00 vvepve13 kernel: [ 321.617225] vmbr0: port 13(tap13101i3) entered blocking state
2024-11-21T18:37:38.040396-06:00 vvepve13 kernel: [ 321.617236] vmbr0: port 13(tap13101i3) entered disabled state
2024-11-21T18:37:38.040400-06:00 vvepve13 kernel: [ 321.617278] tap13101i3: entered allmulticast mode
2024-11-21T18:37:38.040402-06:00 vvepve13 kernel: [ 321.618070] vmbr0: port 13(tap13101i3) entered blocking state
2024-11-21T18:37:38.040403-06:00 vvepve13 kernel: [ 321.618077] vmbr0: port 13(tap13101i3) entered forwarding state
2024-11-21T18:37:38.248094-06:00 vvepve13 pve-ha-lrm[4337]: <root@pam> end task UPID:vvepve13:000010F2:00007AEA:673FD24A:qmstart:13101:root@pam: OK
2024-11-21T18:37:38.254144-06:00 vvepve13 pve-ha-lrm[4337]: service status vm:13101 started
2024-11-21T18:37:44.256824-06:00 vvepve13 QEMU[3794]: kvm: ../accel/kvm/kvm-all.c:1836: kvm_irqchip_commit_routes: Assertion `ret == 0' failed.
2024-11-21T18:38:17.486394-06:00 vvepve13 kernel: [ 361.063298] vmbr0: port 10(tap13101i0) entered disabled state
2024-11-21T18:38:17.486423-06:00 vvepve13 kernel: [ 361.064099] tap13101i0 (unregistering): left allmulticast mode
2024-11-21T18:38:17.486426-06:00 vvepve13 kernel: [ 361.064110] vmbr0: port 10(tap13101i0) entered disabled state
2024-11-21T18:38:17.510386-06:00 vvepve13 kernel: [ 361.087517] vmbr0: port 11(tap13101i1) entered disabled state
2024-11-21T18:38:17.510400-06:00 vvepve13 kernel: [ 361.087796] tap13101i1 (unregistering): left allmulticast mode
2024-11-21T18:38:17.510403-06:00 vvepve13 kernel: [ 361.087805] vmbr0: port 11(tap13101i1) entered disabled state
2024-11-21T18:38:17.540386-06:00 vvepve13 kernel: [ 361.117511] vmbr0: port 12(tap13101i2) entered disabled state
2024-11-21T18:38:17.540402-06:00 vvepve13 kernel: [ 361.117817] tap13101i2 (unregistering): left allmulticast mode
2024-11-21T18:38:17.540404-06:00 vvepve13 kernel: [ 361.117827] vmbr0: port 12(tap13101i2) entered disabled state
2024-11-21T18:38:17.561380-06:00 vvepve13 kernel: [ 361.138518] vmbr0: port 13(tap13101i3) entered disabled state
2024-11-21T18:38:17.561394-06:00 vvepve13 kernel: [ 361.138965] tap13101i3 (unregistering): left allmulticast mode
2024-11-21T18:38:17.561399-06:00 vvepve13 kernel: [ 361.138977] vmbr0: port 13(tap13101i3) entered disabled state
2024-11-21T18:38:17.584412-06:00 vvepve13 systemd[1]: 13101.scope: Deactivated successfully.
2024-11-21T18:38:17.584619-06:00 vvepve13 systemd[1]: 13101.scope: Consumed 51.122s CPU time.
2024-11-21T18:38:18.522886-06:00 vvepve13 pvestatd[1476]: VM 13101 qmp command failed - VM 13101 not running
2024-11-21T18:38:18.523725-06:00 vvepve13 pve-ha-lrm[4889]: <root@pam> end task UPID:vvepve13:0000131A:00008A78:673FD272:qmstart:13104:root@pam: OK
2024-11-21T18:38:18.945142-06:00 vvepve13 qmeventd[4990]: Starting cleanup for 13101
2024-11-21T18:38:19.022405-06:00 vvepve13 qmeventd[4990]: Finished cleanup for 13101

Thanks
JR

_______________________________________________
pve-user mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
