Hi!

We are still suffering from kernel RAM corruption on the Xen hypervisor when a
VM or the hypervisor is doing I/O (three months since the bug report at SUSE,
still no fix or workaround, so the whole Xen cluster project was canceled
after 20 years; but that's a different topic). All VMs will be migrated to
VMware, and the whole SLES15 Xen cluster will be dumped very soon.

My script that detected the RAM corruption tried to shut down pacemaker,
hoping for the best (i.e. that the VMs would be live-migrated away).
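
Shutting pacemaker down cleanly should make the DC move all resources off the
node first, and with allow-migrate=true on the VirtualDomain resources that
means a live migration. Roughly (just a sketch, not the actual script):

   # stop the cluster services so pacemaker evacuates this node first
   crm cluster stop        # or: systemctl stop pacemaker
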
However, pacemaker (pacemaker-2.0.5+20201202.ba59be712-150300.4.21.1.x86_64)
made some very strange decisions:

May 24 17:05:07 h16 VirtualDomain(prm_xen_test-jeos7)[24460]: INFO: test-jeos7: 
live migration to h19 succeeded.
May 24 17:05:07 h16 VirtualDomain(prm_xen_test-jeos9)[24463]: INFO: test-jeos9: 
live migration to h19 succeeded.
May 24 17:05:07 h16 pacemaker-execd[7504]:  notice: prm_xen_test-jeos7 
migrate_to (call 321, PID 24281) exited with status 0 (execution time 5500ms, 
queue time 0ms)
May 24 17:05:07 h16 pacemaker-controld[7509]:  notice: Result of migrate_to 
operation for prm_xen_test-jeos7 on h16: ok
May 24 17:05:07 h16 pacemaker-execd[7504]:  notice: prm_xen_test-jeos9 
migrate_to (call 323, PID 24283) exited with status 0 (execution time 5514ms, 
queue time 0ms)
May 24 17:05:07 h16 pacemaker-controld[7509]:  notice: Result of migrate_to 
operation for prm_xen_test-jeos9 on h16: ok

Would you agree that the migration was successful? I'd say YES!

However, this is what happened:

May 24 17:05:19 h16 pacemaker-controld[7509]:  notice: Transition 2460 
(Complete=16, Pending=0, Fired=0, Skipped=7, Incomplete=57, 
Source=/var/lib/pacemaker/pengine/pe-input-89.bz2): Stopped
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Unexpected result 
(error) was recorded for stop of prm_ping_gw1:1 on h16 at May 24 17:05:02 2022
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Unexpected result 
(error) was recorded for stop of prm_ping_gw1:1 on h16 at May 24 17:05:02 2022
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Cluster node h16 will 
be fenced: prm_ping_gw1:1 failed there
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Unexpected result 
(error) was recorded for stop of prm_iotw-md10:1 on h16 at May 24 17:05:02 2022
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Unexpected result 
(error) was recorded for stop of prm_iotw-md10:1 on h16 at May 24 17:05:02 2022
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_ping_gw1 
away from h16 after 1000000 failures (max=1000000)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_ping_gw1 
away from h16 after 1000000 failures (max=1000000)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_ping_gw1 
away from h16 after 1000000 failures (max=1000000)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_iotw-md10 
away from h16 after 1000000 failures (max=1000000)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_iotw-md10 
away from h16 after 1000000 failures (max=1000000)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Forcing cln_iotw-md10 
away from h16 after 1000000 failures (max=1000000)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  notice: Resource 
prm_xen_test-jeos7 can no longer migrate from h16 to h19 (will stop on both 
nodes)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  notice: Resource 
prm_xen_test-jeos9 can no longer migrate from h16 to h19 (will stop on both 
nodes)
May 24 17:05:19 h16 pacemaker-schedulerd[7508]:  warning: Scheduling Node h16 
for STONITH

So the DC considers the migration to have failed, even though it was reported
as a success!
(The ping resource had dumped core earlier due to the RAM corruption:)

May 24 17:03:12 h16 kernel: ping[23973]: segfault at 213e6 ip 00000000000213e6 
sp 00007ffc249fab78 error 14 in bash[5655262bc000+f1000]

So it stopped the VMs that had been migrated successfully before:
May 24 17:05:19 h16 pacemaker-controld[7509]:  notice: Initiating stop 
operation prm_xen_test-jeos7_stop_0 on h19
May 24 17:05:19 h16 pacemaker-controld[7509]:  notice: Initiating stop 
operation prm_xen_test-jeos9_stop_0 on h19
May 24 17:05:19 h16 pacemaker-controld[7509]:  notice: Requesting fencing 
(reboot) of node h16

Those test VMs were not important; the important part is that, because of the
failed stop of the ping resource, pacemaker did not even try to migrate the
other (non-test) VMs away, so those were taken down hard by the fencing.
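
With STONITH enabled, a failed stop escalates to fencing by default
(on-fail=fence for stop operations). Relaxing that for such non-essential
resources would be one option; a rough, untested sketch in crmsh syntax
(assuming the usual ocf:pacemaker:ping agent; host_list and the timeouts are
placeholders, and on-fail=block merely leaves the failed resource unmanaged
on that node instead of fencing it):

   # hypothetical: don't escalate a failed stop of the ping resource to fencing
   primitive prm_ping_gw1 ocf:pacemaker:ping \
       params host_list="..." \
       op monitor interval=10s timeout=60s \
       op stop interval=0 timeout=90s on-fail=block

Whether that is wise for a ping clone is debatable, of course; it only shifts
the problem.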

For completeness I should add that the RAM corruption also affected pacemaker 
itself:

May 24 17:05:02 h16 kernel: traps: pacemaker-execd[24272] general protection 
fault ip:7fc572327bcf sp:7ffca7cd22d0 error:0 in 
libc-2.31.so[7fc572246000+1e6000]
May 24 17:05:02 h16 kernel: pacemaker-execd[24277]: segfault at 0 ip 
0000000000000000 sp 00007ffca7cd22f0 error 14 in 
pacemaker-execd[56347df4e000+b000]
May 24 17:05:02 h16 kernel: Code: Bad RIP value.

That affected the stop of some (non-essential) ping and MD-RAID-based
resources:
May 24 17:05:02 h16 pacemaker-execd[7504]:  warning: prm_ping_gw1_stop_0[24272] 
terminated with signal: Segmentation fault
May 24 17:05:02 h16 pacemaker-execd[7504]:  warning: 
prm_iotw-md10_stop_0[24277] terminated with signal: Segmentation fault

May 24 17:05:03 h16 sbd[7254]:  warning: inquisitor_child: pcmk health check: 
UNHEALTHY
May 24 17:05:03 h16 sbd[7254]:  warning: inquisitor_child: Servant pcmk is 
outdated (age: 1844062)

Note: if the "outdated" number is in seconds, that's definitely odd!

Regards,
Ulrich

