On Wed, 2021-06-30 at 08:40 -0700, Matthew Schumacher wrote:
> Hello,
>
> I'm not sure how to fix this, but calling 'crm resource restart
> vm-name' this morning caused an entire node to get fenced, kicking the
> stool out from under a number of VMs.
>
> Looking at VirtualDomain it looks like the system defaults to a 90s
> timeout, and if it can't gracefully shut down the VM with 'virsh
> shutdown' in 85s, then it calls 'virsh destroy'. For whatever
> reason, that's not what happened.

That would be the mystery to solve. It sounds like the node was fenced
because the stop failed, so that would be where to investigate.

> I created a mockup where I moved a test vm to its own node (in case
> it gets fenced), then loaded something that would ignore ACPI
> shutdown, then called restart. This time it worked. The logs show:
>
> Jun 30 15:32:11 VirtualDomain(vm-testvm)[13047]: INFO: Issuing graceful shutdown request for domain testvm.
> Jun 30 15:32:26 VirtualDomain(vm-testvm)[13047]: INFO: Issuing forced shutdown (destroy) request for domain testvm.
>
> I don't have the logs from the original failure due to my node not
> being persistent, but I wonder if anyone else has run into this.
>
> Here is my resource configuration if that reveals the issue:
>
> crm configure primitive vm-testvm2 VirtualDomain params
> config="/datastore/vm/testvm/testvm.xml" migration_transport=ssh meta
> allow-migrate=true target-role=Started op monitor timeout=30
> interval=30
>
> Oh, one last question: Can I disable fencing for a specific resource
> for testing reasons? I'd love to watch this break without fear of
> fencing.

Yes, for this scenario, configuring on-fail=block for the stop operation
would cause the cluster to leave the VM alone if the stop fails. (The VM
would not be recovered elsewhere.)

> Matt
--
Ken Gaillot <kgail...@redhat.com>
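
For illustration, here is a rough sketch of what the on-fail=block suggestion might look like when added to the primitive quoted in this thread. It only assembles and prints the crm command for review; nothing is applied to the cluster. The explicit stop operation is not part of the original configuration, and its interval=0 and timeout=90 values are assumptions (the 90 simply mirrors the 90s figure mentioned in the question), so adjust them to your environment.

```shell
#!/bin/sh
# Sketch only: the primitive from the thread plus an explicit stop operation.
# ASSUMPTION: interval=0 and timeout=90 are illustrative values, not taken
# from the original configuration; only on-fail=block comes from the reply.
STOP_OP='op stop interval=0 timeout=90 on-fail=block'

# Backslash-newline inside double quotes is a line continuation, so CMD
# ends up as a single-line command string.
CMD="crm configure primitive vm-testvm2 VirtualDomain \
params config=\"/datastore/vm/testvm/testvm.xml\" migration_transport=ssh \
meta allow-migrate=true target-role=Started \
op monitor timeout=30 interval=30 \
$STOP_OP"

# Dry run: print the command instead of executing it.
printf '%s\n' "$CMD"
```

If the resource already exists, the same "op stop ..." clause would more likely be added by editing the existing definition than by creating the primitive again. With on-fail=block in place, a failed stop should leave the resource blocked for manual inspection, which means the VM is not recovered on another node until the failure is cleaned up.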