Re: [ClusterLabs] Early VM resource migration
On 12/16/2015 10:30 AM, Klechomir wrote:
> On 16.12.2015 17:52, Ken Gaillot wrote:
>> On 12/16/2015 02:09 AM, Klechomir wrote:
>>> Hi list,
>>> I have a cluster with VM resources on a cloned active-active storage.
>>>
>>> VirtualDomain resource migrates properly during failover (node standby),
>>> but tries to migrate back too early, during failback, ignoring the
>>> "order" constraint, telling it to start when the cloned storage is up.
>>> This causes unnecessary VM restart.
>>>
>>> Is there any way to make it wait, until its storage resource is up?
>> Hi Klecho,
>>
>> If you have an order constraint, the cluster will not try to start the
>> VM until the storage resource agent returns success for its start. If
>> the storage isn't fully up at that point, then the agent is faulty, and
>> should be modified to wait until the storage is truly available before
>> returning success.
>>
>> If you post all your constraints, I can look for anything that might
>> affect the behavior.
> Thanks for the reply, Ken
>
> Seems to me that the constraints for cloned resources act a bit
> differently.
>
> Here is my config:
>
> primitive p_AA_Filesystem_CDrive1 ocf:heartbeat:Filesystem \
>         params device="/dev/CSD_CDrive1/AA_CDrive1" directory="/volumes/AA_CDrive1" fstype="ocfs2" options="rw,noatime"
> primitive VM_VM1 ocf:heartbeat:VirtualDomain \
>         params config="/volumes/AA_CDrive1/VM_VM1/VM1.xml" hypervisor="qemu:///system" migration_transport="tcp" \
>         meta allow-migrate="true" target-role="Started"
> clone AA_Filesystem_CDrive1 p_AA_Filesystem_CDrive1 \
>         meta interleave="true" resource-stickiness="0" target-role="Started"
> order VM_VM1_after_AA_Filesystem_CDrive1 inf: AA_Filesystem_CDrive1 VM_VM1
>
> Every time a node comes back from standby, the VM tries to live
> migrate to it long before the filesystem is up.

In most cases (including this one), when you have an order constraint,
you also need a colocation constraint.

colocation = two resources must run on the same node
order = one resource must be started/stopped/whatever before another

Or you could use a group, which is essentially a shortcut for specifying
colocation and order constraints for any sequence of resources.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
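For the configuration quoted in this message, the colocation constraint Ken describes might be sketched in crm shell syntax as below. An order constraint only sequences actions; it does not by itself require the VM to be on a node where an instance of the filesystem clone is active, which is what the colocation adds. The constraint ID VM_VM1_with_AA_Filesystem_CDrive1 is a made-up name for illustration; the resource names and the order line are taken from the posted config.

```
# Hypothetical ID; place VM_VM1 only where the cloned filesystem is running
colocation VM_VM1_with_AA_Filesystem_CDrive1 inf: VM_VM1 AA_Filesystem_CDrive1
# Existing constraint from the posted config: start the filesystem first
order VM_VM1_after_AA_Filesystem_CDrive1 inf: AA_Filesystem_CDrive1 VM_VM1
```

In crm shell colocation syntax the first resource is the one being placed relative to the second, so the VM follows the storage rather than the other way round.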
[ClusterLabs] Antw: Re: Early VM resource migration
>>> Klechomir wrote on 16.12.2015 at 17:30 in message
<5671918e.40...@gmail.com>:
> On 16.12.2015 17:52, Ken Gaillot wrote:
>> On 12/16/2015 02:09 AM, Klechomir wrote:
>>> Hi list,
>>> I have a cluster with VM resources on a cloned active-active storage.
>>>
>>> VirtualDomain resource migrates properly during failover (node standby),
>>> but tries to migrate back too early, during failback, ignoring the
>>> "order" constraint, telling it to start when the cloned storage is up.
>>> This causes unnecessary VM restart.
>>>
>>> Is there any way to make it wait, until its storage resource is up?
>> Hi Klecho,
>>
>> If you have an order constraint, the cluster will not try to start the
>> VM until the storage resource agent returns success for its start. If
>> the storage isn't fully up at that point, then the agent is faulty, and
>> should be modified to wait until the storage is truly available before
>> returning success.
>>
>> If you post all your constraints, I can look for anything that might
>> affect the behavior.
> Thanks for the reply, Ken
>
> Seems to me that the constraints for cloned resources act a bit
> differently.
>
> Here is my config:
>
> primitive p_AA_Filesystem_CDrive1 ocf:heartbeat:Filesystem \
>         params device="/dev/CSD_CDrive1/AA_CDrive1" directory="/volumes/AA_CDrive1" fstype="ocfs2" options="rw,noatime"
> primitive VM_VM1 ocf:heartbeat:VirtualDomain \
>         params config="/volumes/AA_CDrive1/VM_VM1/VM1.xml" hypervisor="qemu:///system" migration_transport="tcp" \
>         meta allow-migrate="true" target-role="Started"
> clone AA_Filesystem_CDrive1 p_AA_Filesystem_CDrive1 \
>         meta interleave="true" resource-stickiness="0" target-role="Started"
> order VM_VM1_after_AA_Filesystem_CDrive1 inf: AA_Filesystem_CDrive1 VM_VM1
>
> Every time a node comes back from standby, the VM tries to live
> migrate to it long before the filesystem is up.

Hi!

To me your config looks rather incomplete: What about DLM, O2CB, cLVM, etc.?
Re: [ClusterLabs] Early VM resource migration
On 12/16/2015 02:09 AM, Klechomir wrote:
> Hi list,
> I have a cluster with VM resources on a cloned active-active storage.
>
> VirtualDomain resource migrates properly during failover (node standby),
> but tries to migrate back too early, during failback, ignoring the
> "order" constraint, telling it to start when the cloned storage is up.
> This causes unnecessary VM restart.
>
> Is there any way to make it wait, until its storage resource is up?

Hi Klecho,

If you have an order constraint, the cluster will not try to start the
VM until the storage resource agent returns success for its start. If
the storage isn't fully up at that point, then the agent is faulty, and
should be modified to wait until the storage is truly available before
returning success.

If you post all your constraints, I can look for anything that might
affect the behavior.
Re: [ClusterLabs] [Pacemaker] Beginner | Resources stuck unloading
On 12/14/2015 12:18 AM, Tyler Hampton wrote:
> Hi!
>
> I'm currently trying to semi-follow Sebastien Han's blog post on
> implementing HA with Ceph rbd volumes and I am hitting some walls. The
> difference between what I'm trying to do and the blog post is that I'm
> trying to implement an active/passive instead of an active/active.
>
> I am able to get the two nodes to recognize each other and for a single
> node to assume resources. However, the setup is fairly finicky (I'm
> assuming due to my ignorance) and I can't get it to work most of the time.
>
> When I do get a pair and try to fail over (service pacemaker stop) the node
> that I'm stopping pacemaker on fails to unload its controlled resources and
> goes into a loop. A 'proper' failover has only happened twice.
>
> pacemaker stop output (with log output):
> https://gist.github.com/howdoicomputer/d88e224f6fead4623efc
>
> resource configuration:
> https://gist.github.com/howdoicomputer/a6f846eb54c3024a5be9
>
> Any help is greatly appreciated.

Hopefully someone with more ceph or upstart experience can give you more
specifics.

But generally, stonith-enabled=false can lead to error recovery problems
and make trouble harder to diagnose. If you can take the time to get
stonith working, it should at least stop your first problem from causing
further problems.

If you're using corosync 2, you can set "two_node: 1" in corosync.conf,
and delete the no-quorum-policy=ignore setting in Pacemaker. It won't
make a huge difference, but corosync 2 can handle it better now.

If you are doing a planned failover, a better way would be to put the
node into standby mode first, then stop pacemaker. That ensures all
resources are successfully failed over first, and when the node comes
back, it lets you decide when it's ready to host resources again (by
taking it out of standby mode), which gives you time for
administration/troubleshooting/whatever reason you took it down.
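As a rough sketch, the planned-failover sequence described above could look like this. It assumes the crm shell is installed, and node1 is a placeholder node name that does not appear in the original post; "service pacemaker stop" is the command quoted from the question.

```
# Move all resources off the node (node1 is a placeholder name)
crm node standby node1

# One-shot status check: confirm nothing is still running on node1
crm_mon -1

# Only now stop the cluster stack on that node
service pacemaker stop

# ... administration / troubleshooting ...

# Bring the node back, then allow it to host resources when ready
service pacemaker start
crm node online node1
```

Keeping the node in standby until the last step means that starting pacemaker again does not immediately pull resources back onto it.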