Well, I'm not getting failures right now simply by setting attributes, but I can induce a failure by stopping vm-db02 (this puts db02 into an UNCLEAN state and triggers an attempted migration of the unrelated vm-compute-test). I've collected the commands from my latest interactions, a crm_report, and a gdb backtrace from the core file that crmd dumped, into bug 5164.
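
For the record, the reproduction and collection steps were roughly the following (shell commands as I ran them; the crm_report time window and the core/binary paths below are only illustrative -- adjust them to your own logs and build):

# crm resource stop vm-db02
### db02 goes UNCLEAN and the cluster attempts to migrate vm-compute-test
# crm status
### gather logs and CIB for the window around the failure
# crm_report -f "2013-07-03 12:00" bug5164
### backtrace from the core that crmd dumped (paths illustrative)
# gdb /usr/libexec/pacemaker/crmd /var/lib/pacemaker/cores/root/core.12345
(gdb) thread apply all bt full
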
On Tue, Jul 2, 2013 at 8:40 PM, David Vossel <dvos...@redhat.com> wrote:

> > ----- Original Message -----
> > From: "Lindsay Todd" <rltodd....@gmail.com>
> > To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> > Sent: Tuesday, July 2, 2013 5:36:43 PM
> > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> >
> > You didn't notice that after setting attributes on "db02", the remote node "db02" went offline as "unclean", even though vm-db02 was still running?
>
> nope... apparently I'm blind :)
>
> > That strikes me as wrong! Once it gets into this state, I can order vm-db02 to stop, but it never will. Indeed, pacemaker doesn't do much at this point
>
> I'm really confused about how a remote-node could manage to get into an "UNCLEAN" state. Interesting. Can you reproduce it easily? A crm_report attached to a bugs.clusterlabs.org issue would be helpful. If you haven't erased your logs you could still retrieve everything in the report for the specific time period it occurred in. I definitely need to get that worked out.
>
> > -- I can put everything into standby mode, and services don't shut down. That is why the forcible reboot. Also, why I don't know (yet) what would happen to a service on db02 when this happens -- it takes too long to restart the cluster to carry out too many tests in one day!
> >
> > I'll review asymmetrical clusters -- I think my mistake was thinking an infinite-score location constraint to put DummyOnVM on db02 would prevent it from running anywhere else, but of course if db02 isn't running, my one rule isn't equivalent to having -inf scores elsewhere. Still odd that shutting down vm-db02 would trigger a migration of an unrelated VM.
>
> look into resource stickiness. Setting a default resource stickiness should prevent this. It might be that shutting down vm-db02 somehow meant that pacemaker decided to balance out the resources in a way that involved migrating the other vm.
>
> -- Vossel
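
For my own notes on the stickiness suggestion: if I understand it correctly, a cluster-wide default along these lines (crm shell syntax; the score of 100 is just an example value) should keep the policy engine from shuffling otherwise-healthy VMs around when something unrelated stops:

# crm configure rsc_defaults resource-stickiness=100
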
>
> > (The fact that this would also stop vm-swbuild is the known problem that constraints don't work well with migration.)
> >
> > On Tue, Jul 2, 2013 at 6:20 PM, David Vossel <dvos...@redhat.com> wrote:
> >
> > > ----- Original Message -----
> > > From: "Lindsay Todd" <rltodd....@gmail.com>
> > > To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> > > Sent: Tuesday, July 2, 2013 4:05:22 PM
> > > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > >
> > > Sorry for the delayed response, but I was out last week. I've applied this patch to 1.1.10-rc5 and have been testing:
> >
> > Thanks for testing :)
> >
> > > # crm_attribute --type status --node "db02" --name "service_postgresql" --update "true"
> > > # crm_attribute --type status --node "db02" --name "service_postgresql"
> > > scope=status name=service_postgresql value=true
> > > # crm resource stop vm-db02
> > > # crm resource start vm-db02
> > > ### Wait a bit
> > > # crm_attribute --type status --node "db02" --name "service_postgresql"
> > > scope=status name=service_postgresql value=(null)
> > > Error performing operation: No such device or address
> > > # crm_attribute --type status --node "db02" --name "service_postgresql" --update "true"
> > > # crm_attribute --type status --node "db02" --name "service_postgresql"
> > > scope=status name=service_postgresql value=true
> > >
> > > Good so far. But now look at this (every node was clean, and all services were running, before we started):
> > >
> > > # crm status
> > > Last updated: Tue Jul 2 16:15:14 2013
> > > Last change: Tue Jul 2 16:15:12 2013 via crmd on cvmh02
> > > Stack: cman
> > > Current DC: cvmh02 - partition with quorum
> > > Version: 1.1.10rc5-1.el6.ccni-2718638
> > > 9 Nodes configured, unknown expected votes
> > > 59 Resources configured.
> > >
> > > Node db02: UNCLEAN (offline)
> > > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01 ldap02:vm-ldap02 ]
> > > OFFLINE: [ swbuildsl6:vm-swbuildsl6 ]
> > >
> > > Full list of resources:
> > >
> > > fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> > > fence-cvmh02 (stonith:fence_ipmilan): Started cvmh04
> > > fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04
> > > fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> > > Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > Clone Set: c-p-libvirtd [p-libvirtd]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > Clone Set: c-watch-ib0 [p-watch-ib0]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > Clone Set: c-fs-gpfs [p-fs-gpfs]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > > vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Stopped
> > > vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh02
> > > vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > > vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04
> > > DummyOnVM (ocf::pacemaker:Dummy): Started cvmh01
> > >
> > > Not so good, and I'm not sure how to clean this up. I can't seem to stop
> >
> > clean what up? I don't understand what I'm expected to notice out of place here?! The remote-node is up, everything looks happy.
> >
> > > vm-db02 any more, even after I've entered:
> > >
> > > # crm_node -R db02 --force
> >
> > That won't stop the remote-node. 'crm resource stop vm-db02' should though.
> >
> > > # crm resource start vm-db02
> >
> > ha, I'm so confused. why are you trying to start it? I thought you were trying to stop the resource?
> >
> > > ### Wait a bit
> > >
> > > # crm status
> > > Last updated: Tue Jul 2 16:32:38 2013
> > > Last change: Tue Jul 2 16:27:28 2013 via cibadmin on cvmh01
> > > Stack: cman
> > > Current DC: cvmh02 - partition with quorum
> > > Version: 1.1.10rc5-1.el6.ccni-2718638
> > > 8 Nodes configured, unknown expected votes
> > > 54 Resources configured.
> > >
> > > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ldap01:vm-ldap01 ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> > > OFFLINE: [ db02:vm-db02 ]
> > >
> > > fence-cvmh01 (stonith:fence_ipmilan): Started cvmh03
> > > fence-cvmh02 (stonith:fence_ipmilan): Started cvmh03
> > > fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04
> > > fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> > > Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > Clone Set: c-p-libvirtd [p-libvirtd]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > Clone Set: c-watch-ib0 [p-watch-ib0]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > Clone Set: c-fs-gpfs [p-fs-gpfs]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh02
> > > vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Started cvmh01
> > > vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > > vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04
> > > DummyOnVM (ocf::pacemaker:Dummy): Started cvmh01
> > >
> > > My only recourse has been to reboot the cluster.
> > >
> > > So let's do that and try setting a location constraint on DummyOnVM, to force it on db02...
> > >
> > > Last updated: Tue Jul 2 16:43:46 2013
> > > Last change: Tue Jul 2 16:27:28 2013 via cibadmin on cvmh01
> > > Stack: cman
> > > Current DC: cvmh02 - partition with quorum
> > > Version: 1.1.10rc5-1.el6.ccni-2718638
> > > 8 Nodes configured, unknown expected votes
> > > 54 Resources configured.
> > >
> > > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01 ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> > >
> > > fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> > > fence-cvmh02 (stonith:fence_ipmilan): Started cvmh03
> > > fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04
> > > fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> > > Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > Clone Set: c-p-libvirtd [p-libvirtd]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > Clone Set: c-watch-ib0 [p-watch-ib0]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > Clone Set: c-fs-gpfs [p-fs-gpfs]
> > >     Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > >     Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]
> > > vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh01
> > > vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Started cvmh01
> > > vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh02
> > > vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > > vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04
> > > DummyOnVM (ocf::pacemaker:Dummy): Started cvmh03
> > >
> > > # pcs constraint location DummyOnVM prefers db02
> > > # crm status
> > > ...
> > > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01 ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> > > ...
> > > DummyOnVM (ocf::pacemaker:Dummy): Started db02
> > >
> > > That's what we want to see. It would be interesting to stop db02. I expect DummyOnVM to stop.
> >
> > OH, okay, so you wanted DummyOnVM to start on db02.
> >
> > > # crm resource stop vm-db02
> > > # crm status
> > > ...
> > > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ldap01:vm-ldap01 ldap02:vm-ldap02 ]
> > > OFFLINE: [ db02:vm-db02 swbuildsl6:vm-swbuildsl6 ]
> > > ...
> > > DummyOnVM (ocf::pacemaker:Dummy): Started cvmh02
> > >
> > > Failed actions:
> > >     vm-compute-test_migrate_from_0 (node=cvmh02, call=147, rc=1, status=Timed Out, last-rc-change=Tue Jul 2 16:48:17 2013, queued=20003ms, exec=0ms): unknown error
> > >
> > > Well, that is odd. (It is the case that vm-swbuildsl6 has an order dependency on vm-compute-test, as I was trying to understand how migrations worked with order dependencies (not very well).
> >
> > I don't think this failure has anything to do with the order dependencies. If pacemaker attempted to live migrate the vm and it fails, that's a resource problem. Do you have your virtual machine images on shared storage?
> >
> > > Once vm-compute-test recovers, vm-swbuildsl6 does come back up.) This isn't really very good -- if I am running services in VMs or other containers, I need them to run only in that container!
> >
> > Read about the differences between asymmetrical and symmetrical clusters. I think this will help this make sense. By default resources can run anywhere; you just gave more weight to db02 for the Dummy resource, meaning it prefers that node when it is around.
> >
> > http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_deciding_which_nodes_a_resource_can_run_on
> >
> > > If I start vm-db02 back up, I see that DummyOnVM is stopped and moved to db02.
> >
> > Yep, this is what I'd expect for a symmetrical cluster.
> >
> > Thanks again for the feedback, hope the asymmetrical/symmetrical cluster stuff helps :)
> >
> > -- Vossel
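
For my own notes on the asymmetrical/symmetrical distinction: if I really want a service confined to its container, it sounds like I either need an opt-in (asymmetrical) cluster, or explicit -inf location constraints on every other node. A rough sketch in crm shell syntax -- the constraint id below is just a name I made up:

# crm configure property symmetric-cluster=false
### in an opt-in cluster each resource then needs explicit permission to run somewhere, e.g.
# crm configure location loc-DummyOnVM-db02 DummyOnVM inf: db02
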
> >
> > > On Thu, Jun 20, 2013 at 4:16 PM, David Vossel <dvos...@redhat.com> wrote:
> > >
> > > > ----- Original Message -----
> > > > From: "David Vossel" <dvos...@redhat.com>
> > > > To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> > > > Sent: Thursday, June 20, 2013 1:35:44 PM
> > > > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > > >
> > > > > ----- Original Message -----
> > > > > From: "David Vossel" <dvos...@redhat.com>
> > > > > To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> > > > > Sent: Wednesday, June 19, 2013 4:47:58 PM
> > > > > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "Lindsay Todd" <rltodd....@gmail.com>
> > > > > > To: "The Pacemaker cluster resource manager" <Pacemaker@oss.clusterlabs.org>
> > > > > > Sent: Wednesday, June 19, 2013 4:11:58 PM
> > > > > > Subject: [Pacemaker] Pacemaker remote nodes, naming, and attributes
> > > > > >
> > > > > > I built a set of rpms for pacemaker 1.1.10-rc4 and updated my test cluster (hopefully won't be a "test" cluster forever), as well as my VMs running pacemaker-remote. The OS everywhere is Scientific Linux 6.4. I am wanting to set some attributes on remote nodes, which I can use to control where services run.
> > > > > >
> > > > > > The first deviation I note from the documentation is the naming of the remote nodes. I see:
> > > > > >
> > > > > > Last updated: Wed Jun 19 16:50:39 2013
> > > > > > Last change: Wed Jun 19 16:19:53 2013 via cibadmin on cvmh04
> > > > > > Stack: cman
> > > > > > Current DC: cvmh02 - partition with quorum
> > > > > > Version: 1.1.10rc4-1.el6.ccni-d19719c
> > > > > > 8 Nodes configured, unknown expected votes
> > > > > > 49 Resources configured.
> > > > > >
> > > > > > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01 ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]
> > > > > >
> > > > > > Full list of resources:
> > > > > >
> > > > > > and so forth. The "remote-node" names are simply the hostname, so the vm-db02 VirtualDomain resource has a remote-node name of db02. The "Pacemaker Remote" manual suggests this should be displayed as "db02", not "db02:vm-db02", although I can see how the latter format would be useful.
> > > > >
> > > > > Yep, this got changed since the documentation was published. We wanted people to be able to recognize which remote-node went with which resource easily.
> > > > >
> > > > > > So now let's set an attribute on this remote node. What name do I use? How about:
> > > > > >
> > > > > > # crm_attribute --node "db02:vm-db02" \
> > > > > >     --name "service_postgresql" \
> > > > > >     --update "true"
> > > > > > Could not map name=db02:vm-db02 to a UUID
> > > > > > Please choose from one of the matches above and suppy the 'id' with --attr-id
> > > > > >
> > > > > > Perhaps not the most informative output, but obviously it fails. Let's try the unqualified name:
> > > > > >
> > > > > > # crm_attribute --node "db02" \
> > > > > >     --name "service_postgresql" \
> > > > > >     --update "true"
> > > > > > Remote-nodes do not maintain permanent attributes, 'service_postgresql=true' will be removed after db02 reboots.
> > > > > > Error setting service_postgresql=true (section=status, set=status-db02): No such device or address
> > > > > > Error performing operation: No such device or address
> > > >
> > > > I just tested this and ran into the same errors you did. Turns out this happens when the remote-node's status section is empty. If you start a resource on the node and then set the attribute it will work... obviously this is a bug. I'm working on a fix.
> > >
> > > This should help with the attributes bit.
> > >
> > > https://github.com/ClusterLabs/pacemaker/commit/26d34a9171bddae67c56ebd8c2513ea8fa770204
> > >
> > > -- Vossel
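
One practical note from the attribute testing quoted above, at least until the fix is everywhere: setting a transient attribute only worked for me once the remote node's status section was populated, i.e. after vm-db02 was running again and db02 showed as online. So the working sequence (same commands as in the transcript above) is roughly:

# crm resource start vm-db02
### wait until db02 shows as online
# crm status
# crm_attribute --type status --node "db02" --name "service_postgresql" --update "true"
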
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org