So... it took me a while to get everything packaged and so on, but eventually I managed to upgrade my cluster to corosync 2/pacemaker 1.1.11 (the advertised version string is 1.1.10-9d39a6b). Although communication between the nodes is now much more efficient, I still have the same issue with the ordering constraint that uses clones on both sides. The ordering constraint works if I set a primitive as the first resource, but if I put this primitive in a clone resource, it stops working.
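Concretely, the two variants behave roughly like this (sketched in crm syntax, reusing the constraint id and the resource names that appear in the logs and config below; only the first resource differs):

    # clone as the first resource: the second clone never starts
    order ORD_SAN_MAILSTORE inf: cln_aoe cln_mailstore
    # primitive as the first resource: everything starts as expected
    order ORD_SAN_MAILSTORE inf: pri_aoe1 cln_mailstore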
Below are the logs I get on the node where the first resource starts:

Mar 22 23:29:18 sanaoe02 crmd[10989]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Mar 22 23:29:18 sanaoe02 cib[10984]: notice: cib:diff: Diff: --- 0.916.2
Mar 22 23:29:18 sanaoe02 cib[10984]: notice: cib:diff: Diff: +++ 0.917.1 5da74572ddb3a247189b39d515918343
Mar 22 23:29:18 sanaoe02 cib[10984]: notice: cib:diff: -- <nvpair value="Stopped" id="cln_aoe-meta_attributes-target-role"/>
Mar 22 23:29:18 sanaoe02 cib[10984]: notice: cib:diff: ++ <nvpair id="cln_aoe-meta_attributes-target-role" name="target-role" value="Started"/>
Mar 22 23:29:18 sanaoe02 pengine[10988]: notice: unpack_rsc_op: Preventing cln_aoe from re-starting on dir01: operation monitor failed 'not installed' (5)
Mar 22 23:29:18 sanaoe02 pengine[10988]: notice: unpack_rsc_op: Preventing cln_aoe from re-starting on mta02: operation monitor failed 'not installed' (5)
Mar 22 23:29:18 sanaoe02 pengine[10988]: notice: unpack_rsc_op: Preventing cln_aoe from re-starting on ms02: operation monitor failed 'not installed' (5)
Mar 22 23:29:18 sanaoe02 pengine[10988]: notice: unpack_rsc_op: Preventing cln_aoe from re-starting on mx02: operation monitor failed 'not installed' (5)
Mar 22 23:29:18 sanaoe02 pengine[10988]: notice: unpack_rsc_op: Preventing cln_aoe from re-starting on dir02: operation monitor failed 'not installed' (5)
Mar 22 23:29:18 sanaoe02 pengine[10988]: notice: LogActions: Start pri_aoe1:0#011(sanaoe02)
Mar 22 23:29:18 sanaoe02 crmd[10989]: notice: te_rsc_command: Initiating action 39: start pri_aoe1_start_0 on sanaoe02 (local)
Mar 22 23:29:18 sanaoe02 pengine[10988]: notice: process_pe_message: Calculated Transition 377: /var/lib/pacemaker/pengine/pe-input-100.bz2
Mar 22 23:29:18 sanaoe02 AoEtarget(pri_aoe1)[14285]: INFO: Exporting device /dev/xvdb on eth1 as shelf 2, slot 1
Mar 22 23:29:18 sanaoe02 AoEtarget(pri_aoe1)[14285]: DEBUG: pri_aoe1 start : 0
Mar 22 23:29:19 sanaoe02 crmd[10989]: notice: process_lrm_event: LRM operation pri_aoe1_start_0 (call=194, rc=0, cib-update=982, confirmed=true) ok
Mar 22 23:29:19 sanaoe02 crmd[10989]: notice: run_graph: Transition 377 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-100.bz2): Complete
Mar 22 23:29:19 sanaoe02 crmd[10989]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

On the nodes where the second resource should start, I get absolutely no logs *at all*!
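In case it helps, I suppose the failing transition could also be replayed offline from the pe-input file mentioned above to see the allocation scores; something like the following (assuming crm_simulate from pacemaker 1.1 behaves as documented):

    # replay the calculated transition and show allocation scores
    crm_simulate -S -s -x /var/lib/pacemaker/pengine/pe-input-100.bz2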
If I modify the ordering constraint to use a primitive as the first resource instead of a cloned resource, then everything works OK... and I get the following logs on the node where the first resource starts (very similar to the previous ones):

Mar 22 23:37:50 sanaoe02 crmd[10989]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Mar 22 23:37:50 sanaoe02 cib[10984]: notice: cib:diff: Diff: --- 0.920.3
Mar 22 23:37:50 sanaoe02 cib[10984]: notice: cib:diff: Diff: +++ 0.921.1 04b8247b3c6786c3ff15f583cf725c3d
Mar 22 23:37:50 sanaoe02 cib[10984]: notice: cib:diff: -- <nvpair value="Stopped" id="pri_aoe1-meta_attributes-target-role"/>
Mar 22 23:37:50 sanaoe02 cib[10984]: notice: cib:diff: ++ <nvpair id="pri_aoe1-meta_attributes-target-role" name="target-role" value="Started"/>
Mar 22 23:37:50 sanaoe02 pengine[10988]: notice: unpack_rsc_op: Preventing pri_aoe1 from re-starting on dir01: operation monitor failed 'not installed' (5)
Mar 22 23:37:50 sanaoe02 pengine[10988]: notice: unpack_rsc_op: Preventing pri_aoe1 from re-starting on mta02: operation monitor failed 'not installed' (5)
Mar 22 23:37:50 sanaoe02 pengine[10988]: notice: unpack_rsc_op: Preventing pri_aoe1 from re-starting on ms02: operation monitor failed 'not installed' (5)
Mar 22 23:37:50 sanaoe02 pengine[10988]: notice: unpack_rsc_op: Preventing pri_aoe1 from re-starting on mx02: operation monitor failed 'not installed' (5)
Mar 22 23:37:50 sanaoe02 pengine[10988]: notice: unpack_rsc_op: Preventing pri_aoe1 from re-starting on dir02: operation monitor failed 'not installed' (5)
Mar 22 23:37:50 sanaoe02 pengine[10988]: notice: LogActions: Start pri_dovecot:0#011(ms02)
Mar 22 23:37:50 sanaoe02 pengine[10988]: notice: LogActions: Start pri_aoe1#011(sanaoe02)
Mar 22 23:37:50 sanaoe02 crmd[10989]: notice: te_rsc_command: Initiating action 39: start pri_aoe1_start_0 on sanaoe02 (local)
Mar 22 23:37:50 sanaoe02 pengine[10988]: notice: process_pe_message: Calculated Transition 381: /var/lib/pacemaker/pengine/pe-input-104.bz2
Mar 22 23:37:50 sanaoe02 AoEtarget(pri_aoe1)[14379]: INFO: Exporting device /dev/xvdb on eth1 as shelf 2, slot 1
Mar 22 23:37:50 sanaoe02 AoEtarget(pri_aoe1)[14379]: DEBUG: pri_aoe1 start : 0
Mar 22 23:37:50 sanaoe02 crmd[10989]: notice: process_lrm_event: LRM operation pri_aoe1_start_0 (call=198, rc=0, cib-update=1027, confirmed=true) ok
Mar 22 23:37:50 sanaoe02 crmd[10989]: notice: te_rsc_command: Initiating action 25: start pri_dovecot_start_0 on ms02
Mar 22 23:37:50 sanaoe02 crmd[10989]: notice: te_rsc_command: Initiating action 26: monitor pri_dovecot_monitor_5000 on ms02
Mar 22 23:37:50 sanaoe02 crmd[10989]: notice: run_graph: Transition 381 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-104.bz2): Complete
Mar 22 23:37:50 sanaoe02 crmd[10989]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

and on the node where the second resource starts:

Mar 22 22:37:50 ms02 crmd[89496]: notice: process_lrm_event: LRM operation pri_dovecot_start_0 (call=151, rc=0, cib-update=197, confirmed=true) ok
Mar 22 22:37:50 ms02 dovecot: master: Dovecot v2.1.7 starting up
Mar 22 22:37:50 ms02 dovecot: master: Warning: /home is no longer mounted. If this is intentional, remove it with doveadm mount
Mar 22 22:37:50 ms02 crmd[89496]: notice: process_lrm_event: LRM operation pri_dovecot_monitor_5000 (call=152, rc=0, cib-update=198, confirmed=false) ok

I can't find anything useful in those logs, but if you think something is relevant (or could be), please feel free to highlight it.

2014-03-11 2:13 GMT+01:00 Andrew Beekhof <and...@beekhof.net>:
>
> On 9 Mar 2014, at 10:36 pm, Alexandre <alxg...@gmail.com> wrote:
>
>> So...,
>>
>> It appears the problem doesn't come from the primitive but from the
>> cloned resource. If I use the primitive instead of the clone in the
>> order constraint (thus deleting the clone and the group), the second
>> resource of the constraint starts up as expected.
>>
>> Any idea why?
>
> Not without logs
>
>> Should I upgrade this pretty old version of pacemaker?
>
> Yes :)
>
>> 2014-03-08 10:36 GMT+01:00 Alexandre <alxg...@gmail.com>:
>>> Hi Andrew,
>>>
>>> I have tried to stop and start the first resource of the ordering
>>> constraint (cln_san), hoping it would trigger a start attempt on the
>>> second resource of the ordering constraint (cln_mailstore).
>>> I tailed the syslog on the node where I was expecting the second
>>> resource to start, but really nothing appeared in those logs (I grepped
>>> for 'pengine' as per your suggestion).
>>>
>>> I have done another test, where I changed the first resource of the
>>> ordering constraint to a very simple primitive (an lsb resource), and
>>> it worked in that case.
>>>
>>> I am wondering whether the issue comes from the rather complicated
>>> first resource. It is a cloned group which contains a primitive with
>>> conditional instance attributes...
>>> Are you aware of any specific issue in pacemaker 1.1.7 with this kind
>>> of resource?
>>>
>>> I will try to simplify the resources by getting rid of the conditional
>>> instance attributes and try again. In the meantime I'd be delighted to
>>> hear what you guys think about it.
>>>
>>> Regards, Alex.
>>>
>>> 2014-03-07 4:21 GMT+01:00 Andrew Beekhof <and...@beekhof.net>:
>>>>
>>>> On 3 Mar 2014, at 3:56 am, Alexandre <alxg...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am setting up a cluster on Debian wheezy.
>>>>> I have installed pacemaker using the Debian-provided packages (so I am
>>>>> running 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff).
>>>>>
>>>>> I have roughly 10 nodes, among which some nodes act as SANs
>>>>> (exporting block devices using the AoE protocol) and other nodes act
>>>>> as initiators (they are actually mail servers, storing emails on the
>>>>> exported devices).
>>>>> Below are the defined resources for those nodes:
>>>>>
>>>>> xml <primitive class="ocf" id="pri_aoe1" provider="heartbeat" type="AoEtarget"> \
>>>>>   <instance_attributes id="pri_aoe1.1-instance_attributes"> \
>>>>>     <rule id="node-sanaoe01" score="1"> \
>>>>>       <expression attribute="#uname" id="expr-node-sanaoe01" operation="eq" value="sanaoe01"/> \
>>>>>     </rule> \
>>>>>     <nvpair id="pri_aoe1.1-instance_attributes-device" name="device" value="/dev/xvdb"/> \
>>>>>     <nvpair id="pri_aoe1.1-instance_attributes-nic" name="nic" value="eth0"/> \
>>>>>     <nvpair id="pri_aoe1.1-instance_attributes-shelf" name="shelf" value="1"/> \
>>>>>     <nvpair id="pri_aoe1.1-instance_attributes-slot" name="slot" value="1"/> \
>>>>>   </instance_attributes> \
>>>>>   <instance_attributes id="pri_aoe2.1-instance_attributes"> \
>>>>>     <rule id="node-sanaoe02" score="2"> \
>>>>>       <expression attribute="#uname" id="expr-node-sanaoe2" operation="eq" value="sanaoe02"/> \
>>>>>     </rule> \
>>>>>     <nvpair id="pri_aoe2.1-instance_attributes-device" name="device" value="/dev/xvdb"/> \
>>>>>     <nvpair id="pri_aoe2.1-instance_attributes-nic" name="nic" value="eth1"/> \
>>>>>     <nvpair id="pri_aoe2.1-instance_attributes-shelf" name="shelf" value="2"/> \
>>>>>     <nvpair id="pri_aoe2.1-instance_attributes-slot" name="slot" value="1"/> \
>>>>>   </instance_attributes> \
>>>>> </primitive>
>>>>> primitive pri_dovecot lsb:dovecot \
>>>>>   op start interval="0" timeout="20" \
>>>>>   op stop interval="0" timeout="30" \
>>>>>   op monitor interval="5" timeout="10"
>>>>> primitive pri_spamassassin lsb:spamassassin \
>>>>>   op start interval="0" timeout="50" \
>>>>>   op stop interval="0" timeout="60" \
>>>>>   op monitor interval="5" timeout="20"
>>>>> group grp_aoe pri_aoe1
>>>>> group grp_mailstore pri_dlm pri_clvmd pri_spamassassin pri_dovecot
>>>>> clone cln_mailstore grp_mailstore \
>>>>>   meta ordered="false" interleave="true" clone-max="2"
>>>>> clone cln_san grp_aoe \
>>>>>   meta ordered="true" interleave="true" clone-max="2"
>>>>>
>>>>> As I am in an "opt-in cluster" mode (symmetric-cluster="false"), I
>>>>> have the location constraints below for those hosts:
>>>>>
>>>>> location LOC_AOE_ETHERD_1 cln_san inf: sanaoe01
>>>>> location LOC_AOE_ETHERD_2 cln_san inf: sanaoe02
>>>>> location LOC_MAIL_STORE_1 cln_mailstore inf: ms01
>>>>> location LOC_MAIL_STORE_2 cln_mailstore inf: ms02
>>>>>
>>>>> So far so good. I want to make sure the initiators won't try to search
>>>>> for exported devices before the targets have actually exported them. To
>>>>> do so, I thought I could use the following ordering constraint:
>>>>>
>>>>> order ORD_SAN_MAILSTORE inf: cln_san cln_mailstore
>>>>>
>>>>> Unfortunately, if I add this constraint the clone set "cln_mailstore"
>>>>> never starts (or even stops, if already started, when I add the constraint).
>>>>>
>>>>> Is there something wrong with this ordering rule?
>>>>> Where can I find information on what's going on?
>>>>
>>>> No errors in the logs?
>>>> If you grep for 'pengine' does it want to start them or just leave them
>>>> stopped?
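For reference, the simplified first resource mentioned above would look roughly like this in crm syntax: a single plain AoEtarget primitive instead of the rule-based instance_attributes (an untested sketch; the values are taken from the sanaoe02 attribute set):

    # hypothetical simplified target, no conditional instance attributes
    primitive pri_aoe1 ocf:heartbeat:AoEtarget \
        params device="/dev/xvdb" nic="eth1" shelf="2" slot="1"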
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org