Re: [Pacemaker] weird drbd/cluster behaviour
On 27/06/2013, at 2:21 AM, Саша Александров wrote:
> Hi!
>
> Fencing is disabled for now, the issue is not with fencing: the question is -
> why only one out of three DRBD master-slave sets is recognized by pacemaker,

Pacemaker knows nothing of drbd or any other kind of service. All that
knowledge is encapsulated in the resource agent, so that's where I'd start
looking. Maybe someone from linbit can chime in, since it's their agent :)

> even though all three drbd resources are active and configured properly...
>
> 2013/6/26 Digimer
> I don't see fencing/stonith configured. Without it, your cluster will
> not be stable. You will get DRBD split-brains easily and, depending on
> what you use DRBD for, you could corrupt your data.
>
> On 06/25/2013 09:25 AM, Саша Александров wrote:
> > Hi all!
> >
> > I am setting up a new cluster on OracleLinux 6.4 (well, it is CentOS 6.4).
> > I went through http://clusterlabs.org/quickstart-redhat.html
> > Then I installed DRBD 8.4.2 from elrepo.
> > This setup is unusable :-( with DRBD 8.4.2.
> > I created three DRBD resources:
> >
> > cat /proc/drbd
> > version: 8.4.2 (api:1/proto:86-101)
> > GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by
> > root@flashfon1, 2013-06-24 22:08:41
> >  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-
> >     ns:97659171 nr:0 dw:36 dr:97660193 al:1 bm:5961 lo:0 pe:0 ua:0 ap:0
> >     ep:1 wo:f oos:0
> >  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-
> >     ns:292421653 nr:16 dw:16 dr:292422318 al:0 bm:17848 lo:0 pe:0 ua:0
> >     ap:0 ep:1 wo:f oos:0
> >  2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-
> >     ns:292421600 nr:8 dw:8 dr:292422265 al:0 bm:17848 lo:0 pe:0 ua:0
> >     ap:0 ep:1 wo:f oos:0
> >
> > It appeared that the drbd resource-agent script did not work. Debugging
> > showed that the check_crm_feature_set() function always returned zeroes.
> > OK, just added 'exit' as its first line for now.
> >
> > Next, I created three drbd resources in pacemaker, three master-slave
> > sets, three filesystem resources (and ip resources, but they are no
> > problem):
> >
> > pcs status
> > Last updated: Tue Jun 25 21:20:17 2013
> > Last change: Tue Jun 25 02:46:25 2013 via crm_resource on flashfon1
> > Stack: cman
> > Current DC: flashfon1 - partition with quorum
> > Version: 1.1.8-7.el6-394e906
> > 2 Nodes configured, unknown expected votes
> > 11 Resources configured.
> >
> > Online: [ flashfon1 flashfon2 ]
> >
> > Full list of resources:
> >
> >  Master/Slave Set: ms_wsoft [drbd_wsoft]
> >      Masters: [ flashfon1 ]
> >      Slaves: [ flashfon2 ]
> >  Master/Slave Set: ms_oradata [drbd_oradata]
> >      Slaves: [ flashfon1 flashfon2 ]
> >  Master/Slave Set: ms_flash [drbd_flash]
> >      Slaves: [ flashfon1 flashfon2 ]
> >  Resource Group: WcsGroup
> >      wcs_vip_local  (ocf::heartbeat:IPaddr2):       Started flashfon1
> >      wcs_fs         (ocf::heartbeat:Filesystem):    Started flashfon1
> >  Resource Group: OraGroup
> >      ora_vip_local  (ocf::heartbeat:IPaddr2):       Started flashfon1
> >      oradata_fs     (ocf::heartbeat:Filesystem):    Stopped
> >      oraflash_fs    (ocf::heartbeat:Filesystem):    Stopped
> >
> > See, only one master-slave set is recognizing DRBD state!
> >
> > Resources are configured identically in CIB (except for the drbd resource
> > name parameter):
> >
> > <master id="ms_wsoft">
> >   <primitive id="drbd_wsoft" class="ocf" provider="linbit" type="drbd">
> >     <instance_attributes>
> >       <nvpair name="drbd_resource" value="wsoft"/>
> >     </instance_attributes>
> >     <operations>
> >       <op name="monitor"/>
> >     </operations>
> >   </primitive>
> >   <meta_attributes>
> >     <nvpair name="master-max" value="1"/>
> >     <nvpair name="master-node-max" value="1"/>
> >     <nvpair name="clone-max" value="2"/>
> >     <nvpair name="clone-node-max" value="1"/>
> >     <nvpair name="notify" value="true"/>
> >   </meta_attributes>
> > </master>
> >
> > <master id="ms_oradata">
> >   <primitive id="drbd_oradata" class="ocf" provider="linbit" type="drbd">
> >     <instance_attributes>
> >       <nvpair name="drbd_resource" value="oradata"/>
> >     </instance_attributes>
> >     <operations>
> >       <op name="monitor"/>
> >     </operations>
> >   </primitive>
> >   <meta_attributes>
> >     <nvpair name="master-max" value="1"/>
> >     <nvpair name="master-node-max" value="1"/>
> >     <nvpair name="clone-max" value="2"/>
> >     <nvpair name="clone-node-max" value="1"/>
> >     <nvpair name="notify" value="true"/>
> >   </meta_attributes>
> > </master>
> >
> > I am stuck. :-(
> >
> > Best regards,
> > Alexandr A. Alexandrov
> >
> > ___
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
>
> --
> Best regards, ААА.
Re: [Pacemaker] corosync stop and consequences
On 27/06/2013, at 3:40 AM, andreas graeper wrote:
> thanks for your answer.
> but still a question open.
>
> when i switch off the active node: though this is done reliably for me, the
> still-passive node wants to know for sure and will kill the (already dead)
> former active node.
> i have no stonith hardware (and i could not find till now stonith:null, which
> would help for now) and so i'm still asking whether there is a clean
> way to change roles?
> but maybe the message from Andrew has answered this already. he told: no!
> (if i understand right)

More recent versions include "stonith_admin --confirm name" to say that a
node is safely down after a manual inspection.

But otherwise, no. There is no way to automatically recover the drbd
resource _safely_ without fencing. You'd just be hoping that the other node
really is dead.

> i thought of something like
>   crm resource demote m_drbd   # on active node, and all the resources
>                                # that depend on drbd:master are stopped
>   crm resource promote m_drbd  # on passive node after active node was
> stopped and !!! the passive node was informed that the active one is not
> active any longer and does not need to be killed
>
> if pm actively stops all the resources and demotes drbd, then it knows
> about this condition? at least if all stops and demotes succeed.
>
> and again to that drbd-handler-organized location constraint:
> if it tells that drbd:master must not be started on any node other than n1,
> it could get started on n1 if n1 knew for sure that n2 has gone.
>
> 2013/6/26 Michael Schwartzkopff
> On Wednesday, 26 June 2013, 16:36:43, andreas graeper wrote:
> > hi and thanks.
> > a primitive can be moved to another node. how can i move (change roles)
> > of drbd:master to the other node?
>
> Switch off the other node.
>
> > and will all the depending resources follow?
>
> Depends on your constraints (order/colocation).
>
> > then i can stop (and start again) corosync on passive node without
> > problem (my tiny experience).
>
> Yes.
>
> > another question: i found a location constraint (i think it was set by
> > one of the handlers in drbd-config):
> >
> >   location drbd:master -INF #uname ne n1
> >
> > does this mean: drbd must not get promoted on a node other than n1?!
>
> Yes. See: resource fencing in DRBD.
>
> > i found this in a situation when n2 had gone and drbd could not get
> > promoted on n1, so all the other resources did not start.
> > but the problem was not drbd, because i could manually set it to primary.
>
> Yes, because this location constraint is valid inside the cluster
> software. Not outside.
>
> --
> Dr. Michael Schwartzkopff
> Guardinistr. 63
> 81375 München
>
> Tel: (0163) 172 50 98
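For reference, the manual-confirmation command Andrew mentions takes the node name as the cluster knows it; "n2" below stands in for the node from this thread:

```shell
# Only after checking by hand that the node is genuinely powered off:
stonith_admin --confirm n2
```

Confirming a node that is in fact still alive defeats the purpose of fencing, so this is strictly a last resort after a physical inspection.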
Re: [Pacemaker] Reminder: Pacemaker-1.1.10-rc5 is out there
On 28/06/2013, at 12:52 AM, Lars Marowsky-Bree wrote:
>> Maybe you're right, maybe I should stop fighting it and go with the
>> firefox approach.
>> That certainly seemed to piss a lot of people off though...
>
> If there's one message I've learned in 13 years of work on Linux HA,
> then it is that people are *always* going to be pissed off. ;-)

OK, so are you actually proposing we switch upstream to Pacemaker-{integer}
releases, forsake any notion of stable versions and leave that to the
enterprise distros?
[Pacemaker] Fixed! - Re: Problem with dual-PDU fencing node with redundant PSUs
On 06/26/2013 03:52 PM, Digimer wrote:
> This question appears to be the same issue asked here:
>
> http://oss.clusterlabs.org/pipermail/pacemaker/2013-June/018650.html
>
> In my case, I have two fence methods per node; IPMI first with
> action="reboot" and, if that fails, two PDUs (one backing each side of
> the node's redundant PSUs).
>
> Initially I set up the PDUs with action "reboot", figuring that the
> fencing_topology tied them together, so pacemaker would call "pdu1:port1;
> off -> pdu2:port1; off; (verify both are off) -> pdu1:port1; on ->
> pdu2:port1; on".
>
> This didn't happen though. It called 'pdu1:port1; reboot' then
> "pdu2:port1; reboot", so the first PSU in the node had its power back
> before the second PSU lost power, meaning the node never powered off.
>
> So next I tried:
>
> pdu1:port1; off -> pdu2:port1; off -> pdu1:port1; on -> pdu2:port1; on
>
> However, this seemed to have actually done:
>
> pdu1:port1; reboot -> pdu2:port1; reboot -> pdu1:port1; reboot ->
> pdu2:port1; reboot
>
> So again, the node never lost power to both PSUs at the same time, so
> the node didn't power off.
>
> This makes PDU fencing unreliable. I know beekhof said:
>
> "My point would be that action=off is not the correct way to configure
> what you're trying to do."
>
> in the other thread, but there was no elaboration on what *is* the right
> way. So if neither approach works, what is the proper way to configure
> PDU fencing when you have two different PDUs backing either PSU?
>
> I don't want to disable "reboot" globally because I still want the
> IPMI-based fencing to do action="reboot". If I just do "off", then the
> node will not power back on after a successful fence. This is better
> than nothing, but still quite sub-optimal.

So with the help of several people on IRC yesterday, I seem to have got
this working. The trick was to substitute "pcmk_reboot_action" for "action"
("action" is ignored and the global action is used).

So the working configuration is (in crm syntax):

node $id="1" an-c03n01.alteeve.ca
node $id="2" an-c03n02.alteeve.ca
primitive fence_n01_ipmi stonith:fence_ipmilan \
        params ipaddr="an-c03n01.ipmi" pcmk_reboot_action="reboot" login="admin" passwd="secret" pcmk_host_list="an-c03n01.alteeve.ca"
primitive fence_n01_psu1_off stonith:fence_apc_snmp \
        params ipaddr="an-p01" pcmk_reboot_action="off" port="1" pcmk_host_list="an-c03n01.alteeve.ca"
primitive fence_n01_psu1_on stonith:fence_apc_snmp \
        params ipaddr="an-p01" pcmk_reboot_action="on" port="1" pcmk_host_list="an-c03n01.alteeve.ca"
primitive fence_n01_psu2_off stonith:fence_apc_snmp \
        params ipaddr="an-p02" pcmk_reboot_action="off" port="1" pcmk_host_list="an-c03n01.alteeve.ca"
primitive fence_n01_psu2_on stonith:fence_apc_snmp \
        params ipaddr="an-p02" pcmk_reboot_action="on" port="1" pcmk_host_list="an-c03n01.alteeve.ca"
primitive fence_n02_ipmi stonith:fence_ipmilan \
        params ipaddr="an-c03n02.ipmi" pcmk_reboot_action="reboot" login="admin" passwd="secret" pcmk_host_list="an-c03n02.alteeve.ca" \
        meta target-role="Started"
primitive fence_n02_psu1_off stonith:fence_apc_snmp \
        params ipaddr="an-p01" pcmk_reboot_action="off" port="2" pcmk_host_list="an-c03n02.alteeve.ca"
primitive fence_n02_psu1_on stonith:fence_apc_snmp \
        params ipaddr="an-p01" pcmk_reboot_action="on" port="2" pcmk_host_list="an-c03n02.alteeve.ca"
primitive fence_n02_psu2_off stonith:fence_apc_snmp \
        params ipaddr="an-p02" pcmk_reboot_action="off" port="2" pcmk_host_list="an-c03n02.alteeve.ca"
primitive fence_n02_psu2_on stonith:fence_apc_snmp \
        params ipaddr="an-p02" pcmk_reboot_action="on" port="2" pcmk_host_list="an-c03n02.alteeve.ca"
location loc_fence_n01_ipmi fence_n01_ipmi -inf: an-c03n01.alteeve.ca
location loc_fence_n01_psu1_off fence_n01_psu1_off -inf: an-c03n01.alteeve.ca
location loc_fence_n01_psu1_on fence_n01_psu1_on -inf: an-c03n01.alteeve.ca
location loc_fence_n01_psu2_off fence_n01_psu2_off -inf: an-c03n01.alteeve.ca
location loc_fence_n01_psu2_on fence_n01_psu2_on -inf: an-c03n01.alteeve.ca
location loc_fence_n02_ipmi fence_n02_ipmi -inf: an-c03n02.alteeve.ca
location loc_fence_n02_psu1_off fence_n02_psu1_off -inf: an-c03n02.alteeve.ca
location loc_fence_n02_psu1_on fence_n02_psu1_on -inf: an-c03n02.alteeve.ca
location loc_fence_n02_psu2_off fence_n02_psu2_off -inf: an-c03n02.alteeve.ca
location loc_fence_n02_psu2_on fence_n02_psu2_on -inf: an-c03n02.alteeve.ca
fencing_topology \
        an-c03n01.alteeve.ca: fence_n01_ipmi fence_n01_psu1_off,fence_n01_psu2_off,fence_n01_psu1_on,fence_n01_psu2_on \
        an-c03n02.alteeve.ca: fence_n02_ipmi fence_n02_psu1_off,fence_n02_psu2_off,fence_n02_psu1_on,fence_n02_psu2_on
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-3.1733.a903e62.git.el7-a903e62" \
        cluster-infrastructure="corosync" \
        no-quorum-policy="ignore" \
        stonith-enabled="true"

Again, this is
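For anyone running pcs instead of the crm shell, the same shape of configuration can be approximated as below. This is a sketch only - the device options come from the crm configuration above, but the exact pcs invocations for this era should be checked against pcs(8):

```shell
# One of the eight outlet devices (repeat for the other PDU/port pairs):
pcs stonith create fence_n01_psu1_off fence_apc_snmp \
    ipaddr="an-p01" pcmk_reboot_action="off" port="1" \
    pcmk_host_list="an-c03n01.alteeve.ca"

# Fencing levels: IPMI first, then all-off followed by all-on:
pcs stonith level add 1 an-c03n01.alteeve.ca fence_n01_ipmi
pcs stonith level add 2 an-c03n01.alteeve.ca \
    fence_n01_psu1_off,fence_n01_psu2_off,fence_n01_psu1_on,fence_n01_psu2_on
```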
Re: [Pacemaker] Problem with dual-PDU fencing node with redundant PSUs
On 06/27/2013 11:45 AM, Lars Marowsky-Bree wrote:
> On 2013-06-27T11:32:36, Digimer wrote:
>>>> time and I expect many users will run into this problem as they try to
>>>> migrate to RHEL 7. I see no reason why this can't be properly handled
>>>> in pacemaker directly.
>>>
>>> Yes, why not, choice is a good thing ;-)
>>
>> If an established configuration is not supported in the only remaining
>> HA stack (RHEL 7 won't have cman/rgmanager anymore), then there is no
>> choice.
>
> What I meant to say: I don't see why that couldn't be fixed, if it's
> considered important for RHEL7.

Ah, yes. I do hope this is the case. I've been talking to two or three
other people on freenode's #linux-ha who are trying to sort out this exact
configuration and none of us have been successful yet. I am sure it is
something easy to someone experienced in pacemaker, but it seems to be
fairly non-obvious to those of us earlier in the learning process.
Re: [Pacemaker] Problem configuring a simple ping resource
Indeed, Lars: just editing the symmetric-cluster attribute and setting it
back to true made the resource work. This also helped me clarify the
concepts of asymmetric and symmetric clusters.

Thanks,

Jacobo García

On Thu, Jun 27, 2013 at 5:10 PM, Lars Marowsky-Bree wrote:
> On 2013-06-27T17:01:37, Jacobo García wrote:
>
>> Enable asymmetric clustering:
>> crm_attribute --attr-name symmetric-cluster --attr-value false
>>
>> Then I configure the resource:
>> crm configure primitive ping ocf:pacemaker:ping params
>> host_list="10.34.151.73" op monitor interval=15s timeout=5s
>> WARNING: ping: default timeout 20s for start is smaller than the advised 60
>> WARNING: ping: specified timeout 5s for monitor is smaller than the advised 60
>>
>> This is what I get now:
>>
>> crm status
>>
>> Last updated: Wed Jun 26 16:49:20 2013
>> Last change: Wed Jun 26 16:48:01 2013 via cibadmin on ip-10-35-147-209
>> Stack: openais
>> Current DC: ip-10-35-147-209 - partition with quorum
>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>> 2 Nodes configured, 2 expected votes
>> 1 Resources configured.
>>
>> Online: [ ip-10-34-151-73 ip-10-35-147-209 ]
>>
>> crm resource list
>> ping    (ocf::pacemaker:ping)   Stopped
>
> You've configured the cluster so that no node is eligible to run
> resources unless you allow it (when you turned off symmetric
> clustering). So you need to provide explicit location constraints to
> allow the ping resource to run. (And you probably want to clone it.)
>
> Or, easiest, don't switch to asymmetric clustering. I don't think that's
> what you want.
>
> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Pacemaker] Problem with dual-PDU fencing node with redundant PSUs
On 2013-06-27T11:32:36, Digimer wrote:
>>> time and I expect many users will run into this problem as they try to
>>> migrate to RHEL 7. I see no reason why this can't be properly handled in
>>> pacemaker directly.
>>
>> Yes, why not, choice is a good thing ;-)
>
> If an established configuration is not supported in the only remaining
> HA stack (RHEL 7 won't have cman/rgmanager anymore), then there is no
> choice.

What I meant to say: I don't see why that couldn't be fixed, if it's
considered important for RHEL7.

Regards,
Lars
Re: [Pacemaker] Problem with dual-PDU fencing node with redundant PSUs
On 06/27/2013 11:08 AM, Lars Marowsky-Bree wrote:
> On 2013-06-27T10:56:40, Digimer wrote:
>
>> However, this feels like a really bad solution. It's not uncommon to
>> have two separate power rails feeding either side of the node's PSUs.
>> Particularly in HA environments.
>
> True. But gating them through the same power switch is *not* a SPoF from
> the cluster's perspective, "just" for the single node (if the power
> switch fails).

Actually, it is. If the single PDU dies, then you lose IPMI-based fencing
and PDU fencing, so the cluster can not (safely) recover, leaving things
blocked until a human intervenes.

> On the other hand, each of the two switches/PDUs (and the network
> interconnect to each) becomes a SPoF for *fencing* the node, since you
> need an ACK from both PDUs. Basically, that doubles the unreliability of
> the environment. And if, indeed, you lose power to one of the grids, *you
> can no longer fence* via this mechanism.

In my setups, this is not the case. I use two switches (all nodes are
bonded across the two) with IPMI in the first switch and the PDUs in the
second switch. If a PDU dies, then the IPMI interface is still powered and
will work for fencing. If a switch dies, then either IPMI or the PDU
fencing remains available. So no single failure will take out fencing.

> Thus, this only makes sense as a fall-back mechanism, obviously. If we
> have both (say, IPMI + dual switch), we actually want to not try them in
> sequence though, but in parallel - to lower recovery time. (Waiting for
> the IPMI network timeout isn't nice.)

I want them in sequence because an IPMI "success" is more trustworthy than
the PDUs' "success" (which says only "yes, the outlets were opened"). You
are right, it slows down recovery, but the PDUs should not be needed except
in corner cases, so, given the preference for IPMI fencing, I am willing to
incur the extra recovery time in such cases.

> Personally, I've tried to discourage users from building such
> environments. Since most of our customers have something like shared
> storage, I much prefer shared-storage-based fencing these days.

An alternate philosophy. :)

>> time and I expect many users will run into this problem as they try to
>> migrate to RHEL 7. I see no reason why this can't be properly handled in
>> pacemaker directly.
>
> Yes, why not, choice is a good thing ;-)

If an established configuration is not supported in the only remaining
HA stack (RHEL 7 won't have cman/rgmanager anymore), then there is no
choice.
Re: [Pacemaker] Problem configuring a simple ping resource
On 2013-06-27T17:01:37, Jacobo García wrote:
> Enable asymmetric clustering:
> crm_attribute --attr-name symmetric-cluster --attr-value false
>
> Then I configure the resource:
> crm configure primitive ping ocf:pacemaker:ping params
> host_list="10.34.151.73" op monitor interval=15s timeout=5s
> WARNING: ping: default timeout 20s for start is smaller than the advised 60
> WARNING: ping: specified timeout 5s for monitor is smaller than the advised 60
>
> This is what I get now:
>
> crm status
>
> Last updated: Wed Jun 26 16:49:20 2013
> Last change: Wed Jun 26 16:48:01 2013 via cibadmin on ip-10-35-147-209
> Stack: openais
> Current DC: ip-10-35-147-209 - partition with quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
>
> Online: [ ip-10-34-151-73 ip-10-35-147-209 ]
>
> crm resource list
> ping    (ocf::pacemaker:ping)   Stopped

You've configured the cluster so that no node is eligible to run
resources unless you allow it (when you turned off symmetric
clustering). So you need to provide explicit location constraints to
allow the ping resource to run. (And you probably want to clone it.)

Or, easiest, don't switch to asymmetric clustering. I don't think that's
what you want.

Regards,
Lars
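A minimal sketch of the clone-plus-constraints approach described above, in crm shell syntax, using the node names from the status output; the constraint ids and the score of 100 are made up for the example:

```shell
crm configure clone ping_clone ping
crm configure location ping_on_73 ping_clone 100: ip-10-34-151-73
crm configure location ping_on_209 ping_clone 100: ip-10-35-147-209
```

In an asymmetric (opt-in) cluster every resource needs a positive location score somewhere before it can start, which is why the constraints are required here and not in the default symmetric case.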
Re: [Pacemaker] Problem with dual-PDU fencing node with redundant PSUs
On 2013-06-27T10:56:40, Digimer wrote:
> However, this feels like a really bad solution. It's not uncommon to
> have two separate power rails feeding either side of the node's PSUs.
> Particularly in HA environments.

True. But gating them through the same power switch is *not* a SPoF from
the cluster's perspective, "just" for the single node (if the power
switch fails).

On the other hand, each of the two switches/PDUs (and the network
interconnect to each) becomes a SPoF for *fencing* the node, since you
need an ACK from both PDUs. Basically, that doubles the unreliability of
the environment. And if, indeed, you lose power to one of the grids, *you
can no longer fence* via this mechanism.

Thus, this only makes sense as a fall-back mechanism, obviously. If we
have both (say, IPMI + dual switch), we actually want to not try them in
sequence though, but in parallel - to lower recovery time. (Waiting for
the IPMI network timeout isn't nice.)

Personally, I've tried to discourage users from building such
environments. Since most of our customers have something like shared
storage, I much prefer shared-storage-based fencing these days.

> time and I expect many users will run into this problem as they try to
> migrate to RHEL 7. I see no reason why this can't be properly handled in
> pacemaker directly.

Yes, why not, choice is a good thing ;-)

Regards,
Lars
[Pacemaker] Problem configuring a simple ping resource
Hello,

I am trying to configure a simple ping resource in order to start
understanding the mechanics of pacemaker.

I have successfully configured corosync between 2 EC2 nodes in the same
region, with the following configuration:

logging {
        to_logfile: yes
        logfile: /var/log/corosync.log
}

totem {
        version: 2
        secauth: on
        interface {
                member {
                        memberaddr: 10.35.147.209
                }
                member {
                        memberaddr: 10.34.151.73
                }
                ringnumber: 0
                bindnetaddr: 10.35.147.209
                mcastport: 694
                ttl: 1
        }
        transport: udpu
        token: 1
}

service {
        # Load the Pacemaker Cluster Resource Manager
        ver: 0
        name: pacemaker
}

The other node has the opposite IP addresses in memberaddr and bindnetaddr,
and UDP port 694 is open in the Amazon Security Groups. Everything looks
fine:

crm status

Last updated: Wed Jun 26 16:44:40 2013
Last change: Wed Jun 26 16:44:32 2013 via crmd on ip-10-35-147-209
Stack: openais
Current DC: ip-10-35-147-209 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
0 Resources configured.

Online: [ ip-10-34-151-73 ip-10-35-147-209 ]

Afterwards:

Disable STONITH:
crm_attribute -t crm_config -n stonith-enabled -v false

Disable quorum:
crm_attribute -n no-quorum-policy -v ignore

Enable asymmetric clustering:
crm_attribute --attr-name symmetric-cluster --attr-value false

Then I configure the resource:
crm configure primitive ping ocf:pacemaker:ping params
host_list="10.34.151.73" op monitor interval=15s timeout=5s
WARNING: ping: default timeout 20s for start is smaller than the advised 60
WARNING: ping: specified timeout 5s for monitor is smaller than the advised 60

This is what I get now:

crm status

Last updated: Wed Jun 26 16:49:20 2013
Last change: Wed Jun 26 16:48:01 2013 via cibadmin on ip-10-35-147-209
Stack: openais
Current DC: ip-10-35-147-209 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
1 Resources configured.

Online: [ ip-10-34-151-73 ip-10-35-147-209 ]

crm resource list
ping    (ocf::pacemaker:ping)   Stopped

crm configure show
node ip-10-34-151-73
node ip-10-35-147-209
primitive ping ocf:pacemaker:ping \
        params host_list="10.34.151.73" \
        op monitor interval="15s" timeout="5s"
property $id="cib-bootstrap-options" \
        dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        symmetric-cluster="false"

In this link I have pasted the whole contents of the logs of both servers:
https://gist.github.com/therobot/c650149aa52e36a29ba6

And in this one I have pasted the output of running `cibadmin -Q`:
https://gist.github.com/therobot/39e36bfb07839086c3db

Is anything else needed to get this configured? Browsing the log, it seems
that the resource pingd cannot run anywhere. Am I missing something?

Thanks in advance for the help,

Jacobo García
Re: [Pacemaker] Problem with dual-PDU fencing node with redundant PSUs
On 2013-06-27T16:52:02, Dejan Muhamedagic wrote:
>> I don't want the cluster stack to start on boot, so I disable
>> pacemaker/corosync. However, I do want the node to power back on so that
>> I can log into it when the alarms go off. Yes, I could log into the good
>> node, manually unfence/boot it and then log in, but this adds minutes to
>> the MTTR that I would really like to avoid.
>
> Certainly it adds a bit of time, but only to the node's MTTR,
> not the cluster's MTTR. Anyway, if pacemaker can turn off the
> node, then a short script can also turn it on.

sbd has gained a mechanism that allows the rebooting node to see if it was
fenced, and only then to not start the cluster stack. I wonder if something
similar could be adapted to other fencing mechanisms?

(It is somewhat similar to corosync's new two-node quorum handling, I
guess.)

Regards,
Lars
Re: [Pacemaker] Problem with dual-PDU fencing node with redundant PSUs
On 06/27/2013 10:52 AM, Dejan Muhamedagic wrote: > On Thu, Jun 27, 2013 at 09:54:13AM -0400, Digimer wrote: >> On 06/27/2013 07:02 AM, Dejan Muhamedagic wrote: >>> Hi, >>> >>> On Wed, Jun 26, 2013 at 03:52:00PM -0400, Digimer wrote: This question appears to be the same issue asked here: http://oss.clusterlabs.org/pipermail/pacemaker/2013-June/018650.html In my case, I have two fence methods per node; IPMI first with action="reboot" and, if that fails, two PDUs (one backing each side of the node's redundant PSUs). Initially I setup the PDUs as action "reboot" figuring that the fence_toplogy tied them together, so pacemaker would call "pdu1:port1; off -> pdu2:port1; off; (verify both are off) -> pdu1:port1; on -> pdu2:port1; on". This didn't happen though. It called 'pdu1:port1; reboot' then "pdu2:port1; reboot", so the first PSU in the node had it's power back before the second PSU lost power, meaning the node never powered off. >>> >>> I'm not sure if that's supported. >> >> Unless I am misunderstood, beekhof indicated that it is/should be. > > I'm pretty sure that it's not, but perhaps things changed in the > meantime. At least it wasn't when we discussed the > implementation. > So next I tried; pdu1:port1; off -> pdu2:port1; off -> pdu1:port1; on -> pdu1:port1; on However, this seemed to have actually done; pdu1:port1; reboot -> pdu2:port1; reboot -> pdu1:port1; reboot -> pdu1:port1; reboot So again, the node never lost power to both PSUs at the same time, so the node didn't power off. This makes PDU fencing unreliable. I know beekhof said: "My point would be that action=off is not the correct way to configure what you're trying to do." in the other thread, but there was no elaborating on what *is* the right way. So if neither approach works, what is the proper way for configure PDU fencing when you have two different PDUs backing either PSU? 
>>> >>> The fence action needs to be defined in the cluster properties >>> (crm_config/cluster_property_set in XML): >>> >>> # crm configure property stonith-action=off >>> >>> See the output of: >>> >>> $ crm ra info pengine >>> >>> for the PE metadata and explanation of properties. >> >> In irc last night, beekhof mentioned that action="..." is ignored and >> replaced. However, it would appear that pcmk_reboot_action="..." should >> force the issue. I'm planning to test this today. > > Yes, true, though it's a bit of a kludge > (pcmk_reboot_action="off" if I got that right). > I don't want to disable "reboot" globally because I still want the IPMI based fencing to do action="reboot". >>> >>> I don't think it is possible to define a per-resource fencing >>> action. >>> If I just do "off", then the node will not power back on after a successful fence. This is better than nothing, but still quite sub-optimal. >>> >>> Yes, if you want to start the cluster stack automatically on >>> reboot. Anyway, I think that it would be preferred to let a human >>> check why the node got fenced before letting it join the cluster >>> again. In that case, one just needs to boot the host manually. >>> >>> Thanks, >>> >>> Dejan >> >> I don't want the cluster stack to start on boot, so I disable >> pacemaker/corosync. However, I do want the node to power back on so that >> I can log into it when the alarms go off. Yes, I could log into the good >> node, manually unfence/boot it and then log in, but this adds minutes to >> the MTTR that I would reay like to avoid. > > Certainly it adds a bit of time, but only to the node's MTTR, > not the cluster's MTTR. Anyway, if pacemaker can turn off the > node, then a short script can also turn it on. > > Cheers, > > Dejan If I need to write a script, I will instead write a new fence agent that handles multiple PDUs in a sensible fashion. 
I'm already thinking of "fence_apc_multi", which would take a string of addresses and ports and do a clean "off" on all, verify all are off, then an "on" on all. This would make the pacemaker config a lot simpler and cleaner and allow "reboot" to remain the default action. However, this feels like a really bad solution. It's not uncommon to have two separate power rails feeding either side of a node's redundant PSUs, particularly in HA environments. RHCS has supported this for a very long time, and I expect many users will run into this problem as they try to migrate to RHEL 7. I see no reason why this can't be properly handled in pacemaker directly. I'm hoping it is and I am just too new to pacemaker to realize my mistake. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker
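The "fence_apc_multi" idea above can be sketched as a thin wrapper around an existing fence agent. This is a hypothetical illustration, not a real agent: the outlet list, the dry-run default, and the use of fence_apc's -a/-n/-o options are assumptions, and a production agent would also need credentials, retries, and the metadata action stonithd expects.

```shell
#!/bin/sh
# Hypothetical "fence_apc_multi" sketch: cut power on EVERY listed PDU
# outlet, verify all are off, and only then restore power on all of
# them, so a node with redundant PSUs is guaranteed to go dark.
#
# FENCE_CMD defaults to "echo" (dry run); point it at fence_apc
# (plus login/password options) to drive real hardware.
FENCE_CMD="${FENCE_CMD:-echo}"
# Outlets as "pdu_address:port" pairs, one per PSU feed.
OUTLETS="${OUTLETS:-pdu1:1 pdu2:1}"

fence_all() {   # $1 = off | on | status
    for outlet in $OUTLETS; do
        addr="${outlet%%:*}"
        port="${outlet##*:}"
        # fence_apc takes -a (address), -n (plug) and -o (action)
        $FENCE_CMD -a "$addr" -n "$port" -o "$1" || return 1
    done
}

fence_all off    || exit 1   # both feeds off first...
fence_all status || exit 1   # ...confirm the node is dark...
fence_all on     || exit 1   # ...then both feeds back on
echo "multi-PDU reboot complete"
```

Run with the defaults it just prints the call sequence; the point is that both "off" calls complete (and are verified) before any "on" call, which is exactly what the per-device "reboot" topology described above fails to guarantee.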
Re: [Pacemaker] Reminder: Pacemaker-1.1.10-rc5 is out there
On 2013-06-27T20:50:34, Andrew Beekhof wrote: > There was one :-) > I merged the best bits of three parallel CPG code paths. > The things that prompted the extra bits in one also applied to the others. Ah, that wasn't so obvious to me when I tried making sense of the commit. ;-) But that's cleared up now. > Maybe you're right, maybe I should stop fighting it and go with the > firefox approach. > That certainly seemed to piss a lot of people off though... If there's one lesson I've learned in 13 years of work on Linux HA, it is that people are *always* going to be pissed off. ;-) > Then I guess I'm confused by what you meant by: > >>> Distributions can take care of them when they integrate them; basically > >>> they'll trickle through until the whole stack the distributions ship > >>> builds again. > Didn't that relate to essentially holding back not-critical-to-you changes? Yes and no. Not in the sense of backports or selective back-outs (as long as I can avoid them), but in the sense of only releasing the update once the downstream dependencies have also been adjusted. > The day you suggest every message include a unique and immutable ID... > lets just say you'd better start polishing that katana. I'm trying to find the time to look more into SNMP interfaces and traps. That'd probably address some more of the concerns about customers having to churn through the logs. And, of course, tools like crm shell's history explorer help a lot, because that abstracts the automatic parsing from users. Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Pacemaker] Problem with dual-PDU fencing node with redundant PSUs
On Thu, Jun 27, 2013 at 09:54:13AM -0400, Digimer wrote: > On 06/27/2013 07:02 AM, Dejan Muhamedagic wrote: > > Hi, > > > > On Wed, Jun 26, 2013 at 03:52:00PM -0400, Digimer wrote: > >> This question appears to be the same issue asked here: > >> > >> http://oss.clusterlabs.org/pipermail/pacemaker/2013-June/018650.html > >> > >> In my case, I have two fence methods per node; IPMI first with > >> action="reboot" and, if that fails, two PDUs (one backing each side of > >> the node's redundant PSUs). > >> > >> Initially I setup the PDUs as action "reboot" figuring that the > >> fence_toplogy tied them together, so pacemaker would call "pdu1:port1; > >> off -> pdu2:port1; off; (verify both are off) -> pdu1:port1; on -> > >> pdu2:port1; on". > >> > >> This didn't happen though. It called 'pdu1:port1; reboot' then > >> "pdu2:port1; reboot", so the first PSU in the node had it's power back > >> before the second PSU lost power, meaning the node never powered off. > > > > I'm not sure if that's supported. > > Unless I am misunderstood, beekhof indicated that it is/should be. I'm pretty sure that it's not, but perhaps things changed in the meantime. At least it wasn't when we discussed the implementation. > >> So next I tried; > >> > >> pdu1:port1; off -> pdu2:port1; off -> pdu1:port1; on -> pdu1:port1; on > >> > >> However, this seemed to have actually done; > >> > >> pdu1:port1; reboot -> pdu2:port1; reboot -> pdu1:port1; reboot -> > >> pdu1:port1; reboot > >> > >> So again, the node never lost power to both PSUs at the same time, so > >> the node didn't power off. > >> > >> This makes PDU fencing unreliable. I know beekhof said: > >> > >> "My point would be that action=off is not the correct way to configure > >> what you're trying to do." > >> > >> in the other thread, but there was no elaborating on what *is* the right > >> way. So if neither approach works, what is the proper way for configure > >> PDU fencing when you have two different PDUs backing either PSU? 
> > > > The fence action needs to be defined in the cluster properties > > (crm_config/cluster_property_set in XML): > > > > # crm configure property stonith-action=off > > > > See the output of: > > > > $ crm ra info pengine > > > > for the PE metadata and explanation of properties. > > In irc last night, beekhof mentioned that action="..." is ignored and > replaced. However, it would appear that pcmk_reboot_action="..." should > force the issue. I'm planning to test this today. Yes, true, though it's a bit of a kludge (pcmk_reboot_action="off" if I got that right). > >> I don't want to disable "reboot" globally because I still want the > >> IPMI based fencing to do action="reboot". > > > > I don't think it is possible to define a per-resource fencing > > action. > > > >> If I just do "off", then the > >> node will not power back on after a successful fence. This is better > >> than nothing, but still quite sub-optimal. > > > > Yes, if you want to start the cluster stack automatically on > > reboot. Anyway, I think that it would be preferred to let a human > > check why the node got fenced before letting it join the cluster > > again. In that case, one just needs to boot the host manually. > > > > Thanks, > > > > Dejan > > I don't want the cluster stack to start on boot, so I disable > pacemaker/corosync. However, I do want the node to power back on so that > I can log into it when the alarms go off. Yes, I could log into the good > node, manually unfence/boot it and then log in, but this adds minutes to > the MTTR that I would reay like to avoid. Certainly it adds a bit of time, but only to the node's MTTR, not the cluster's MTTR. Anyway, if pacemaker can turn off the node, then a short script can also turn it on. Cheers, Dejan > cheers > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? 
Re: [Pacemaker] Problem with dual-PDU fencing node with redundant PSUs
On 06/27/2013 07:02 AM, Dejan Muhamedagic wrote: > Hi, > > On Wed, Jun 26, 2013 at 03:52:00PM -0400, Digimer wrote: >> This question appears to be the same issue asked here: >> >> http://oss.clusterlabs.org/pipermail/pacemaker/2013-June/018650.html >> >> In my case, I have two fence methods per node; IPMI first with >> action="reboot" and, if that fails, two PDUs (one backing each side of >> the node's redundant PSUs). >> >> Initially I set up the PDUs as action "reboot", figuring that the >> fencing topology tied them together, so pacemaker would call "pdu1:port1; >> off -> pdu2:port1; off; (verify both are off) -> pdu1:port1; on -> >> pdu2:port1; on". >> >> This didn't happen though. It called 'pdu1:port1; reboot' then >> "pdu2:port1; reboot", so the first PSU in the node had its power back >> before the second PSU lost power, meaning the node never powered off. > > I'm not sure if that's supported. Unless I misunderstood, beekhof indicated that it is/should be. >> So next I tried; >> >> pdu1:port1; off -> pdu2:port1; off -> pdu1:port1; on -> pdu2:port1; on >> >> However, this seemed to have actually done; >> >> pdu1:port1; reboot -> pdu2:port1; reboot -> pdu1:port1; reboot -> >> pdu2:port1; reboot >> >> So again, the node never lost power to both PSUs at the same time, so >> the node didn't power off. >> >> This makes PDU fencing unreliable. I know beekhof said: >> >> "My point would be that action=off is not the correct way to configure >> what you're trying to do." >> >> in the other thread, but there was no elaboration on what *is* the right >> way. So if neither approach works, what is the proper way to configure >> PDU fencing when you have two different PDUs, each backing one of the node's PSUs? > > The fence action needs to be defined in the cluster properties > (crm_config/cluster_property_set in XML): > > # crm configure property stonith-action=off > > See the output of: > > $ crm ra info pengine > > for the PE metadata and explanation of properties. 
In irc last night, beekhof mentioned that action="..." is ignored and replaced. However, it would appear that pcmk_reboot_action="..." should force the issue. I'm planning to test this today. >> I don't want to disable "reboot" globally because I still want the >> IPMI based fencing to do action="reboot". > > I don't think it is possible to define a per-resource fencing > action. > >> If I just do "off", then the >> node will not power back on after a successful fence. This is better >> than nothing, but still quite sub-optimal. > > Yes, if you want to start the cluster stack automatically on > reboot. Anyway, I think that it would be preferred to let a human > check why the node got fenced before letting it join the cluster > again. In that case, one just needs to boot the host manually. > > Thanks, > > Dejan I don't want the cluster stack to start on boot, so I disable pacemaker/corosync. However, I do want the node to power back on so that I can log into it when the alarms go off. Yes, I could log into the good node, manually unfence/boot it and then log in, but this adds minutes to the MTTR that I would really like to avoid. cheers -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?
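If pcmk_reboot_action does work as hoped, the per-device override discussed above might look something like the following crm configuration. This is an untested sketch: the resource names, PDU addresses, and port numbers are made up, and whether pcmk_reboot_action="off" really leaves both outlets powered off is exactly the open question in this thread.

```
# IPMI fencing keeps the default "reboot" action
primitive fence_n01_ipmi stonith:fence_ipmilan \
    params ipaddr="10.20.0.1" login="admin" passwd="secret" action="reboot"

# PDU devices: translate a "reboot" request into "off" for these
# devices only, instead of setting stonith-action=off globally
primitive fence_n01_pdu1 stonith:fence_apc \
    params ipaddr="10.20.1.1" port="1" pcmk_reboot_action="off"
primitive fence_n01_pdu2 stonith:fence_apc \
    params ipaddr="10.20.1.2" port="1" pcmk_reboot_action="off"

# Try IPMI first; fall back to both PDUs as a single level
fencing_topology node1: fence_n01_ipmi fence_n01_pdu1,fence_n01_pdu2
```

The intent is that the comma-joined second level must succeed on both PDU devices for the fence to count as successful.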
Re: [Pacemaker] WARNINGS and ERRORS on syslog after update to 1.1.7
On 25/06/2013, at 9:44 PM, Francesco Namuri wrote: >> Can you attach /var/lib/pengine/pe-input-64.bz2 from SERVERNAME1 please? >> >> I'll be able to see if it's something we've already fixed. Nope, still there. I will attempt to fix this tomorrow.
Re: [Pacemaker] Node name problems after upgrading to 1.1.9
On 27/06/2013, at 10:29 PM, Bernardo Cabezas Serra wrote: > Hello, > > Ohhh, sorry, but I have deleted node selavi and restarted, and now it works > OK and I can't reproduce the bug :( That is unfortunate > > On 27/06/13 12:32, Andrew Beekhof wrote: >> No, more likely a bug. Which is concerning since I thought I had this >> particular kind ironed out. >> >> Could you set PCMK_trace_functions=crm_get_peer on selavi and repeat the >> test? >> >> The exact way to do this will depend on your distro (which is?). >> On most rpm-based distros it's done in /etc/sysconfig/pacemaker > > the distro is ubuntu 12.04 LTS (precise) > > As I have installed all in /opt/ha, the file is /opt/ha/etc/default/pacemaker. > The setting is only for debugging, isn't it? I can't see more info, but now > it works. You should see additional logs sent to /var/log/pacemaker.log > > The strange thing is that I previously deleted both nodes, and cleaned > up selavi's var/lib/pacemaker state, but it continued to fail. > > The long explanation of what I have done: > - Reverted selavi's corosync back to 2.3.0 (turifel still on the 2.3.0.66 git > version) > - On starting, the issue persisted > - On turifel, in crm node status, I saw two nodes with the same uname (selavi) > but different IDs. > - From turifel's crm, deleted selavi two times (with some warnings). > - Put the cluster unmanaged > - Rebooted the turifel node (because of a dlm lock failure due to a wrong > corosync stop) > - Set the PCMK_trace_functions=crm_get_peer setting on both nodes. > - On start, all worked :/ > - Upgraded selavi's corosync back to 2.3.0.66 git > - Still working. > > Will continue trying to reproduce the issue. > > PS: the PCMK_trace_functions environment variable is only for debugging purposes, > isn't it? (I mean: it can't have resolved the issue) Correct > > Thanks so much for your help! > Regards, > Bernardo > > -- > APSL > *Bernardo Cabezas Serra* > *Responsable Sistemas* > Camí Vell de Bunyola 37, esc. A, local 7 > 07009 Polígono de Son Castelló, Palma > Mail: bcabe...@apsl.net > Skype: bernat.cabezas > Tel: 971439771
Re: [Pacemaker] Node name problems after upgrading to 1.1.9
Hi Emmanuel, On 27/06/13 12:26, emmanuel segura wrote: > Hello Bernardo > I don't know if this is the problem, but try this option > clear_node_high_bit Thanks so much, but as I stated to Andrew, I can't reproduce the issue any more. It now works fine after some rebooting and node deleting. Best regards -- APSL *Bernardo Cabezas Serra* *Responsable Sistemas* Camí Vell de Bunyola 37, esc. A, local 7 07009 Polígono de Son Castelló, Palma Mail: bcabe...@apsl.net Skype: bernat.cabezas Tel: 971439771
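The clear_node_high_bit option mentioned above lives in the totem section of corosync.conf. A minimal sketch (the cluster name is a placeholder; per the warning later in this thread, the option must be set identically on all nodes):

```
totem {
    version: 2
    cluster_name: example
    # Only relevant when no explicit nodeid is configured: force the
    # auto-generated nodeid's high bit to zero so it is always a
    # positive signed 32-bit integer.
    clear_node_high_bit: yes
}
```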
Re: [Pacemaker] Node name problems after upgrading to 1.1.9
Hello, Ohhh, sorry, but I have deleted node selavi and restarted, and now it works OK and I can't reproduce the bug :( On 27/06/13 12:32, Andrew Beekhof wrote: > No, more likely a bug. Which is concerning since I thought I had this > particular kind ironed out. > > Could you set PCMK_trace_functions=crm_get_peer on selavi and repeat the test? > > The exact way to do this will depend on your distro (which is?). > On most rpm-based distros it's done in /etc/sysconfig/pacemaker the distro is ubuntu 12.04 LTS (precise) As I have installed all in /opt/ha, the file is /opt/ha/etc/default/pacemaker. The setting is only for debugging, isn't it? I can't see more info, but now it works. The strange thing is that I previously deleted both nodes, and cleaned up selavi's var/lib/pacemaker state, but it continued to fail. The long explanation of what I have done: - Reverted selavi's corosync back to 2.3.0 (turifel still on the 2.3.0.66 git version) - On starting, the issue persisted - On turifel, in crm node status, I saw two nodes with the same uname (selavi) but different IDs. - From turifel's crm, deleted selavi two times (with some warnings). - Put the cluster unmanaged - Rebooted the turifel node (because of a dlm lock failure due to a wrong corosync stop) - Set the PCMK_trace_functions=crm_get_peer setting on both nodes. - On start, all worked :/ - Upgraded selavi's corosync back to 2.3.0.66 git - Still working. Will continue trying to reproduce the issue. PS: the PCMK_trace_functions environment variable is only for debugging purposes, isn't it? (I mean: it can't have resolved the issue) Thanks so much for your help! Regards, Bernardo -- APSL *Bernardo Cabezas Serra* *Responsable Sistemas* Camí Vell de Bunyola 37, esc. A, local 7 07009 Polígono de Son Castelló, Palma Mail: bcabe...@apsl.net Skype: bernat.cabezas Tel: 971439771
Re: [Pacemaker] [OT] MySQL Replication
On Thu, 27 Jun 2013 21:59:29 +1000 Andrew Beekhof wrote: > I would _highly_ recommend 0.14 for use with any version of pacemaker > that uses libqb. Hi Andrew, using the libqb-dev from Debian Sid I was able to compile the recent pacemaker version (from github) without errors. I haven't had time yet to check whether this version will restart all my services after a restored quorum (see the "Recovery after lost quorum" thread on this list), but I'll check that later. Best regards Denis Witt
Re: [Pacemaker] [OT] MySQL Replication
On 27/06/2013, at 1:53 AM, Denis Witt wrote: > On Wed, 26 Jun 2013 21:33:30 +1000 > Andrew Beekhof wrote: > >>> When you run ./autogen.sh it tries to start an rpm command, this >>> failed because I didn't had rpm installed. >> >> How did it fail? >> That whole block if intended to be skipped if rpm isn't available. >> >>if [ -e `which rpm` ]; then >> echo "Suggested invocation:" >> rpm --eval %{configure} | grep -v program-prefix >>fi > > ./autogen.sh > autoreconf: Entering directory `.' > autoreconf: configure.ac: not using Gettext > autoreconf: running: aclocal --force --warnings=no-portability -I m4 > autoreconf: configure.ac: tracing > autoreconf: configure.ac: subdirectory libltdl not present > autoreconf: running: libtoolize --force > libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, `.'. > libtoolize: linking file `./ltmain.sh' > libtoolize: putting macros in `m4'. > libtoolize: linking file `m4/libtool.m4' > libtoolize: linking file `m4/ltoptions.m4' > libtoolize: linking file `m4/ltsugar.m4' > libtoolize: linking file `m4/ltversion.m4' > libtoolize: linking file `m4/lt~obsolete.m4' > libtoolize: Consider adding `AC_CONFIG_MACRO_DIR([m4])' to configure.ac > and libtoolize: rerunning libtoolize, to keep the correct libtool > macros in-tree. autoreconf: running: /usr/bin/autoconf --force > --warnings=no-portability autoreconf: running: /usr/bin/autoheader > --force --warnings=no-portability autoreconf: running: automake > --add-missing --force-missing --warnings=no-portability > fencing/Makefile.am:91: `CFLAGS' is a user variable, you should not > override it; fencing/Makefile.am:91: use `AM_CFLAGS' instead. > lib/ais/Makefile.am:33: `CFLAGS' is a user variable, you should not > override it; lib/ais/Makefile.am:33: use `AM_CFLAGS' instead. > lib/common/Makefile.am:33: `CFLAGS' is a user variable, you should not > override it; lib/common/Makefile.am:33: use `AM_CFLAGS' instead. > autoreconf: Leaving directory `.' 
Now run ./configure > Now run configure with any arguments (eg. --prefix) specific to your > system Suggested invocation: > ./autogen.sh: 38: ./autogen.sh: rpm: not found > > Nothing serious... Ah, I'll just suppress that message. > > >>> Which is the latest pacemaker version that is compatible with libqb >>> 0.11? > >> Ummm, I don't know to be honest. > > Might it be possible that the version I tried to compile two or three > weeks ago was compatible, as I was able to run ./configure (by hand) > successfully? No. Just that the configure check was broken. > >> Any reason not to upgrade it? > > I'd like to stick with the debian repository version to receive > updates. For testing I might use the version from sid, which is > 0.14. I would _highly_ recommend 0.14 for use with any version of pacemaker that uses libqb.
Re: [Pacemaker] Problem with dual-PDU fencing node with redundant PSUs
Hi, On Wed, Jun 26, 2013 at 03:52:00PM -0400, Digimer wrote: > This question appears to be the same issue asked here: > > http://oss.clusterlabs.org/pipermail/pacemaker/2013-June/018650.html > > In my case, I have two fence methods per node; IPMI first with > action="reboot" and, if that fails, two PDUs (one backing each side of > the node's redundant PSUs). > > Initially I set up the PDUs as action "reboot", figuring that the > fencing topology tied them together, so pacemaker would call "pdu1:port1; > off -> pdu2:port1; off; (verify both are off) -> pdu1:port1; on -> > pdu2:port1; on". > > This didn't happen though. It called 'pdu1:port1; reboot' then > "pdu2:port1; reboot", so the first PSU in the node had its power back > before the second PSU lost power, meaning the node never powered off. I'm not sure if that's supported. > So next I tried; > > pdu1:port1; off -> pdu2:port1; off -> pdu1:port1; on -> pdu2:port1; on > > However, this seemed to have actually done; > > pdu1:port1; reboot -> pdu2:port1; reboot -> pdu1:port1; reboot -> > pdu2:port1; reboot > > So again, the node never lost power to both PSUs at the same time, so > the node didn't power off. > > This makes PDU fencing unreliable. I know beekhof said: > > "My point would be that action=off is not the correct way to configure > what you're trying to do." > > in the other thread, but there was no elaboration on what *is* the right > way. So if neither approach works, what is the proper way to configure > PDU fencing when you have two different PDUs, each backing one of the node's PSUs? The fence action needs to be defined in the cluster properties (crm_config/cluster_property_set in XML): # crm configure property stonith-action=off See the output of: $ crm ra info pengine for the PE metadata and explanation of properties. > I don't want to disable "reboot" globally because I still want the > IPMI based fencing to do action="reboot". I don't think it is possible to define a per-resource fencing action. 
> If I just do "off", then the > node will not power back on after a successful fence. This is better > than nothing, but still quite sub-optimal. Yes, if you want to start the cluster stack automatically on reboot. Anyway, I think that it would be preferred to let a human check why the node got fenced before letting it join the cluster again. In that case, one just needs to boot the host manually. Thanks, Dejan > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education?
Re: [Pacemaker] Reminder: Pacemaker-1.1.10-rc5 is out there
On 27/06/2013, at 5:40 PM, Lars Marowsky-Bree wrote: > On 2013-06-27T14:28:19, Andrew Beekhof wrote: > >> I wouldn't say the 6 months between 1.1.7 and 1.1.8 was a particularly >> aggressive release cycle. > > For the amount of changes in there, I think yes. And the intrusive ones > didn't show up all at the beginning of that cycle, either. That just > made integration difficult - basically, we ended up skipping 1.1.8/1.1.9 > and went mostly with 1.1.10-rcx. > >> Generally though, it has always been hard/dangerous to backport >> specific fixes because of the complexity of the interactions - >> particularly in the PE. > > Yes, that's why we try to avoid backports ;-) > >>> Yeah, that one. If it fixes a bug, it was probably unavoidable >>> (though the specific commit >>> (953bedf8f7f54f91a76672aeee5f44dc465741e9) didn't mention a bugzilla >>> id). >> It has always been the case that I find and fix far more bugs than >> people report. >> I don't plan to start filing bugs for myself. > > True, I didn't mean that was necessary. A one liner about the fix would > have been helpful, though. There was one :-) I merged the best bits of three parallel CPG code paths. The things that prompted the extra bits in one also applied to the others. > >> In this case though, I made an exception because the plan is to NOT have >> another 1.1.x and "it is still my intention not to have API changes in >> 2.0.x, so better before than after". > > True. I admit I would probably have taken that as a reason for inserting > one more 1.1.x release, but I understand you have your own scheduling > requirements. I'm also quite over doing releases. I've been wanting to close the door on 1.1 since 2010, every extra release just leaves more time for new features (and cleanups) to be dreamt up. Maybe you're right, maybe I should stop fighting it and go with the firefox approach. That certainly seemed to piss a lot of people off though... 
> >> Granted I had completely forgotten about the plugin editions of ocfs2/dlm, >> but I was told you'd already deep frozen what you were planning to ship, so >> I don't understand the specific concern. > > We're already working towards the first maintenance update ;-) > > And it's not a specific concern (we'll delay that one until it's all > settled again), but we were discussing changes and how to schedule > releases for them; and from the point of view as a consumer/user, > introducing such a change in what was supposed to be the "last" release > candidate was a huge surprise. > >> There is never a good point to make these changes, even if I make them just >> after a release people will just grumble when it comes time to look at the >> next one - just like you did above for 1.1.8. > > No, that's not quite it. 1.1.8 *was* special. Also look at the impact it > had on our users; LRM rewrite (which implies huge logging changes), > crmsh split, changes to m/s ... 1.1.8 was an exception in what otherwise > is a rather impressive track record of releases. Normally, integrating > new releases has always been rather straightforward. But I don't want to > dwell on that. > >>> I don't really mind the API changes much, for me it's mostly a >>> question of timing and how big the pile is at every single >>> release. >> I thought you wanted longer release cycles... wouldn't that make the >> pile bigger? > > No, I generally tend to want shorter release cycles, with a more > moderate flow of changes. (As far as that is concerned.) If only life were so predictable. Of course there is often a wishlist, but what _actually_ goes into a release is very much a product of how many and what types of bugs are getting reported (what code needs the most attention and how much "free" time I have). The amount of help I get is a big factor too. Without NTT looking after 1.0.x or David making short work of various features, I definitely wouldn't have gotten to half of these cleanups. 
So no master plan to make anyone's life hell, just "Oh look, somehow I have a chunk of time to do something useful in" or "If I don't fix it now I'll have to look at it for the next decade". > >> And is it not better to batch them up and have one point of >> incompatibility rather than a continuous stream of them? > > Well, that means we end up backporting the changes that can't wait. And > the cliff to jump from to the next release just becomes larger and more > deterring. > >>> If you consider the API from a customer point of view, the things like >>> build or runtime dependencies on shared libraries aren't so much of an >>> issue - hopefully, the distribution provider hides that before they >>> release updates. Hence my "Oh well, I don't care" stance. >> Except if it affects ocfs2/dlm/sbd? > > I don't care for the fact that the changes were made, much. The new > interface looks like a win and code reduction. My comment was, as stated > above, specifically about doing such a change in a late release candidate.
Re: [Pacemaker] KVM live migration and multipath
Hi, On Wed, Jun 26, 2013 at 06:41:29PM +0200, Sven Arnold wrote: > > >>Do you manage the multipath daemon with pacemaker? In my setup multipath > >>is started at boot time and not managed by pacemaker. > > > >Yes. And all multipath records as well. That is the part of my iscsi RA. > > Ah, you are using your own RA? Is this one public? > I use lsb:open-iscsi since it looked to me that the included iscsi > RA relies on the operating system to start the iscsi daemon and I > wanted to avoid that. Yes it does. But that's lsb:open-iscsi, so you cannot really avoid using it :) Otherwise, it'd be interesting to see the other iscsi RA and whether the good stuff from there can be included in the upstream iscsi. Cheers, Dejan
Re: [Pacemaker] Node name problems after upgrading to 1.1.9
On 27/06/2013, at 8:20 PM, Bernardo Cabezas Serra wrote: > Do you think it's a configuration problem? No, more likely a bug. Which is concerning since I thought I had this particular kind ironed out. Could you set PCMK_trace_functions=crm_get_peer on selavi and repeat the test? The exact way to do this will depend on your distro (which is?). On most rpm-based distros it's done in /etc/sysconfig/pacemaker
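For anyone following along, enabling the function tracing Andrew describes is a one-line sysconfig change (paths vary by distro and build prefix; the /opt/ha path below matches the custom install discussed later in this thread):

```
# /etc/sysconfig/pacemaker on most rpm-based distros,
# /opt/ha/etc/default/pacemaker for this thread's /opt/ha build.
PCMK_trace_functions=crm_get_peer
# Trace output lands in the detail log, typically:
# PCMK_logfile=/var/log/pacemaker.log
```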
Re: [Pacemaker] Node name problems after upgrading to 1.1.9
Hello Bernardo, I don't know if this is the problem, but try this option: clear_node_high_bit. This configuration option is optional and is only relevant when no nodeid is specified. Some openais clients require a signed 32-bit nodeid that is greater than zero; however, by default openais uses all 32 bits of the IPv4 address space when generating a nodeid. Set this option to yes to force the high bit to be zero and therefore ensure the nodeid is a positive signed 32-bit integer. WARNING: the cluster's behavior is undefined if this option is enabled on only a subset of the cluster (for example during a rolling upgrade). Thanks 2013/6/27 Bernardo Cabezas Serra > Hello, > > Our cluster was working OK on the corosync stack, with corosync 2.3.0 and > pacemaker 1.1.8. > > After upgrading (full versions and configs below), we began to have > problems with node names. > It's a two-node cluster, with node names "turifel" (DC) and "selavi". > > When selavi joins the cluster, we have this warning in the selavi log: > > - > Jun 27 11:54:29 selavi attrd[11998]: notice: corosync_node_name: > Unable to get node name for nodeid 168385827 > Jun 27 11:54:29 selavi attrd[11998]: notice: get_node_name: Defaulting > to uname -n for the local corosync node name > - > > This is ok, and also happened with version 1.1.8. > > At the corosync level, all seems ok: > > Jun 27 11:51:18 turifel corosync[6725]: [TOTEM ] A processor joined or > left the membership and a new membership (10.9.93.35:1184) was formed. > Jun 27 11:51:18 turifel corosync[6725]: [QUORUM] Members[2]: 168385827 > 168385835 > Jun 27 11:51:18 turifel corosync[6725]: [MAIN ] Completed service > synchronization, ready to provide service. 
> Jun 27 11:51:18 turifel crmd[19526]: notice: crm_update_peer_state: > pcmk_quorum_notification: Node selavi[168385827] - state is now member > (was lost) > --- > > But when starting pacemaker on selavi (the new node), turifel log shows > this: > > > Jun 27 11:54:28 turifel crmd[19526]: notice: do_state_transition: > State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN > cause=C_FSA_INTERNAL origin=peer_update_callback ] > Jun 27 11:54:28 turifel crmd[19526]: warning: crm_get_peer: Node > 'selavi' and 'selavi' share the same cluster nodeid: 168385827 > Jun 27 11:54:28 turifel crmd[19526]: warning: crmd_cs_dispatch: > Recieving messages from a node we think is dead: selavi[0] > Jun 27 11:54:29 turifel crmd[19526]: warning: crm_get_peer: Node > 'selavi' and 'selavi' share the same cluster nodeid: 168385827 > Jun 27 11:54:29 turifel crmd[19526]: warning: do_state_transition: Only > 1 of 2 cluster nodes are eligible to run resources - continue 0 > Jun 27 11:54:29 turifel attrd[19524]: notice: attrd_local_callback: > Sending full refresh (origin=crmd) > > > And selavi remains on pending state. Some times turifel (DC) fences > selavi, but other times remains pending forever. > > On turifel node, all resources gives warnings like this one: > warning: custom_action: Action p_drbd_ha0:0_monitor_0 on selavi is > unrunnable (pending) > > On both nodes, uname -n and crm_node -n gives correct node names (selavi > and turifel respectively) > > ¿Do you think it's a configuration problem? > > > Below I give information about versions and configurations. > > Best regards, > Bernardo. > > > - > Versions (git/hg compiled versions): > > corosync: 2.3.0.66-615d > pacemaker: 1.1.9-61e4b8f > cluster-glue: 1.0.11 > libqb: 0.14.4.43-bb4c3 > resource-agents: 3.9.5.98-3b051 > crmsh: 1.2.5 > > Cluster also has drbd, dlm and gfs2, but I think versions are unrelevant > here. 
> > > Output of pacemaker configuration: > ./configure --prefix=/opt/ha --without-cman \ > --without-heartbeat --with-corosync \ > --enable-fatal-warnings=no --with-lcrso-dir=/opt/ha/libexec/lcrso > > pacemaker configuration: > Version = 1.1.9 (Build: 61e4b8f) > Features = generated-manpages ascii-docs ncurses > libqb-logging libqb-ipc lha-fencing upstart nagios corosync-native snmp > libesmtp > > Prefix = /opt/ha > Executables = /opt/ha/sbin > Man pages= /opt/ha/share/man > Libraries= /opt/ha/lib > Header files = /opt/ha/include > Arch-independent files = /opt/ha/share > State information= /opt/ha/var > System configuration = /opt/ha/etc > Corosync Plugins = /opt/ha/lib > > Use system LTDL = yes > > HA group name= haclient > HA user name = hacluster > > CFLAGS = -I/opt/ha/include -I/opt/ha/include > -I/opt/ha/include/heartbeat-I/opt/ha/include -I/opt/ha/include > -ggdb -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return > -Wbad-function-cast -Wcast-align -Wdeclaration-after-statement > -Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-
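The clear_node_high_bit option described above lives in the totem section of corosync.conf. A minimal sketch, assuming the openais/corosync 1.x-era syntax the option description refers to (all values here are illustrative, not taken from this cluster's config):

```
totem {
    version: 2
    # Optional; only relevant when no explicit nodeid is configured.
    # Forces auto-generated nodeids to be positive signed 32-bit integers.
    clear_node_high_bit: yes
}
```

Per the warning quoted above, it must be enabled on all nodes at once, never on just a subset.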
[Pacemaker] Node name problems after upgrading to 1.1.9
Hello,

Our cluster was working OK on the corosync stack, with corosync 2.3.0 and pacemaker 1.1.8.

After upgrading (full versions and configs below), we began to have problems with node names. It's a two-node cluster, with node names "turifel" (DC) and "selavi".

When selavi joins the cluster, we get this warning in the selavi log:

Jun 27 11:54:29 selavi attrd[11998]: notice: corosync_node_name: Unable to get node name for nodeid 168385827
Jun 27 11:54:29 selavi attrd[11998]: notice: get_node_name: Defaulting to uname -n for the local corosync node name

This is OK, and also happened with version 1.1.8.

At the corosync level, all seems OK:

Jun 27 11:51:18 turifel corosync[6725]: [TOTEM ] A processor joined or left the membership and a new membership (10.9.93.35:1184) was formed.
Jun 27 11:51:18 turifel corosync[6725]: [QUORUM] Members[2]: 168385827 168385835
Jun 27 11:51:18 turifel corosync[6725]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 27 11:51:18 turifel crmd[19526]: notice: crm_update_peer_state: pcmk_quorum_notification: Node selavi[168385827] - state is now member (was lost)

But when starting pacemaker on selavi (the new node), the turifel log shows this:

Jun 27 11:54:28 turifel crmd[19526]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
Jun 27 11:54:28 turifel crmd[19526]: warning: crm_get_peer: Node 'selavi' and 'selavi' share the same cluster nodeid: 168385827
Jun 27 11:54:28 turifel crmd[19526]: warning: crmd_cs_dispatch: Recieving messages from a node we think is dead: selavi[0]
Jun 27 11:54:29 turifel crmd[19526]: warning: crm_get_peer: Node 'selavi' and 'selavi' share the same cluster nodeid: 168385827
Jun 27 11:54:29 turifel crmd[19526]: warning: do_state_transition: Only 1 of 2 cluster nodes are eligible to run resources - continue 0
Jun 27 11:54:29 turifel attrd[19524]: notice: attrd_local_callback: Sending full refresh (origin=crmd)

And selavi remains in the pending state. Sometimes turifel (the DC) fences selavi, but other times it remains pending forever.

On the turifel node, all resources give warnings like this one:

warning: custom_action: Action p_drbd_ha0:0_monitor_0 on selavi is unrunnable (pending)

On both nodes, uname -n and crm_node -n give the correct node names (selavi and turifel respectively).

Do you think it's a configuration problem?

Below I give information about versions and configurations.

Best regards,
Bernardo.

Versions (git/hg compiled):

corosync: 2.3.0.66-615d
pacemaker: 1.1.9-61e4b8f
cluster-glue: 1.0.11
libqb: 0.14.4.43-bb4c3
resource-agents: 3.9.5.98-3b051
crmsh: 1.2.5

The cluster also has drbd, dlm and gfs2, but I think those versions are irrelevant here.

Output of pacemaker configuration:

./configure --prefix=/opt/ha --without-cman \
    --without-heartbeat --with-corosync \
    --enable-fatal-warnings=no --with-lcrso-dir=/opt/ha/libexec/lcrso

pacemaker configuration:
Version = 1.1.9 (Build: 61e4b8f)
Features = generated-manpages ascii-docs ncurses libqb-logging libqb-ipc lha-fencing upstart nagios corosync-native snmp libesmtp

Prefix = /opt/ha
Executables = /opt/ha/sbin
Man pages = /opt/ha/share/man
Libraries = /opt/ha/lib
Header files = /opt/ha/include
Arch-independent files = /opt/ha/share
State information = /opt/ha/var
System configuration = /opt/ha/etc
Corosync Plugins = /opt/ha/lib

Use system LTDL = yes

HA group name = haclient
HA user name = hacluster

CFLAGS = -I/opt/ha/include -I/opt/ha/include -I/opt/ha/include/heartbeat -I/opt/ha/include -I/opt/ha/include -ggdb -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return -Wbad-function-cast -Wcast-align -Wdeclaration-after-statement -Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-security -Wformat-nonliteral -Wmissing-prototypes -Wmissing-declarations -Wnested-externs -Wno-long-long -Wno-strict-aliasing -Wunused-but-set-variable -Wpointer-arith -Wstrict-prototypes -Wwrite-strings

Libraries = -lgnutls -lcorosync_common -lplumb -lpils -lqb -lbz2 -lxslt -lxml2 -lc -luuid -lpam -lrt -ldl -lglib-2.0 -lltdl -L/opt/ha/lib -lqb -ldl -lrt -lpthread

Stack Libraries = -L/opt/ha/lib -lqb -ldl -lrt -lpthread -L/opt/ha/lib -lcpg -L/opt/ha/lib -lcfg -L/opt/ha/lib -lcmap -L/opt/ha/lib -lquorum

Corosync config:

totem {
    version: 2
    crypto_cipher: none
    crypto_hash: none
    cluster_name: fiestaha
    interface {
        ringnumber: 0
        ttl: 1
        bindnetaddr: 10.9.93.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}
logging {
    fileline: off
    to_stderr: yes
    to_logf
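An aside for readers puzzling over the numeric node IDs in the logs above: when corosync auto-generates a nodeid it derives it from the node's IPv4 address, so the IDs can be decoded back to addresses with plain shell arithmetic (nothing cluster-specific):

```shell
# Decode a corosync auto-generated nodeid back into the IPv4 address it
# was derived from: the nodeid is just the address as a 32-bit integer.
nodeid=168385827   # selavi, per the crm_update_peer_state line above
printf '%d.%d.%d.%d\n' \
    $(( (nodeid >> 24) & 255 )) \
    $(( (nodeid >> 16) & 255 )) \
    $(( (nodeid >> 8)  & 255 )) \
    $((  nodeid        & 255 ))
# → 10.9.93.35
```

The result matches the membership address in the TOTEM log line, and 168385835 decodes the same way to 10.9.93.43, so the two IDs are consistent with the 10.9.93.0 bindnetaddr in the config.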
Re: [Pacemaker] pcs and ping location rule
On 27/06/2013 01:22, Chris Feist wrote:
> On 06/24/13 16:33, Mailing List SVR wrote:
>> Hi,
>>
>> I defined this clone resource for connectivity check:
>>
>> pcs resource create ping ocf:pacemaker:ping host_list="10.0.2.2" multiplier="1000" dampen=10s op monitor interval=60s
>> pcs resource clone ping ping_clone globally-unique=false
>>
>> These work, but now I need to add a location rule to make the service
>> switch to the node that can reach the gateway. With crm I used
>> something like this:
>>
>> location database_on_connected_node database_resource \
>>     rule $id="database_on_connected_node-rule" pingd: defined pingd
>>
>> How do I do the same using pcs?
>
> I'm still working on fully implementing rules in pcs, but you can run
> the following command with the latest pcs (this was just fixed today):
>
> pcs constraint location database_resource rule pingd: defined pingd
>
> Let me know if you have any issues implementing this setting.

I installed pcs from git and it works, thanks!

Nicola

> Thanks!
> Chris
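For anyone landing here from a search, the working sequence from this thread, gathered in one place (requires a pcs build recent enough to support location rules, as discussed above; resource and attribute names are the thread's own):

```
pcs resource create ping ocf:pacemaker:ping \
    host_list="10.0.2.2" multiplier="1000" dampen=10s \
    op monitor interval=60s
pcs resource clone ping ping_clone globally-unique=false
pcs constraint location database_resource rule pingd: defined pingd
```

One caveat worth noting: "defined pingd" only checks that the attribute exists, and the ping agent may still set pingd to 0 on a node where the host is unreachable, so some setups prefer a "pingd gt 0" style rule instead.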
Re: [Pacemaker] Reminder: Pacemaker-1.1.10-rc5 is out there
On 2013-06-27T14:28:19, Andrew Beekhof wrote:

> I wouldn't say the 6 months between 1.1.7 and 1.1.8 was a particularly
> aggressive release cycle.

For the amount of changes in there, I think yes. And the intrusive ones didn't show up all at the beginning of that cycle, either. That just made integration difficult - basically, we ended up skipping 1.1.8/1.1.9 and went mostly with 1.1.10-rcx.

> Generally though, it has always been hard/dangerous to backport
> specific fixes because of the complexity of the interactions -
> particularly in the PE.

Yes, that's why we try to avoid backports ;-)

> > Yeah, that one. If it fixes a bug, it was probably unavoidable
> > (though the specific commit
> > (953bedf8f7f54f91a76672aeee5f44dc465741e9) didn't mention a bugzilla
> > id).
> It has always been the case that I find and fix far more bugs than
> people report.
> I don't plan to start filing bugs for myself.

True, I didn't mean that was necessary. A one-liner about the fix would have been helpful, though.

> In this case though, I made an exception because the plan is to NOT have
> another 1.1.x and "it is still my intention not to have API changes in
> 2.0.x, so better before than after".

True. I admit I would probably have taken that as a reason for inserting one more 1.1.x release, but I understand you have your own scheduling requirements.

> Granted I had completely forgotten about the plugin editions of
> ocfs2/dlm, but I was told you'd already deep-frozen what you were
> planning to ship, so I don't understand the specific concern.

We're already working towards the first maintenance update ;-) And it's not a specific concern (we'll delay that one until it's all settled again), but we were discussing changes and how to schedule releases for them; and from the point of view of a consumer/user, introducing such a change in what was supposed to be the "last" release candidate was a huge surprise.
> There is never a good point to make these changes, even if I make them
> just after a release people will just grumble when it comes time to look
> at the next one - just like you did above for 1.1.8.

No, that's not quite it. 1.1.8 *was* special. Also look at the impact it had on our users: the LRM rewrite (which implies huge logging changes), the crmsh split, changes to m/s ... 1.1.8 was an exception in what otherwise is a rather impressive track record of releases. Normally, integrating new releases has always been rather straightforward. But I don't want to dwell on that.

> > I don't really mind the API changes much, for me it's mostly a
> > question of timing and how big the pile is at every single
> > release.
> I thought you wanted longer release cycles... wouldn't that make the
> pile bigger?

No, I generally tend to want shorter release cycles, with a more moderate flow of changes. (As far as that is concerned.)

> And is it not better to batch them up and have one point of
> incompatibility rather than a continuous stream of them?

Well, that means we end up backporting the changes that can't wait. And the cliff to jump from to the next release just becomes larger and more deterring.

> > If you consider the API from a customer point of view, things like
> > build or runtime dependencies on shared libraries aren't so much of an
> > issue - hopefully, the distribution provider hides that before they
> > release updates. Hence my "Oh well, I don't care" stance.
> Except if it affects ocfs2/dlm/sbd?

I don't much mind that the changes were made. The new interface looks like a win and a code reduction. My comment was, as stated above, specifically about doing such a change in a late release candidate.
> > What's more troublesome are changes to existing commands (even
> > something minimal like "crm_resource -M" now generating a short
> > location constraint,
> I find it confusing how a contained 10-line change to a CLI tool is
> troublesome

Because it shows through to users. And it's then that I start having to worry about users complaining when their scripts break (scripts that expected the tools to do something specific; though I hope that in this case it doesn't - we'll document it as an optimization).

> but you're prepared to wear the overhead of holding back
> API changes - which usually impact all sorts of code paths, sometimes
> across multiple projects.

We don't hold them back. We'll fix the dependencies before the release. It's my intention to keep being as close to a (tested) upstream version as possible, because everything else ends up being very costly. The difference is that the latter is a cost that is internal to us as developers, and not something that users/customers would possibly care about. (Unless they've written their own add-on code, which is truly rare.)

> CLI output I can usually be convinced of, but log messages are most
> definitely not something I will consider as a valid programming
> interface.

Agreed. It's not an API, that's for sure. Yet it's one of the i