[Pacemaker] Starting all resources on one node when HA starts in a 2-node configuration
Hi All,

Here is my description. While configuring HA, I used this CLI command:

    crm configure location HTTPD Httpd \
        rule $id="HTTPD-rule" 100: #uname eq hatest1 \
        rule $id="HTTPD-rule1" 200: #uname eq hatest2

where Httpd is the resource, given score 100 for hatest1 and score 200 for the second node, hatest2. Similarly, there are three other resources where I have given score 100 for the first node and score 200 for the second node. When HA starts, it checks the scores and starts the processes on hatest2.

Is there a better way to do this, such that heartbeat/pacemaker checks a node-level configuration rather than a location constraint per resource?

Regards,
Rakesh

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
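One alternative, sketched below, is to put all four resources in a group and attach a single location preference to the group, so the per-node scores are expressed once instead of once per resource. The names res2, res3, res4 are hypothetical stand-ins for the other three resources:

```
group all-services Httpd res2 res3 res4
location prefer-node all-services \
    rule $id="prefer-node-rule" 100: #uname eq hatest1 \
    rule $id="prefer-node-rule1" 200: #uname eq hatest2
```

Note that a group also implies ordering and colocation among its members; if that is not wanted, separate colocation constraints against one "anchor" resource achieve a similar effect.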
[Pacemaker] Clearing a resource which returned "not installed" from START
I am running Pacemaker 1.0.9 and Heartbeat 3.0.3. I started a resource and the agent start method returned "OCF_ERR_INSTALLED". I have fixed the problem and would like to restart the resource, but I cannot get it to restart. Any ideas?

Thanks, Bob

The failcounts are 0, as shown below and with the crm_resource command:

    # crm_mon -1 -f

    Last updated: Wed Mar 30 19:55:39 2011
    Stack: Heartbeat
    Current DC: mgraid-sd6661-0 (f4e5e15c-d06b-4e37-89b9-4621af05128f) - partition with quorum
    Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
    2 Nodes configured, unknown expected votes
    5 Resources configured.

    Online: [ mgraid-sd6661-1 mgraid-sd6661-0 ]

    Clone Set: Fencing
        Started: [ mgraid-sd6661-1 mgraid-sd6661-0 ]
    Clone Set: cloneIcms
        Started: [ mgraid-sd6661-1 mgraid-sd6661-0 ]
    Clone Set: cloneOmserver
        Started: [ mgraid-sd6661-1 mgraid-sd6661-0 ]
    Master/Slave Set: ms-SSSD6661
        Masters: [ mgraid-sd6661-0 ]
        Slaves: [ mgraid-sd6661-1 ]
    Master/Slave Set: ms-SSJD6662
        Masters: [ mgraid-sd6661-0 ]
        Stopped: [ SSJD6662:0 ]

    Migration summary:
    * Node mgraid-sd6661-0:
    * Node mgraid-sd6661-1:

    Failed actions:
        SSJD6662:0_start_0 (node=mgraid-sd6661-1, call=27, rc=5, status=complete): not installed

I have also tried to clean up the resource with these commands:

    # crm_resource --resource SSJD6662:0 --cleanup --node mgraid-sd6661-1
    # crm_resource --resource SSJD6662:1 --cleanup --node mgraid-sd6661-1
    # crm_resource --resource SSJD6662:0 --cleanup --node mgraid-sd6661-0
    # crm_resource --resource SSJD6662:1 --cleanup --node mgraid-sd6661-0
    # crm_resource --resource ms-SSJD6662 --cleanup --node mgraid-sd6661-1
    # crm resource start SSJD6662:0

My configuration is:

    node $id="856c1f72-7cd1-4906-8183-8be87eef96f2" mgraid-sd6661-1
    node $id="f4e5e15c-d06b-4e37-89b9-4621af05128f" mgraid-sd6661-0
    primitive SSJD6662 ocf:omneon:ss \
        params ss_resource="SSJD6662" ssconf="/var/omneon/config/config.JD6662" \
        op monitor interval="3s" role="Master" timeout="7s" \
        op monitor interval="10s" role="Slave" timeout="7" \
        op stop interval="0" timeout="20" \
        op start interval="0" timeout="300"
    primitive SSSD6661 ocf:omneon:ss \
        params ss_resource="SSSD6661" ssconf="/var/omneon/config/config.SD6661" \
        op monitor interval="3s" role="Master" timeout="7s" \
        op monitor interval="10s" role="Slave" timeout="7" \
        op stop interval="0" timeout="20" \
        op start interval="0" timeout="300"
    primitive icms lsb:S53icms \
        op monitor interval="5s" timeout="7" \
        op start interval="0" timeout="5"
    primitive mgraid-stonith stonith:external/mgpstonith \
        params hostlist="mgraid-canister" \
        op monitor interval="0" timeout="20s"
    primitive omserver lsb:S49omserver \
        op monitor interval="5s" timeout="7" \
        op start interval="0" timeout="5"
    ms ms-SSJD6662 SSJD6662 \
        meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
    ms ms-SSSD6661 SSSD6661 \
        meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
    clone Fencing mgraid-stonith
    clone cloneIcms icms
    clone cloneOmserver omserver
    location ms-SSJD6662-master-w1 ms-SSJD6662 \
        rule $id="ms-SSJD6662-master-w1-rule" $role="master" 100: #uname eq mgraid-sd6661-1
    location ms-SSSD6661-master-w1 ms-SSSD6661 \
        rule $id="ms-SSSD6661-master-w1-rule" $role="master" 100: #uname eq mgraid-sd6661-0
    order orderms-SSJD6662 0: cloneIcms ms-SSJD6662
    order orderms-SSSD6661 0: cloneIcms ms-SSSD6661
    property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="Heartbeat" \
        dc-deadtime="5s" \
        stonith-enabled="true" \
        last-lrm-refresh="1301536426"
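For what it's worth, a couple of things sometimes help in this situation (a sketch, not verified against 1.0.9): cleaning up the master/slave parent rather than the :0/:1 instance names, and then asking the cluster to re-probe resource state once the underlying problem is fixed:

```
# crm resource cleanup ms-SSJD6662
# crm_resource --reprobe
```

The reprobe forces the lrmd to re-run the probe (monitor) operations, which should clear a stale "not installed" result if the agent's prerequisites are now in place.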
Re: [Pacemaker] lrmd: WARN: G_SIG_dispatch: Dispatch function for S 1000 ms (> 100 ms) before being called
On 3/31/2011 at 06:05 AM, Jean-Francois Malouin wrote:
> Hi,
>
> A little more than a month ago I posted about the subject line warning and
> was told that they were harmless unless very frequent. They are now
> popping up more than 10 times a day.
> I was asked to create a bug report if I wanted more info. So now I
> have an hb_report ready to go. Excuse the naive question, but where/how
> do I submit it?

http://developerbugs.linux-foundation.org/enter_bug.cgi

HTH,
Tim

--
Tim Serong
Senior Clustering Engineer, OPS Engineering, Novell Inc.
[Pacemaker] [Problem] A restart caused by a clone resource failure affects resources on other nodes
Hi All,

We tested failure of a clone resource using the following procedure.

Step 1) Start a cluster of three nodes:

    Last updated: Thu Mar 31 10:01:47 2011
    Stack: Heartbeat
    Current DC: srv03 (e2ffc1ed-3ebe-47e2-b51b-b0f04b454311) - partition with quorum
    Version: 1.0.10-9342a4147fc69f2081f8563a34509da5be0a89d0
    3 Nodes configured, unknown expected votes
    4 Resources configured.

    Node srv01 (45f985d7-e7c8-4834-b01b-16b99526672b): online
        main_rsc    (ocf::pacemaker:Dummy) Started
        prmDummy1:0 (ocf::pacemaker:Dummy) Started
        prmPingd:0  (ocf::pacemaker:ping) Started
    Node srv02 (ed7fdcbf-9c17-4f31-8a27-a831a6b39ed5): online
        prmDummy1:1 (ocf::pacemaker:Dummy) Started
        main_rsc2   (ocf::pacemaker:Dummy) Started
        prmPingd:1  (ocf::pacemaker:ping) Started
    Node srv03 (e2ffc1ed-3ebe-47e2-b51b-b0f04b454311): online
        prmDummy1:2 (ocf::pacemaker:Dummy) Started
        prmPingd:2  (ocf::pacemaker:ping) Started

    Inactive resources:

    Migration summary:
    * Node srv01: pingd=1
    * Node srv03: pingd=1
    * Node srv02: pingd=1

Step 2) On node srv01, make the clone resource fail:

    [root@srv01 ~]# rm -rf /var/run/Dummy-prmDummy1.state

Step 3) On node srv02, the pingd clone is restarted, and as a side effect of this restart main_rsc2 is restarted as well.
* The instance numbering of the clones also becomes strange somehow.
    [root@srv02 ~]# tail -f /var/log/ha-log | grep stop
    Mar 31 10:02:22 srv02 crmd: [24471]: info: do_lrm_rsc_op: Performing key=29:4:0:6c32b0f8-d37a-4ebc-8365-30e2e02ba9d3 op=prmPingd:1_stop_0 )
    Mar 31 10:02:25 srv02 lrmd: [24468]: info: rsc:prmPingd:1:12: stop
    Mar 31 10:02:25 srv02 crmd: [24471]: info: process_lrm_event: LRM operation prmPingd:1_stop_0 (call=12, rc=0, cib-update=21, confirmed=true) ok
    Mar 31 10:02:33 srv02 crmd: [24471]: info: do_lrm_rsc_op: Performing key=9:5:0:6c32b0f8-d37a-4ebc-8365-30e2e02ba9d3 op=main_rsc2_stop_0 )
    Mar 31 10:02:33 srv02 lrmd: [24468]: info: rsc:main_rsc2:14: stop
    Mar 31 10:02:33 srv02 crmd: [24471]: info: process_lrm_event: LRM operation main_rsc2_stop_0 (call=14, rc=0, cib-update=23, confirmed=true) ok

    Last updated: Thu Mar 31 10:02:40 2011
    Stack: Heartbeat
    Current DC: srv03 (e2ffc1ed-3ebe-47e2-b51b-b0f04b454311) - partition with quorum
    Version: 1.0.10-9342a4147fc69f2081f8563a34509da5be0a89d0
    3 Nodes configured, unknown expected votes
    4 Resources configured.

    Node srv01 (45f985d7-e7c8-4834-b01b-16b99526672b): online
    Node srv02 (ed7fdcbf-9c17-4f31-8a27-a831a6b39ed5): online
        prmDummy1:1 (ocf::pacemaker:Dummy) Started -> :1 (numbering now odd)
        prmPingd:0  (ocf::pacemaker:ping) Started -> :0 (numbering now odd)
    Node srv03 (e2ffc1ed-3ebe-47e2-b51b-b0f04b454311): online
        main_rsc    (ocf::pacemaker:Dummy) Started
        prmDummy1:2 (ocf::pacemaker:Dummy) Started -> :2 (numbering now odd)
        prmPingd:1  (ocf::pacemaker:ping) Started -> :1 (numbering now odd)

    Inactive resources:
    main_rsc2 (ocf::pacemaker:Dummy): Stopped
    Clone Set: clnDummy1
        Started: [ srv02 srv03 ]
        Stopped: [ prmDummy1:0 ]
    Clone Set: clnPingd
        Started: [ srv02 srv03 ]
        Stopped: [ prmPingd:2 ]

    Migration summary:
    * Node srv01:
        prmDummy1:0: migration-threshold=1 fail-count=1
    * Node srv03: pingd=1
    * Node srv02: pingd=1

    Failed actions:
        prmDummy1:0_monitor_1 (node=srv01, call=8, rc=7, status=complete): not running

We think the restart of pingd on node srv02 is unnecessary. Is there a method to resolve this problem?
Possibly the following bug may be related:
* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2508

I registered the log (with hb_report attached) in Bugzilla:
* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2574

Best Regards,
Hideo Yamauchi.
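The cluster configuration for this test isn't shown, but if main_rsc2 is ordered after the pingd clone as a whole, a stop of any pingd instance can ripple to other nodes. One hedged suggestion (the clone-max value below is assumed from the three-node setup): setting interleave="true" on the clone makes ordering constraints against it apply per-node, which often avoids exactly this kind of cross-node restart:

```
clone clnPingd prmPingd \
    meta clone-max="3" interleave="true"
```

Without interleave, a dependent resource waits on (and is restarted with) the whole clone set rather than just the local instance.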
[Pacemaker] updating resource attributes
What I'm looking for is a way to pass parameters to my resource's stop operation. My first attempt was to set the parameter with crm_resource and then stop the resource:

    1) crm_resource --resource myres --set-parameter myparam --parameter-value myvalue
    2) crm_resource --resource myres --set-parameter target-role --meta --parameter-value Stopped

Unfortunately, step 1 results in the resource being restarted in order to update the agent. As this resource takes time to stop and start, this is not a good design for me.

A friend suggested defining another resource with null start and stop operations and putting the params there; however, I have two objections:
1. the params would no longer be instance-specific
2. it is more difficult to access the values, i.e. instance params come in via the environment

My first choice would be to disable this restart-on-parameter-change behavior of Pacemaker. Does anyone have suggestions?

Alan
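If I remember correctly, Pacemaker can invoke the agent's reload action instead of a full restart when the changed parameter is declared with unique="0" in the agent metadata and the agent advertises a reload action; whether and how this applies to your version is worth verifying. A sketch of the relevant metadata fragments (names hypothetical):

```
<parameter name="myparam" unique="0">
  <shortdesc lang="en">Tunable consumed by the stop operation</shortdesc>
  <content type="string"/>
</parameter>

<actions>
  <action name="reload" timeout="20" />
</actions>
```

The reload action would then re-read the instance parameters from the environment without a stop/start cycle.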
[Pacemaker] lrmd: WARN: G_SIG_dispatch: Dispatch function for S 1000 ms (> 100 ms) before being called
Hi,

A little more than a month ago I posted about the subject line warning and was told that they were harmless unless very frequent. They are now popping up more than 10 times a day. I was asked to create a bug report if I wanted more info, so now I have an hb_report ready to go. Excuse the naive question, but where/how do I submit it?

thanks,
jf
Re: [Pacemaker] IPaddr2 Netmask Bug Fix Issue
Hi,

On Wed, Mar 30, 2011 at 09:26:49AM +0100, darren.mans...@opengi.co.uk wrote:
> From: Pavel Levshin [mailto:pa...@levshin.spb.ru]
> Sent: 25 March 2011 19:50
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] IPaddr2 Netmask Bug Fix Issue
>
> 25.03.2011 18:47, darren.mans...@opengi.co.uk:
>
> > We configure a virtual IP on the non-arping lo interface of both servers
> > and then configure the IPaddr2 resource with lvs_support=true. This RA
> > will remove the duplicate IP from the lo interface when it becomes
> > active. By grouping the VIP with ldirectord/LVS we can have the
> > load-balancer and VIP on one node, balancing traffic to the other node,
> > with failover where both resources fail over together.
> >
> > To do this we need to configure the VIP on lo with a 32-bit netmask, but
> > the VIP on the eth0 interface needs to have a 24-bit netmask. This has
> > worked fine up until now and we base all of our clusters on this method.
> > Now what happens is that the find_interface() routine in IPaddr2 doesn't
> > remove the IP from lo when starting the VIP resource, as it can't find
> > it due to the netmask not matching.

Can you please open a bugzilla and attach an hb_report?

Thanks,
Dejan

> Do you really need the address to be deleted from lo? Having two
> identical addresses on a Linux machine should not do any harm, if routing
> is not affected. In your case, with a /32 netmask on lo, I do not foresee
> any problems.
>
> We use it in this way, i.e. with the address set on lo permanently.
>
> --
> Pavel Levshin
>
> Thanks Pavel,
>
> However, this means I would have to disable LVS support for the resource,
> which means that to make it work with LVS I have to set lvs_support to
> false. Of course, I'll do whatever it takes on my setup to make it work,
> but it's not intuitive for other users.
>
> Regards,
> Darren Mansell
[Pacemaker] Pacemaker warm-up latency
Dear all,

I have a similar question to the one at http://oss.clusterlabs.org/pipermail/pacemaker/2011-March/009750.html.

At the moment I start corosync and pacemaker, the monitoring status from crm_mon is:

    Last updated: Wed Mar 30 21:51:49 2011
    Stack: openais
    Current DC: NONE
    2 Nodes configured, 2 expected votes
    4 Resources configured.

    Node alpha1: OFFLINE
    Node alpha2: OFFLINE

After around 1 min. the monitor status becomes:

    Last updated: Wed Mar 30 21:52:54 2011
    Stack: openais
    Current DC: alpha1 - partition with quorum
    Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
    2 Nodes configured, 2 expected votes
    4 Resources configured.

    Node alpha1: online
        Disk:0 (ocf::linbit:drbd) Master
        ClusterIP (ocf::heartbeat:IPaddr2) Started
        FS (ocf::heartbeat:Filesystem) Started
        WebSite (ocf::heartbeat:apache) Started
    Node alpha2: online
        Disk:1 (ocf::linbit:drbd) Slave

My questions are:
1. Why does the pacemaker warm-up take some time? Is it controlled by a configuration value?
2. During the warm-up time, how can I detect that the node status has become online? Should I use a shell script to parse the crm_mon result periodically until the "online" sub-string appears?

Thanks for your help in advance,
Chia-Feng Kang

This email may contain confidential information. Please do not use or disclose it in any way and delete it if you are not the intended recipient.
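On question 2, periodically parsing crm_mon output is indeed the usual low-tech approach. A minimal sketch (node names taken from the status output above; in a real script the canned sample would be replaced by status="$(crm_mon -1)" inside a polling loop):

```shell
#!/bin/sh
# Succeed when the given crm_mon output reports the node as online.
node_is_online() {
    # $1 = node name, $2 = captured output of `crm_mon -1`
    printf '%s\n' "$2" | grep -q "Node $1.*online"
}

# Canned sample standing in for: status="$(crm_mon -1)"
status="Node alpha1: online
Node alpha2: OFFLINE"

node_is_online alpha1 "$status" && echo "alpha1 is online"
node_is_online alpha2 "$status" || echo "alpha2 is not online yet"
```

Note the grep is case-sensitive on purpose: crm_mon prints "OFFLINE" in upper case, so it does not false-match the lowercase "online" pattern.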
Re: [Pacemaker] Software-only STONITH device
Hi Dejan,

Based on the information you provided, I also studied the documentation for IBM RSA (http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-4ZVQKY) and HP iLO (http://h2.www2.hp.com/bc/docs/support/SupportManual/c00553302/c00553302.pdf). It seems that the minimum number of nodes required in a STONITH-enabled cluster is 3.

Thank you for your help.

Chia-Feng Kang

-----Original Message-----
From: c...@itri.org.tw [mailto:c...@itri.org.tw]
Sent: Wednesday, March 30, 2011 10:11 AM
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Software-only STONITH device

Hello,

I learned about the IBM RSA architecture from http://www.opengear.com/SP-IBM.html. Moreover, I also plan to study external/ipmi and ibmhmc at the same time.

Assume I have a two-node cluster, and each node is equipped with more than one Ethernet interface. Can I use external/ipmi and ibmhmc to set up a STONITH-enabled cluster? Are these connections internally independent (i.e. the Ethernet interfaces of one node can't communicate with each other) for out-of-band management (http://en.wikipedia.org/wiki/Out-of-band)?

Thanks for your help again.

Chia-Feng Kang

PS: In the IBM HMC redbook at http://publib-b.boulder.ibm.com/redbooks.nsf/RedbookAbstracts/SG247038.html, it seems that a serial connection is still required if Figure 1-11 is referenced.

-----Original Message-----
From: Dejan Muhamedagic [mailto:deja...@fastmail.fm]
Sent: Tuesday, March 29, 2011 9:04 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Software-only STONITH device

Hi,

On Tue, Mar 29, 2011 at 08:45:13PM +0800, c...@itri.org.tw wrote:
> Dear all,
>
> As a beginner to Fencing/STONITH implementation in Linux HA, I am trying
> to set up a two-node STONITH-enabled cluster environment.
>
> The STONITH devices supported in my environment are listed below:
>
> apcmaster
> apcmastersnmp
> apcsmart
> baytech
> bladehpi
> cyclades
> external/drac5
> external/dracmc-telnet
> external/hmchttp
> external/ibmrsa
> external/ibmrsa-telnet
> external/ipmi
> external/ippower9258
> external/kdumpcheck
> external/rackpdu
> external/riloe
> external/sbd
> external/vmware
> external/xen0
> external/xen0-ha
> ibmhmc
> ipmilan
> meatware
> nw_rpc100s
> rcd_serial
> rps10
> suicide
> wti_mpc
> wti_nps
>
> Is there any software-only STONITH device among them, i.e. one that works
> without an additional quorum node (please let me know the appropriate
> term, because I think the noun is not clear)?

No. The closest is rcd_serial, for which you need to build a special serial cable. Otherwise, I strongly advise obtaining either computers with some lights-out device (iLO or IBM RSA or similar) or a PDU/UPS. Take a look at http://www.clusterlabs.org/doc/crm_fencing.html for more details.

Thanks,
Dejan

> Thanks in advance.
>
> Chia-Feng Kang

This email may contain confidential information. Please do not use or disclose it in any way and delete it if you are not the intended recipient.
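For what it's worth, with a lights-out device such as external/ipmi the fencing hardware is each node's own BMC, so each node only needs a STONITH resource that can fence the other node; a third node is not inherently required for the fencing itself. A sketch for one direction (addresses and credentials hypothetical; parameter names as I recall them for the external/ipmi plugin):

```
primitive st-node1 stonith:external/ipmi \
    params hostname="node1" ipaddr="192.168.10.101" \
           userid="admin" passwd="secret" interface="lan"
location st-node1-not-on-node1 st-node1 -inf: node1
```

A mirror-image primitive for node2, constrained away from node2, completes the pair. Quorum policy in a two-node cluster is a separate question from fencing.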
Re: [Pacemaker] How to send email notification on failure of a resource in the cluster framework
On Mar 29, 2011, at 11:34 PM, Michael Schwartzkopff wrote:

>>>>> On Mar 24, 2011, at 12:46 AM, Rakesh K wrote:
>>>>>> Hi All,
>>>>>> Is there any way to send email notifications when a resource fails
>>>>>> in the cluster framework? While I was going through the Pacemaker
>>>>>> Explained document provided on the website www.clusterlabs.org,
>>>>>> there was no content in Chapter 7, which is about sending email
>>>>>> notification of events. Can anybody help me regarding this? For now
>>>>>> I am using crm_mon --daemonize --as-html <file> to maintain the
>>>>>> status of HA in an HTML file. Is there any other approach for
>>>>>> sending email notifications?
>>>>>
>>>>> Last time I checked, crm_mon is not well suited for this purpose.
>>>>> crm_mon has the following option:
>>>>>
>>>>>     -T, --mail-to=value
>>>>>         Send Mail alerts to this user. See also --mail-from,
>>>>>         --mail-host, --mail-prefix
>>>>>
>>>>> But you will end up with an obscene amount of e-mail; I was blocked
>>>>> from gmail when I tried to use it once :) For one resource failure
>>>>> you will get 4 e-mails: monitor, stop, start, monitor. Now imagine if
>>>>> it was the most significant member of a group or, worse, a node
>>>>> failure... nagios would be better suited for this purpose, but,
>>>>> unfortunately, crm_mon has been broken
>>>>> (http://developerbugs.linux-foundation.org/show_bug.cgi?id=2344) for
>>>>> quite a while. The fix is going to have to come from the community; I
>>>>> don't have any knowledge of nagios. I am yet to find a good
>>>>> monitoring solution for pacemaker; hopefully somebody has had more
>>>>> success and will share.
>>>>
>>>> Use SNMP. It is the standard protocol for monitoring. Add an "extend"
>>>> line to your snmpd.conf to call a script that returns the number of
>>>> failcounts. You can easily monitor this with every NMS. For nagios,
>>>> use check_snmp.
>>>
>>> I'm afraid it won't be able to tell more than "stuff happened" :(
>>> Would it?
>>
>> Yes. Like a good NMS always does. To analyse the error you still have to
>> read the logs yourself.

What I meant was, I can't see how one "extend" line will be able to supply specifics about exactly which resource has failed. Would you kindly share an example?

I was trying to integrate crm_mon with the SNMP Trap Translator (snmptt), but haven't had luck with it either; I posted details in another thread. The lack of an "out-of-the-box" monitoring solution for pacemaker is a major deficiency in my daily use, and I am sure I am not alone. Maybe it's out there, but Chapter 7 of "Pacemaker Explained" is yet to be written.

Thanks,
Vadym
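To make the "extend" suggestion concrete, here is a hedged sketch. snmpd.conf would carry a line like `extend failcounts /usr/local/bin/failcounts.sh` (script name and path hypothetical), and the script would boil crm_mon's migration summary down to a number an NMS can set a threshold on. The canned sample below stands in for the output of `crm_mon -1 -f`:

```shell
#!/bin/sh
# Count fail-count entries in crm_mon output (one line per failing resource).
count_failures() {
    printf '%s\n' "$1" | grep -c 'fail-count='
}

# In failcounts.sh this would be: sample="$(crm_mon -1 -f)"
sample="Migration summary:
* Node srv01:
    prmDummy1:0: migration-threshold=1 fail-count=1
* Node srv02:"

count_failures "$sample"
```

As noted above, this only tells you how many failures there are, not which resource failed; printing the matching lines (grep instead of grep -c) in a second extend script would at least name the failing instances.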
Re: [Pacemaker] IPaddr2 Netmask Bug Fix Issue
From: Pavel Levshin [mailto:pa...@levshin.spb.ru]
Sent: 25 March 2011 19:50
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] IPaddr2 Netmask Bug Fix Issue

25.03.2011 18:47, darren.mans...@opengi.co.uk:

> We configure a virtual IP on the non-arping lo interface of both servers
> and then configure the IPaddr2 resource with lvs_support=true. This RA
> will remove the duplicate IP from the lo interface when it becomes
> active. By grouping the VIP with ldirectord/LVS we can have the
> load-balancer and VIP on one node, balancing traffic to the other node,
> with failover where both resources fail over together.
>
> To do this we need to configure the VIP on lo with a 32-bit netmask, but
> the VIP on the eth0 interface needs to have a 24-bit netmask. This has
> worked fine up until now and we base all of our clusters on this method.
> Now what happens is that the find_interface() routine in IPaddr2 doesn't
> remove the IP from lo when starting the VIP resource, as it can't find
> it due to the netmask not matching.

Do you really need the address to be deleted from lo? Having two identical addresses on a Linux machine should not do any harm, if routing is not affected. In your case, with a /32 netmask on lo, I do not foresee any problems.

We use it in this way, i.e. with the address set on lo permanently.

--
Pavel Levshin

Thanks Pavel,

However, this means I would have to disable LVS support for the resource, which means that to make it work with LVS I have to set lvs_support to false. Of course, I'll do whatever it takes on my setup to make it work, but it's not intuitive for other users.

Regards,
Darren Mansell
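For readers following along, the setup being described looks roughly like this (the address and interface names are hypothetical):

```
# On both nodes, outside the cluster, the VIP pinned to lo with a host mask:
#   ip addr add 10.0.0.100/32 dev lo

# The cluster resource for the same VIP on eth0, with the subnet's mask:
primitive vip ocf:heartbeat:IPaddr2 \
    params ip="10.0.0.100" cidr_netmask="24" nic="eth0" lvs_support="true" \
    op monitor interval="10s"
```

With lvs_support=true, the RA is expected to remove the /32 from lo when the resource starts on a node; the netmask mismatch in find_interface() is what breaks that step.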