Re: [Pacemaker] [Problem]Time-out(action lost) of completed monitor occurs.
Hi Andrew,

> Ok, I've recreated as http://bugs.clusterlabs.org/show_bug.cgi?id=5001

All right. Thanks!

Hideo Yamauchi.

> On Mon, Sep 26, 2011 at 6:27 PM, wrote:
> > Hi Andrew,
> >
> > Thank you for the comment.
> >
> >> Which still appears to be down :-(
> >> Do you have the tarball still?
> >
> > It may not be exactly the same as what I attached to Bugzilla.
> > I am sending the log and pe-file again.
> >  * 1655.tar.gz
> >  * https://skydrive.live.com/?cid=3a14d57622c66876&id=3A14D57622C66876%21117
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> > --- On Mon, 2011/9/26, Andrew Beekhof wrote:
> >
> >> On Tue, Sep 6, 2011 at 12:53 PM, wrote:
> >> > Hi All,
> >> >
> >> > We came across a mysterious phenomenon while testing a drbd environment.
> >> >
> >> > The procedure is as follows.
> >> >
> >> > Step 1) Start two nodes.
> >> >
> >> > Step 2) Cause the kernel to hang on the active node.
> >> >
> >> > Step 3) On the standby node, the drbd monitor is cancelled.
> >> >
> >> > The cancellation of the drbd monitor completes, but a timer still expires.
> >> >
> >> > Because the cancellation completed, it should have stopped the timer.
> >> >
> >>
> >> [snip]
> >>
> >> >
> >> > I registered this problem with Bugzilla.
> >> >  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2639
> >>
> >> Which still appears to be down :-(
> >> Do you have the tarball still?

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] [Problem]Time-out(action lost) of completed monitor occurs.
Ok, I've recreated as http://bugs.clusterlabs.org/show_bug.cgi?id=5001

On Mon, Sep 26, 2011 at 6:27 PM, wrote:
> Hi Andrew,
>
> Thank you for the comment.
>
>> Which still appears to be down :-(
>> Do you have the tarball still?
>
> It may not be exactly the same as what I attached to Bugzilla.
> I am sending the log and pe-file again.
>  * 1655.tar.gz
>  * https://skydrive.live.com/?cid=3a14d57622c66876&id=3A14D57622C66876%21117
>
> Best Regards,
> Hideo Yamauchi.
>
> --- On Mon, 2011/9/26, Andrew Beekhof wrote:
>
>> On Tue, Sep 6, 2011 at 12:53 PM, wrote:
>> > Hi All,
>> >
>> > We came across a mysterious phenomenon while testing a drbd environment.
>> >
>> > The procedure is as follows.
>> >
>> > Step 1) Start two nodes.
>> >
>> > Step 2) Cause the kernel to hang on the active node.
>> >
>> > Step 3) On the standby node, the drbd monitor is cancelled.
>> >
>> > The cancellation of the drbd monitor completes, but a timer still expires.
>> >
>> > Because the cancellation completed, it should have stopped the timer.
>> >
>>
>> [snip]
>>
>> >
>> > I registered this problem with Bugzilla.
>> >  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2639
>>
>> Which still appears to be down :-(
>> Do you have the tarball still?
Re: [Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue
On Fri, Oct 7, 2011 at 8:40 PM, Proskurin Kirill wrote:
> On 10/07/2011 02:13 AM, Andrew Beekhof wrote:
>> On Thu, Oct 6, 2011 at 2:47 AM, Proskurin Kirill wrote:
>>> On 10/05/2011 04:19 AM, Andrew Beekhof wrote:
>>>> On Mon, Oct 3, 2011 at 5:50 PM, Proskurin Kirill wrote:
>>>>> On 10/03/2011 05:32 AM, Andrew Beekhof wrote:
>>>>>>> corosync-1.4.1
>>>>>>> pacemaker-1.1.5
>>>>>>> pacemaker runs with "ver: 1"
>>>>>>>
>>>>>>> 2)
>>>>>>> This one is scary.
>>>>>>> Twice I ran into a situation where pacemaker thinks a resource is
>>>>>>> started but it is not.
>>>>>>
>>>>>> RA is misbehaving. Pacemaker will only consider a resource running if
>>>>>> the RA tells us it is (running or in a failed state).
>>>>>
>>>>> But you can see below that the agent returns "7".
>>>>
>>>> It's still broken. Not one stop action succeeds.
>>>>
>>>> Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop process (PID 4082) timed out (try 1). Killing with signal SIGTERM (15).
>>>> Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop process (PID 21859) timed out (try 1). Killing with signal SIGTERM (15).
>>>> Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN: tranprocessor:stop process (PID 24576) timed out (try 1). Killing with signal SIGTERM (15).
>>>>
>>>> /That/ is why pacemaker thinks it's still running.
>>>
>>> I made an experiment.
>>>
>>> I created a script that doesn't die on SIGTERM:
>>>
>>> #!/usr/bin/perl
>>> $SIG{TERM} = "IGNORE"; sleep 1 while 1
>>>
>>> And ran it under pacemaker.
>>> I ran 3 tests:
>>> 1) primitive test-kill-15.pl ocf:mail.ru:generic \
>>>      op monitor interval="20" timeout="5" on-fail="restart" \
>>>      params binfile="/tmp/test-kill-15.pl" external_pidfile="1"
>>>
>>> 2) Same but on-fail=block
>>>
>>> 3) Same but with meatware stonith.
>>>
>>> Each time I do:
>>> crm resource stop test-kill-15.pl
>>>
>>> And in cases 1 and 2 I get "unmanaged" on this resource.

Because you've not configured any fencing devices.

>>> In case 3 I get a stonith situation.

Because now there is something the cluster can do to try and automate
recovery when the stop operation fails.

>>
>> I can't comment based on only a partial config.
>
> Sorry for that. I attached the full crm config & logs of that day.
> The resource is called test-kill-15.pl
>
> --
> Best regards,
> Proskurin Kirill
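
For readers hitting the same thing: the usual way to keep a stop action from timing out on a process that ignores SIGTERM is to escalate to SIGKILL inside the agent. The following is only a sketch of that idea. The function name stop_with_escalation and the grace period are made up for illustration and are not taken from the ocf:mail.ru:generic agent discussed above.

```shell
#!/bin/sh
# Hypothetical helper (assumption, not part of any shipped agent):
# send SIGTERM, wait up to GRACE seconds, then fall back to SIGKILL.
# usage: stop_with_escalation PID [GRACE]
stop_with_escalation() {
    pid=$1
    grace=${2:-10}

    # Already gone: a stop on a stopped resource counts as success.
    kill -0 "$pid" 2>/dev/null || return 0

    kill -TERM "$pid" 2>/dev/null || return 0
    waited=0
    while [ "$waited" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        waited=$((waited + 1))
    done

    # SIGTERM was ignored; SIGKILL cannot be caught or ignored.
    kill -KILL "$pid" 2>/dev/null || return 0
    sleep 1
    if kill -0 "$pid" 2>/dev/null; then
        return 1    # still alive: report the stop as failed
    fi
    return 0
}
```

An agent's stop action would call this with the PID from its pidfile and return the result (0 is OCF_SUCCESS, 1 is OCF_ERR_GENERIC). The grace period should be comfortably shorter than the stop timeout configured in the CIB, otherwise lrmd still gives up first.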
Re: [Pacemaker] [PATCH] This is an alternate fix for Bug #2528 based on a patch to the
This part:

+    } else if (hash_entry->timer_id != 0) {
+        crm_debug_2("Update already scheduled");
+        return;

is definitely wrong.
Subsequent changes to an attribute value are intended to reset the timer.

On Thu, Sep 15, 2011 at 7:50 AM, Rainer Weikusat wrote:
> # HG changeset patch
> # User Rainer Weikusat
> # Date 1316036167 -3600
> # Branch stable-1.0
> # Node ID ea611ef8c1e6a9d294d9d0dff6db2f317232292b
> # Parent a15ead49e20f047e129882619ed075a65c1ebdfe
> This is an alternate fix for Bug #2528 based on a patch to the
> Debian Squeeze pacemaker package used to provide 'high availability'
> for the product I'm presently paid to work on. As opposed to the
> change documented at
>
> http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/76bd1e3370b8
>
> it doesn't test equality of value and hash_entry->value twice in
> order to determine if an identical update was already scheduled
> and it avoids 'recalculating' hash_entry->value if its value is
> still current.
>
> diff -r a15ead49e20f -r ea611ef8c1e6 tools/attrd.c
> --- a/tools/attrd.c Thu Aug 25 16:49:59 2011 +1000
> +++ b/tools/attrd.c Wed Sep 14 22:36:07 2011 +0100
> @@ -764,49 +764,47 @@
>
>      crm_debug("Supplied: %s, Current: %s, Stored: %s",
>                value, hash_entry->value, hash_entry->stored_value);
> -
> -    if(safe_str_eq(value, hash_entry->value)
> -       && safe_str_eq(value, hash_entry->stored_value)) {
> -        crm_debug_2("Ignoring non-change");
> -        return;
>
> -    } else if(value) {
> -        int offset = 1;
> -        int int_value = 0;
> -        int value_len = strlen(value);
> -        if(value_len < (plus_plus_len + 2)
> -           || value[plus_plus_len] != '+'
> -           || (value[plus_plus_len+1] != '+' && value[plus_plus_len+1] != '=')) {
> -            goto set_unexpanded;
> -        }
> +    if (!safe_str_eq(value, hash_entry->value)) {
> +        if (value) {
> +            int offset = 1;
> +            int int_value = 0;
> +            int value_len = strlen(value);
> +            if(value_len < (plus_plus_len + 2)
> +               || value[plus_plus_len] != '+'
> +               || (value[plus_plus_len+1] != '+' && value[plus_plus_len+1] != '=')) {
> +                goto set_unexpanded;
> +            }
>
> -        int_value = char2score(hash_entry->value);
> -        if(value[plus_plus_len+1] != '+') {
> -            const char *offset_s = value+(plus_plus_len+2);
> -            offset = char2score(offset_s);
> -        }
> -        int_value += offset;
> +            int_value = char2score(hash_entry->value);
> +            if(value[plus_plus_len+1] != '+') {
> +                const char *offset_s = value+(plus_plus_len+2);
> +                offset = char2score(offset_s);
> +            }
> +            int_value += offset;
> +
> +            if(int_value > INFINITY) {
> +                int_value = INFINITY;
> +            }
>
> -        if(int_value > INFINITY) {
> -            int_value = INFINITY;
> -        }
> -
> -        crm_info("Expanded %s=%s to %d", attr, value, int_value);
> -        crm_xml_add_int(msg, F_ATTRD_VALUE, int_value);
> -        value = crm_element_value(msg, F_ATTRD_VALUE);
> -    }
> -
> -  set_unexpanded:
> -    if(safe_str_eq(value, hash_entry->value) && hash_entry->timer_id) {
> -        /* We're already waiting to set this value */
> -        return;
> -    }
> -
> -    crm_free(hash_entry->value);
> -    hash_entry->value = NULL;
> -    if(value != NULL) {
> -        hash_entry->value = crm_strdup(value);
> -        crm_debug("New value of %s is %s", attr, value);
> +            crm_info("Expanded %s=%s to %d", attr, value, int_value);
> +            crm_xml_add_int(msg, F_ATTRD_VALUE, int_value);
> +            value = crm_element_value(msg, F_ATTRD_VALUE);
> +        }
> +
> +    set_unexpanded:
> +        crm_free(hash_entry->value);
> +        hash_entry->value = NULL;
> +        if(value != NULL) {
> +            hash_entry->value = crm_strdup(value);
> +            crm_debug("New value of %s is %s", attr, value);
> +        }
> +    } else if (safe_str_eq(value, hash_entry->stored_value)) {
> +        crm_debug_2("Ignoring non-change");
> +        return;
> +    } else if (hash_entry->timer_id != 0) {
> +        crm_debug_2("Update already scheduled");
> +        return;
>      }
>
>      stop_attrd_timer(hash_entry);
Re: [Pacemaker] primary does not run alone
Hi, Lars, everybody

(2011/10/11 17:24), Lars Ellenberg wrote:
> DRBD has fencing policies (fencing resource-and-stonith, for example),
> which, if configured, cause it to call fencing handlers
> (handler { fence-peer }) when appropriate.
>
> There are various fence-peer handlers.
> One is the "drbd-peer-outdater",
> which needs dopd, which at this point depends on the heartbeat
> communication layer.

Yes, but one problem is that heartbeat or the crm does not get the status
of drbd correctly.

These are the versions on my system, maybe old.

 drbd83-8.3.8-1.el5
 heartbeat-3.0.5-1.1.el5
 pacemaker-1.0.11-1.2.el5
 resource-agents-3.9.2-1.1.el5
 centos5.6

I checked some variables in the /usr/lib/ocf/resource.d/linbit/drbd script
during shutdown.

In drbd_status() or maybe_outdate_self(), drbd recognizes both roles
(local and remote) correctly.
$DRBD_ROLE_LOCAL and $DRBD_ROLE_REMOTE show the roles "Secondary" or "Unknown".

But $OCF_RESKEY_CRM_meta_notify_master_uname or
$OCF_RESKEY_CRM_meta_notify_promote_uname still show the hostname which
was primary.
So, it writes "outdate" to local.

I do not understand why the $OCF_RESKEY... variables are needed.
I think it's enough to check only the $DRBD_ROLE... variables.
In the newer version, are the $OCF_RESKEY... variables ignored? Or corrected?

Thanks,
Nickey
Re: [Pacemaker] Mysql Replication Problem with HA processes
On Fri, Oct 7, 2011 at 7:34 PM, raki wrote:
> Hi Andrew
>
> We had installed Heartbeat/Pacemaker on two Unix (Centos 5.0.2) machines.
> The RPMs we had used for installing Heartbeat and Pacemaker are
>
>  heartbeat-3.0.3-2.3.el5.i386.rpm
>  pacemaker-1.0.9.1-1.15.el5.i386.rpm
>
> Please find the crm (cluster resource manager) order and co-location
> constraints we had used.
>
>  colocation Httpd-with-Mysql inf: HttpdVIP MS_Mysql:Master
>  colocation Httpd-with-ip inf: HttpdVIP Httpd
>  colocation Mysql-with-Tomcat inf: Tomcat1 MS_Mysql:Master
>  colocation Tomcat-with-HttpdVIP inf: Tomcat1 HttpdVIP
>  order Httdp-after-HttpdVIP inf: HttpdVIP Httpd
>  order Httdp-after-tomcat1 inf: Httpd Tomcat1
>  order MYSQL-after-HttpdVIP inf: MS_Mysql HttpdVIP
>
> We had two nodes running HA processes (a cluster with two nodes).
> In one of our test scenarios we tried to restart the node where the
> Mysql master is running. Based on the above configuration the HA
> processes restart and the Mysql process on the other node takes over
> the master responsibility.
> And we saw that Mysql replication stops working, and in the slave status
> we found this error: duplicate entry for the key values.
> Please help us regarding this.
>
> Error Description 'Duplicate entry '2083' for key 1' on query. Default
> database: 'MSF_DB'.

I'm not really skilled in the workings of mysql replication.
Perhaps the author of the RA can comment?

> I can also provide whatever information is required.
> Waiting for your response on this.
> Rakesh
Re: [Pacemaker] Possible bug with mandatory ordering involving stateful (i.e. master-slave) resources
On Fri, Oct 7, 2011 at 2:05 AM, King, Christopher wrote:
> Possible bug with mandatory ordering involving stateful (i.e. master-slave)
> resources
>
> I have a 2-node cluster (we are running the SLES 11 HA extension, so the
> pacemaker version is 1.1.2) in which a master-slave resource is dependent on
> a clone resource via a mandatory ordering constraint. From “crm configure
> show”:
>
> primitive dummy ocf:heartbeat:Dummy \
>         op monitor interval="15s" \
>         op start interval="0" timeout="40s" \
>         op stop interval="0" timeout="60s"
>
> primitive statefuldummy ocf:heartbeat:Stateful \
>         op start timeout="1800s" \
>         op timeout="45s" \
>         op monitor interval="10s" timeout="60s" \
>         op promote timeout="45s" \
>         op demote timeout="30s"
>
> ms dummy-ms statefuldummy \
>         meta target-role="Started" master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="false" ordered="false" globally-unique="false" is-managed="true"
>
> clone dummy-clone dummy \
>         meta target-role="Started"
>
> order dummy-order inf: dummy-clone dummy-ms
>
> (I reproduced the problem we are experiencing with dummy resources to try
> and eliminate the RAs for our real resources as the source of the issue.)
>
> The order of events is as follows:
>
> 1) Force a shutdown of the dummy-clone via “crm resource stop dummy-clone”
>
> 2) Logs show that the crm stops both the master and slave statefuldummy
>    resources of the dummy-ms. Good.
>
> 3) Logs show that the crm stops the dummy-clone resources. Good.
>
> 4) Logs immediately show that the crm starts the master and slave
>    statefuldummy resources of the dummy-ms. Bad.
>
> 5) Logs show the crm stopping the statefuldummy resources again. Good?
>
> Has anyone seen something similar? My understanding of the ordering
> constraints tells me that event #4 is erroneous behaviour.

Correct.
Since you're a SLES customer, I'd advise you to contact SUSE directly -
they should be able to give it the proper attention and escalate
upstream if it's not already fixed.

> I would not
> expect the statefuldummy resources to be restarted until a “crm resource
> start dummy-clone” command is issued. If I have other types of resources
> dependent on the clone, such as another clone or a group, they behave as I
> would expect. It seems to be only with master-slave resources that the crm
> tries to start the resource inappropriately.
>
> In our real cluster, the master-slave returns an error (OCF_ERR_GENERIC)
> when it is started while its prerequisite resource is not started. In this
> case, event #5 does not happen, and the master-slave is never again
> restarted, even after the prerequisite clone resource is restarted via “crm
> resource start ”.
>
> Thanks for your help,
>
> Chris King
Re: [Pacemaker] Debian Unstable (sid) Problem with Pacemaker/Corosync Apache HA-Load Balanced cluster
I'd be checking your apache logs, my guess is that it doesn't like the config.
Or see where/why the Apache RA could be returning 1.

On Mon, Oct 3, 2011 at 5:58 PM, Miltiadis Koutsokeras wrote:
> Hi again,
>
> I have gathered all interesting config and log files to a single archive.
> See the attachment. Thanks in advance for any help/advice.
>
> Miltos
>
> On 10/02/2011 06:19 PM, Miltiadis Koutsokeras wrote:
>> Hi Nick,
>>
>> Here is the output of the "crm configure show":
>>
>> node node-0
>> node node-1
>> primitive Apache2 ocf:heartbeat:apache \
>>         params configfile="/etc/apache2/apache2.conf" \
>>         op monitor interval="1min" \
>>         meta target-role="Started"
>> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>>         params ip="192.168.0.100" cidr_netmask="32" \
>>         op monitor interval="30s" \
>>         meta target-role="Started"
>> colocation Apache2-ClusterIP-colocation inf: Apache2 ClusterIP
>> order Apache2-after-ClusterIP inf: ClusterIP Apache2
>> property $id="cib-bootstrap-options" \
>>         dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>>         cluster-infrastructure="openais" \
>>         expected-quorum-votes="2" \
>>         stonith-enabled="false" \
>>         no-quorum-policy="ignore"
>> rsc_defaults $id="rsc-options" \
>>         resource-stickiness="100"
>>
>> If you wish anything else, please feel free to ask.
>>
>> On 10/01/2011 02:50 PM, Nick Khamis wrote:
>>> Can you post your crm please.
>>>
>>> Nick.
>>>
>>> On Sat, Oct 1, 2011 at 6:32 AM, Miltiadis Koutsokeras wrote:
>>>> Hello everyone,
>>>>
>>>> My goal is to build a Round Robin balanced, HA Apache Web server
>>>> cluster. The main purpose is to balance HTTP requests evenly between
>>>> the nodes and have one machine pick up all requests if and ONLY if the
>>>> others are not available at the moment. The cluster will be accessible
>>>> only from the internal network. Any advice on this will be highly
>>>> appreciated (resources to use, services to install and configure etc.).
>>>> After walking through the ClusterLabs documentation, I think the proper
>>>> deployment is an active/active Pacemaker managed cluster.
>>>>
>>>> I'm trying to follow the "Cluster from scratch" article in order to
>>>> build a 2 node cluster on an experimental setup:
>>>>
>>>> 2 GNU/Linux Debian Unstable (sid) Virtual Machines (Kernel
>>>> 3.0.0-1-686-pae, Apache/2.2.21 (Debian)) on the same LAN network.
>>>>
>>>> node-0 IP: 192.168.0.101
>>>> node-1 IP: 192.168.0.102
>>>> Desired Cluster Virtual IP: 192.168.0.100
>>>>
>>>> The two nodes are set up to communicate with proper SSH keys and it
>>>> works flawlessly. Also they can communicate with short names:
>>>>
>>>> root@node-0:~# ssh node-1 -- hostname
>>>> node-1
>>>>
>>>> root@node-1:~# ssh node-0 -- hostname
>>>> node-0
>>>>
>>>> My problem is that although I've reached the part where you have the
>>>> ClusterIP resource set up properly, the Apache resource does not get
>>>> started on either node. The logs do not have a message explaining the
>>>> failure in detail, even with debug messages enabled. All related
>>>> messages report unknown errors while trying to start the service, and
>>>> after a while the cluster manager gives up. From the messages it seems
>>>> like the manager is getting unexpected exit codes from the Apache
>>>> resource. The server-status URL is accessible from 127.0.0.1 on both
>>>> nodes.
>>>>
>>>> root@node-0:~# crm_mon -1
>>>> Last updated: Fri Sep 30 14:04:55 2011
>>>> Stack: openais
>>>> Current DC: node-1 - partition with quorum
>>>> Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>>>> 2 Nodes configured, 2 expected votes
>>>> 2 Resources configured.
>>>>
>>>> Online: [ node-1 node-0 ]
>>>>
>>>> ClusterIP (ocf::heartbeat:IPaddr2): Started node-1
>>>>
>>>> Failed actions:
>>>> Apache2_monitor_0 (node=node-0, call=3, rc=1, status=complete): unknown error
>>>> Apache2_start_0 (node=node-0, call=5, rc=1, status=complete): unknown error
>>>> Apache2_monitor_0 (node=node-1, call=8, rc=1, status=complete): unknown error
>>>> Apache2_start_0 (node=node-1, call=10, rc=1, status=complete): unknown error
>>>>
>>>> Let's check out the logs for this resource:
>>>>
>>>> root@node-0:~# grep ERROR.*Apache2 /var/log/corosync/corosync.log
>>>> (Nothing)
>>>>
>>>> root@node-0:~# grep WARN.*Apache2 /var/log/corosync/corosync.log
>>>> Sep 30 14:04:23 node-0 lrmd: [2555]: WARN: Managed Apache2:monitor process 2802 exited with return code 1.
>>>> Sep 30 14:04:30 node-0 lrmd: [2555]: WARN: Managed Apache2:start process 2942 exited with return code 1.
>>>>
>>>> root@node-1:~# grep ERROR.*Apache2 /var/log/corosync/corosync.log
>>>> Sep 30 14:04:23 node-1 pengine: [1676]: ERROR: native_create_actions: Resource Apache2 (ocf::apache) is active on 2 nodes attempting recovery
>>>>
>>>> root@nod
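
One way to follow the "see where/why the RA could be returning 1" advice is to run the agent by hand, outside the cluster, with the same parameters the CIB would pass. The wrapper below is only a sketch: run_ocf_action is a made-up name, and it assumes the usual OCF convention that instance parameters reach the agent as OCF_RESKEY_<name> environment variables and that OCF_ROOT defaults to /usr/lib/ocf.

```shell
#!/bin/sh
# Hypothetical wrapper (assumption): run one action of an OCF resource
# agent by hand and hand back the agent's exit code.
# usage: run_ocf_action AGENT ACTION [name=value ...]
run_ocf_action() {
    agent=$1
    action=$2
    shift 2
    (
        # Subshell, so the exported variables do not leak into the caller.
        OCF_ROOT=${OCF_ROOT:-/usr/lib/ocf}
        export OCF_ROOT
        for pair in "$@"; do
            # Each instance parameter is passed as OCF_RESKEY_<name>.
            export "OCF_RESKEY_${pair%%=*}=${pair#*=}"
        done
        "$agent" "$action"
    )
}
```

With the configuration quoted above this could be invoked as follows; the agent path is the conventional location for ocf:heartbeat:apache and may differ on a given install:

    run_ocf_action /usr/lib/ocf/resource.d/heartbeat/apache monitor \
        configfile=/etc/apache2/apache2.conf; echo "rc=$?"

Prefixing the agent with "sh -x" is a common way to see which check inside the agent produces the error.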
Re: [Pacemaker] Ignoring expired failure
On Sat, Oct 1, 2011 at 8:14 AM, Proskurin Kirill wrote:
> Hello all.
>
> corosync-1.4.1
> pacemaker-1.1.5
> pacemaker runs with "ver: 1"
>
> I ran into the monitoring failure again and still don't know why it happens.
> Details are here:
> http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg09986.html
>
> Some info:
> Twice I ran into a situation where pacemaker thinks a resource is started
> but it is not. We use a slightly modified version of the "anything" agent
> for our scripts, but they are aware of OCF return codes and other stuff.
>
> I ran the monitor action of our agent from the console:
>
> # env -i ; OCF_ROOT=/usr/lib/ocf OCF_RESKEY_binfile=/usr/local/mpop/bin/my/tranprocessor.pl /usr/lib/ocf/resource.d/mail.ru/generic monitor
> # generic[14992]: DEBUG: default monitor : 7
>
> But this time I see in the logs:
> Oct 01 02:00:12 mysender34.mail.ru pengine: [26301]: notice: unpack_rsc_op: Ignoring expired failure tranprocessor_stop_0 (rc=-2, magic=2:-2;121:690:0:4c16dc39-1fd3-41f2-b582-0236f6b6eccc) on mysender34.mail.ru
>
> So Pacemaker knows that the resource may be down but is ignoring it. Why?

It's not ignoring it, you're preventing Pacemaker from doing anything
about it by having a broken RA (stop action doesn't work) and not
allowing/configuring fencing.
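
As a side note for anyone reading the console check above: the bare "7" printed by the monitor action is the OCF "not running" code. A small lookup like the following makes such checks easier to read. The function name ocf_rc_name is made up for illustration; the numeric values are the standard OCF return codes. Values outside that range, such as the rc=-2 in the pengine log line above, are not OCF codes and are reported here simply as unknown.

```shell
#!/bin/sh
# Hypothetical helper (assumption): translate an OCF exit code into
# its symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "unknown ($1)" ;;
    esac
}
```

Usage after running an agent action by hand:

    /usr/lib/ocf/resource.d/mail.ru/generic monitor; ocf_rc_name $?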
Re: [Pacemaker] gfs2 in the centos 6.0
On Sun, Oct 9, 2011 at 12:23 AM, Viacheslav Biriukov wrote:
> Hi all.
> Now in the pacemaker documentation you propose to use CMAN instead of the
> pacemaker crm for gfs2
> (http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08s02.html).

No, cman instead of Pacemaker's home grown membership and quorum plugin.

> But in the google cache we can find the next link
> - http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08s03.html
> What does this mean?

It means you shouldn't go digging around in google caches :-)

Rightly or wrongly, the decision was made to no longer ship the .pcmk
controld variants in Fedora and RHEL.
At that point, it no longer made much sense to base the document on them.

> Is the pacemaker:controld solution not stable?

It is stable in that it works, and SUSE seems very happy with it.
But it was only ever an intermediate step towards a stack that
exclusively used corosync for membership and quorum.

The only thing that really matters is that Pacemaker and the controlds
get membership/quorum from the same source.
It turned out that adding CMAN support to Pacemaker was the simplest
way to achieve that on the most distros.

> Can we go
> to production with this on Centos 6.0?

Sure.

>
> Tnx
> --
> Viacheslav Biriukov
> BR
> http://biriukov.com
Re: [Pacemaker] Creating dlm_controld.pcmk
On Wed, Oct 12, 2011 at 8:45 AM, Nick Khamis wrote:
> Hello Everyone,
>
> I have compiled the cluster stack from source and am now trying to set up
> ocfs2; however, I noticed that I do not have
> dlm_controld.pcmk set up. I was wondering if someone could shed some
> light on this please? ocfs2, o2cb, dlm
> configsys etc. are all working manually.

You'd need to give us more information about your setup to be able to comment.

>
> Thanks in Advance,
>
> Nick.
[Pacemaker] Creating dlm_controld.pcmk
Hello Everyone,

I have compiled the cluster stack from source and am now trying to set up
ocfs2; however, I noticed that I do not have
dlm_controld.pcmk set up. I was wondering if someone could shed some
light on this please? ocfs2, o2cb, dlm
configsys etc. are all working manually.

Thanks in Advance,

Nick.
Re: [Pacemaker] [Linux-HA] crm_master triggering assert section != NULL
Hi Yves,

this is really a question for the Pacemaker list, so I'm cross-posting
there. Please follow up on that list.

On 2011-10-11 18:33, Yves Trudeau wrote:
> Hi,
> I started to have issues with crm_master with Pacemaker 1.0.11. I
> think I traced it down to the following problem. I know crm_master is
> supposed to be called within the resource script; calling it manually
> helps to illustrate the problem.
>
> root@testvirtbox1:~# /usr/sbin/crm_master -l reboot -v 1000 -r p_MySQL_replication:0
> root@testvirtbox1:~# /usr/local/sbin/crm_master -r 'p_MySQL_Replication:0' -G
> name=master-p_MySQL_Replication:0 value=(null)
> Error performing operation: cib object missing

Er, why do you evidently have two versions of crm_master installed in
two different paths?

> and in daemon.log:
>
> Oct 11 12:17:41 testvirtbox1 crm_attribute: [21986]: info: Invoked: crm_attribute -N testvirtbox1 -n master-p_MySQL_Replication:0 -G
> Oct 11 12:17:41 testvirtbox1 crm_attribute: [21986]: ERROR: crm_abort: read_attr: Triggered assert at cib_attrs.c:297 : section != NULL
>
> while in the cib I found this part:
>
> value="true"/>
> name="master-p_MySQL_replication:0" value="1000"/>
>
> Is this a problem with my CIB or a bug in crm_attribute? Until
> recently, I'm pretty sure this was working correctly; I don't know what
> triggered the problem. crm_verify -L -V returns nothing.

Odd, but the CIB snippet is incomplete and inconclusive. A full
"cibadmin -Q" dump, uploaded to pastebin, would be helpful.

Will you be at Percona Live in London later this month?

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now
Re: [Pacemaker] Postgres RA won't start
I don't have much experience with pacemaker on Debian. I'd also
suggest getting the latest version of the pgsql RA from git, though if your
basic package is too old there could be conflicts.

On Oct 11, 2011 9:11 AM, "Amar Prasovic" wrote:
>
>> >What version of resource-agents package do you use? Old version of pgsql
>> >depended on fuser tool installed, otherwise it could fail with that error
>> >code.
>
> Hello Serge,
>
> thank you for your answer.
>
> I don't have any resource-agents installed. The system is Debian Squeeze
> 6.0.3 and it automatically installed cluster-agents 1.0.3-3.1
>
> When I try to install resource-agents I run into dependency problems:
>
> webnode01 postgresql # apt-get install resource-agents
> Reading package lists... Done
> Building dependency tree
> Reading state information... Done
> Some packages could not be installed. This may mean that you have
> requested an impossible situation or if you are using the unstable
> distribution that some required packages have not yet been created
> or been moved out of Incoming.
> The following information may help to resolve the situation:
>
> The following packages have unmet dependencies:
>  resource-agents : Depends: libplumb2 but it is not going to be installed
>                    Depends: libplumbgpl2 but it is not going to be installed
> E: Broken packages
>
> When I try to install libplumb2, the installation wants to remove
> pacemaker:
>
> webnode01 postgresql # apt-get install libplumb2
> Reading package lists... Done
> Building dependency tree
> Reading state information... Done
> The following packages were automatically installed and are no longer
> required:
>   libsensors4 libsnmp15 libheartbeat2 corosync libnspr4-0d libtimedate-perl
>   libsnmp-base openhpid libcurl3 libssh2-1 lm-sensors libopenhpi2 fancontrol
>   libopenipmi0 libperl5.10 libesmtp5 libcorosync4 libnet1 libnss3-1d
> Use 'apt-get autoremove' to remove them.
> The following extra packages will be installed:
>   libpils2
> The following packages will be REMOVED:
>   cluster-agents cluster-glue libcluster-glue pacemaker
> The following NEW packages will be installed:
>   libpils2 libplumb2
> 0 upgraded, 2 newly installed, 4 to remove and 0 not upgraded.
> Need to get 115 kB of archives.
> After this operation, 5,874 kB disk space will be freed.
> Do you want to continue [Y/n]? n
> Abort.
>
> Can I do something with fuser tools?
Re: [Pacemaker] Postgres RA won't start
On 2011-10-11 17:10, Amar Prasovic wrote:
> > >What version of resource-agents package do you use? Old version of pgsql
> > >depended on fuser tool installed, otherway it could fail with that error
> > >code.
>
> Hello Serge,
>
> thank you for your answer.
>
> I don't have any resource-agents installed. The system is Debian Squeeze
> 6.0.3 and it automatically installed cluster-agents 1.0.3-3.1
>
> When I try to install resource-agents I run into dependency problems:

Yeah, that's a bit awkward. The squeeze package is called cluster-agents, but then it was decided that the package should be named "resource-agents" as on all other platforms, and that's the current name in squeeze-backports.

> webnode01 postgresql # apt-get install resource-agents

Do "apt-get -t squeeze-backports install resource-agents" instead.

Florian

--
Need help with High Availability?
http://www.hastexo.com/now
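The "-t squeeze-backports" option only works if the backports archive is already listed in the APT sources. A minimal sketch of adding it follows, assuming the mirror URL that was commonly documented for squeeze-backports at the time (it may have moved since). add_backports is a hypothetical helper, and the file name under sources.list.d is an arbitrary choice.

```shell
#!/bin/sh
# Append a backports entry to an APT sources file unless it is already there.
# $1: release codename (e.g. squeeze), $2: sources file to append to.
add_backports() {
    release=$1
    target=$2
    # Assumed mirror URL; replace it with whichever mirror you actually use.
    line="deb http://backports.debian.org/debian-backports ${release}-backports main"
    # -x: match the whole line, -F: fixed string, -q: no output.
    grep -qxF "$line" "$target" 2>/dev/null || echo "$line" >> "$target"
}

# Intended usage, as root (left commented out so the sketch has no side effects):
#   add_backports squeeze /etc/apt/sources.list.d/backports.list
#   apt-get update
#   apt-get -t squeeze-backports install resource-agents
```

The helper is idempotent, so running it twice leaves a single entry in the file.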
Re: [Pacemaker] Postgres RA won't start
> >What version of resource-agents package do you use? Old version of pgsql
> >depended on fuser tool installed, otherway it could fail with that error
> >code.

Hello Serge,

thank you for your answer.

I don't have any resource-agents installed. The system is Debian Squeeze 6.0.3 and it automatically installed cluster-agents 1.0.3-3.1

When I try to install resource-agents I run into dependency problems:

webnode01 postgresql # apt-get install resource-agents
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 resource-agents : Depends: libplumb2 but it is not going to be installed
                   Depends: libplumbgpl2 but it is not going to be installed
E: Broken packages

When I try to install libplumb2, the installation wants to remove pacemaker:

webnode01 postgresql # apt-get install libplumb2
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libsensors4 libsnmp15 libheartbeat2 corosync libnspr4-0d libtimedate-perl
  libsnmp-base openhpid libcurl3 libssh2-1 lm-sensors libopenhpi2 fancontrol
  libopenipmi0 libperl5.10 libesmtp5 libcorosync4 libnet1 libnss3-1d
Use 'apt-get autoremove' to remove them.
The following extra packages will be installed:
  libpils2
The following packages will be REMOVED:
  cluster-agents cluster-glue libcluster-glue pacemaker
The following NEW packages will be installed:
  libpils2 libplumb2
0 upgraded, 2 newly installed, 4 to remove and 0 not upgraded.
Need to get 115 kB of archives.
After this operation, 5,874 kB disk space will be freed.
Do you want to continue [Y/n]? n
Abort.

Can I do something with fuser tools?
Re: [Pacemaker] Postgres RA won't start
On 2011-10-11 16:10, Amar Prasovic wrote:
> Hello everyone,
>
> I tried to configure postgres RA and I ran into some problems.
>
> [...]
>
> in crm_mon
> Online: [ webnode02 webnode01 ]
>
>  Master/Slave Set: drbd_cluster
>      Masters: [ webnode01 ]
>      Slaves: [ webnode02 ]
>  Resource Group: cluster_1
>      fs_res (ocf::heartbeat:Filesystem):    Started webnode01
>      ClusterIP (ocf::heartbeat:IPaddr2):    Started webnode01
>      nginx_res (ocf::heartbeat:nginx):      Started webnode01
>      postgres_res (ocf::heartbeat:pgsql):   Stopped
>
> Failed actions:
>     postgres_res_start_0 (node=webnode01, call=84, rc=5,
> status=complete): not installed
>     postgres_res_start_0 (node=webnode02, call=66, rc=5,
> status=complete): not installed

There are just 4 scenarios in which pgsql returns OCF_ERR_INSTALLED:

- The resource agent is not installed or is not executable (unlikely);
- pgctl or psql are not installed or not executable;
- the configuration file does not exist or is not readable during a non-probe;
- the username identified by the "pgdba" resource parameter does not resolve to a uid.

All of those do log error messages to the log though. You can grep for ERROR in your logs, it should turn up what went wrong.

Cheers,
Florian

--
Need help with Pacemaker?
http://www.hastexo.com/now
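The last three scenarios in the list above can be checked by hand before digging through the logs. The following is a minimal sketch, not the resource agent's own code: check_pgsql_prereqs is a hypothetical helper, and the paths and user name you pass in should be the values of the pgctl, psql, config and pgdba parameters of your own primitive (or the agent's defaults, which are not restated here).

```shell
#!/bin/sh
# Print one line per failed precondition; return non-zero if any check failed.
# Arguments: pgctl path, psql path, config file path, pgdba user name.
# Note: the checks run as the calling user, whereas the cluster runs the
# agent as root, so a "not readable" result may differ between the two.
check_pgsql_prereqs() {
    pgctl=$1; psql=$2; config=$3; pgdba=$4
    rc=0
    [ -x "$pgctl" ]  || { echo "pgctl not executable: $pgctl"; rc=1; }
    [ -x "$psql" ]   || { echo "psql not executable: $psql"; rc=1; }
    [ -r "$config" ] || { echo "config not readable: $config"; rc=1; }
    id -u "$pgdba" >/dev/null 2>&1 || { echo "user has no uid: $pgdba"; rc=1; }
    return $rc
}

# Example call (commented out; substitute the values from your own primitive):
#   check_pgsql_prereqs /path/to/pg_ctl /bin/psql /path/to/postgresql.conf postgres
```

Each line it prints points at one scenario. An empty result with exit status 0 suggests looking at the agent script itself, or at the ERROR lines in the logs as described above.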
Re: [Pacemaker] Postgres RA won't start
What version of resource-agents package do you use? Old version of pgsql depended on fuser tool installed, otherway it could fail with that error code. On Oct 11, 2011 8:12 AM, "Amar Prasovic" wrote: > Hello everyone, > > I tried to configure postgres RA and I ran into some problems. > > I configured several resources in my cluster config where pgsql was set to > run last, after DRBD, Filesystem, IPAddr2 and nginx. > > Here is how it looks like in crm configure: > > crm(live)configure# show > node webnode01 \ > attributes standby="off" > node webnode02 \ > attributes standby="off" > primitive ClusterIP ocf:heartbeat:IPaddr2 \ > params ip="192.168.10.80" cidr_netmask="32" \ > op monitor interval="30s" > primitive drbd_res ocf:linbit:drbd \ > params drbd_resource="yorxs" \ > op monitor interval="60s" \ > op start interval="0s" timeout="240s" \ > op stop interval="0s" timeout="100s" > primitive fs_res ocf:heartbeat:Filesystem \ > params device="/dev/drbd1" directory="/srv" fstype="ext4" \ > op start interval="0s" timeout="60s" \ > op stop interval="0s" timeout="60s" \ > op monitor interval="60s" timeout="40s" > primitive nginx_res ocf:heartbeat:nginx \ > params configfile="/etc/nginx/nginx.conf" > httpd="/usr/local/sbin/nginx" status10url="http:/127.0.0.1" \ > op monitor interval="10s" timeout="30s" \ > op start interval="0" timeout="40s" \ > op stop interval="0" timeout="60s" > primitive postgres_res ocf:heartbeat:pgsql \ > params psql="/bin/psql" pgdata="/var/lib/postgres/8.4/main" > logfile="/var/log/postgres/postgres.log" \ > op start interval="0" timeout="120s" \ > op stop interval="0" timeout="120s" \ > op monitor interval="30s" timeout="30s" > group cluster_1 fs_res ClusterIP nginx_res postgres_res > ms drbd_cluster drbd_res \ > meta master-max="1" master-node-max="1" clone-max="2" > clone-node-max="1" notify="true" > location prefer_webnode01 cluster_1 50: webnode01 > location prefer_webnode01_drbd drbd_cluster 50: webnode01 > colocation cluster_1_on_drbd inf: 
cluster_1 drbd_cluster:Master > order cluster_1_after_drbd inf: drbd_cluster:promote cluster_1:start > property $id="cib-bootstrap-options" \ > dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \ > cluster-infrastructure="openais" \ > expected-quorum-votes="2" \ > stonith-enabled="false" \ > no-quorum-policy="ignore" \ > last-lrm-refresh="1318326771" > > However, when I run this config, everything except for pgsql starts without > problems. For pgsql, I got the following error: > > in crm_mon > Online: [ webnode02 webnode01 ] > > Master/Slave Set: drbd_cluster > Masters: [ webnode01 ] > Slaves: [ webnode02 ] > Resource Group: cluster_1 > fs_res (ocf::heartbeat:Filesystem):Started webnode01 > ClusterIP (ocf::heartbeat:IPaddr2): Started webnode01 > nginx_res (ocf::heartbeat:nginx):Started webnode01 > postgres_res (ocf::heartbeat:pgsql): Stopped > > Failed actions: > postgres_res_start_0 (node=webnode01, call=84, rc=5, status=complete): > not installed > postgres_res_start_0 (node=webnode02, call=66, rc=5, status=complete): > not installed > > in /var/log/syslog > webnode01 log # cat syslog |grep postgres_res > Oct 11 11:39:34 webnode01 crmd: [921]: info: do_lrm_rsc_op: Performing > key=6:93:7:933bf2ab-00d0-435c-a24f-85897e0c9725 op=postgres_res_monitor_0 ) > Oct 11 11:39:34 webnode01 lrmd: [914]: info: rsc:postgres_res:27: probe > Oct 11 11:39:34 webnode01 crmd: [921]: info: process_lrm_event: LRM > operation postgres_res_monitor_0 (call=27, rc=7, cib-update=36, > confirmed=true) not running > Oct 11 11:39:50 webnode01 crmd: [921]: info: do_lrm_rsc_op: Performing > key=39:96:0:933bf2ab-00d0-435c-a24f-85897e0c9725 op=postgres_res_start_0 ) > Oct 11 11:39:50 webnode01 lrmd: [914]: info: rsc:postgres_res:39: start > Oct 11 11:39:50 webnode01 crmd: [921]: info: process_lrm_event: LRM > operation postgres_res_start_0 (call=39, rc=5, cib-update=47, > confirmed=true) not installed > Oct 11 11:39:50 webnode01 attrd: [918]: info: find_hash_entry: Creating > hash 
entry for fail-count-postgres_res > Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_trigger_update: Sending > flush op to all hosts for: fail-count-postgres_res (INFINITY) > Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_perform_update: Sent > update 63: fail-count-postgres_res=INFINITY > Oct 11 11:39:50 webnode01 attrd: [918]: info: find_hash_entry: Creating > hash entry for last-failure-postgres_res > Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_trigger_update: Sending > flush op to all hosts for: last-failure-postgres_res (1318325990) > Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_perform_update: Sent > update 66: last-failure-postgres_res=1318325990 > Oct 11 11:39:50 webnode01 crmd
[Pacemaker] Postgres RA won't start
Hello everyone,

I tried to configure postgres RA and I ran into some problems.

I configured several resources in my cluster config where pgsql was set to run last, after DRBD, Filesystem, IPAddr2 and nginx.

Here is how it looks like in crm configure:

crm(live)configure# show
node webnode01 \
        attributes standby="off"
node webnode02 \
        attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="192.168.10.80" cidr_netmask="32" \
        op monitor interval="30s"
primitive drbd_res ocf:linbit:drbd \
        params drbd_resource="yorxs" \
        op monitor interval="60s" \
        op start interval="0s" timeout="240s" \
        op stop interval="0s" timeout="100s"
primitive fs_res ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/srv" fstype="ext4" \
        op start interval="0s" timeout="60s" \
        op stop interval="0s" timeout="60s" \
        op monitor interval="60s" timeout="40s"
primitive nginx_res ocf:heartbeat:nginx \
        params configfile="/etc/nginx/nginx.conf" httpd="/usr/local/sbin/nginx" status10url="http:/127.0.0.1" \
        op monitor interval="10s" timeout="30s" \
        op start interval="0" timeout="40s" \
        op stop interval="0" timeout="60s"
primitive postgres_res ocf:heartbeat:pgsql \
        params psql="/bin/psql" pgdata="/var/lib/postgres/8.4/main" logfile="/var/log/postgres/postgres.log" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s" \
        op monitor interval="30s" timeout="30s"
group cluster_1 fs_res ClusterIP nginx_res postgres_res
ms drbd_cluster drbd_res \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location prefer_webnode01 cluster_1 50: webnode01
location prefer_webnode01_drbd drbd_cluster 50: webnode01
colocation cluster_1_on_drbd inf: cluster_1 drbd_cluster:Master
order cluster_1_after_drbd inf: drbd_cluster:promote cluster_1:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1318326771"

However, when I run this config, everything except for pgsql starts without problems. For pgsql, I got the following error:

in crm_mon
Online: [ webnode02 webnode01 ]

 Master/Slave Set: drbd_cluster
     Masters: [ webnode01 ]
     Slaves: [ webnode02 ]
 Resource Group: cluster_1
     fs_res (ocf::heartbeat:Filesystem):    Started webnode01
     ClusterIP (ocf::heartbeat:IPaddr2):    Started webnode01
     nginx_res (ocf::heartbeat:nginx):      Started webnode01
     postgres_res (ocf::heartbeat:pgsql):   Stopped

Failed actions:
    postgres_res_start_0 (node=webnode01, call=84, rc=5, status=complete): not installed
    postgres_res_start_0 (node=webnode02, call=66, rc=5, status=complete): not installed

in /var/log/syslog
webnode01 log # cat syslog |grep postgres_res
Oct 11 11:39:34 webnode01 crmd: [921]: info: do_lrm_rsc_op: Performing key=6:93:7:933bf2ab-00d0-435c-a24f-85897e0c9725 op=postgres_res_monitor_0 )
Oct 11 11:39:34 webnode01 lrmd: [914]: info: rsc:postgres_res:27: probe
Oct 11 11:39:34 webnode01 crmd: [921]: info: process_lrm_event: LRM operation postgres_res_monitor_0 (call=27, rc=7, cib-update=36, confirmed=true) not running
Oct 11 11:39:50 webnode01 crmd: [921]: info: do_lrm_rsc_op: Performing key=39:96:0:933bf2ab-00d0-435c-a24f-85897e0c9725 op=postgres_res_start_0 )
Oct 11 11:39:50 webnode01 lrmd: [914]: info: rsc:postgres_res:39: start
Oct 11 11:39:50 webnode01 crmd: [921]: info: process_lrm_event: LRM operation postgres_res_start_0 (call=39, rc=5, cib-update=47, confirmed=true) not installed
Oct 11 11:39:50 webnode01 attrd: [918]: info: find_hash_entry: Creating hash entry for fail-count-postgres_res
Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-postgres_res (INFINITY)
Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_perform_update: Sent update 63: fail-count-postgres_res=INFINITY
Oct 11 11:39:50 webnode01 attrd: [918]: info: find_hash_entry: Creating hash entry for last-failure-postgres_res
Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-postgres_res (1318325990)
Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_perform_update: Sent update 66: last-failure-postgres_res=1318325990
Oct 11 11:39:50 webnode01 crmd: [921]: info: do_lrm_rsc_op: Performing key=4:97:0:933bf2ab-00d0-435c-a24f-85897e0c9725 op=postgres_res_stop_0 )
Oct 11 11:39:50 webnode01 lrmd: [914]: info: rsc:postgres_res:40: stop
Oct 11 11:39:50 webnode01 crmd: [921]: info: process_lrm_event: LRM operation postgres_res_stop_0 (call=40, rc=0, cib-update=49, confirmed=true) ok

Additional info: /etc/postgresql, /etc/postgresql-common and /var/
Re: [Pacemaker] primary does not run alone
On Tue, Oct 11, 2011 at 09:09:52AM +0900, H.Nakai wrote:
> Hi, Andreas, Lars, and everybody
>
> I will try newer version.
>
> But, I want below.

DRBD has fencing policies (fencing resource-and-stonith, for example), which, if configured, cause it to call fencing handlers (handler { fence-peer }) when appropriate.

There are various fence-peer handlers. One is the "drbd-peer-outdater", which needs dopd, which at this point depends on the heartbeat communication layer.

Then there is the crm-fence-peer.sh script, which works by setting a pacemaker location constraint instead of actually setting the peer outdated.

See if that works like you think it should.

> Primary
>   demote
>   wait 5-10 seconds
>   check Secondary is promoted or
>     still secondary or disconnected
>   if Secondary is promoted and still primary,
>     set local "outdate"
>     (This means shutdown only Primary)
>   if Secondary is still secondary or disconnected,
>     not set local "outdate"
>     (This means shutdown both of Primary and Secondary)
>   disconnect
>   shutdown
> Secondary
>   check Primary
>   if Primary is primary, set local "outdate"
>   if Primary is demoted(secondary), not set "outdate"
>   disconnect
>   shutdown
>
> (2011/10/08 7:14), Lars Ellenberg wrote:
> > On Fri, Oct 07, 2011 at 11:29:57PM +0200, Andreas Kurz wrote:
> >> Hello,
> >>
> >> On 10/07/2011 04:51 AM, H.Nakai wrote:
> >> > Hi, I'm from Japan, in trouble.
> >> > In the case below, the server which was primary
> >> > sometimes does not run drbd/heartbeat.
> >> >
> >> > Server A(primary), Server B(secondary) is running.
> >> > Shutdown A and immediately Shutdown B.
> >> > Switch on only A, it does not run drbd/heartbeat.
> >> >
> >> > It may happen when one server was broken.
> >> >
> >> > I'm using,
> >> > drbd83-8.3.8-1.el5
> >> > heartbeat-3.0.5-1.1.el5
> >> > pacemaker-1.0.11-1.2.el5
> >> > resource-agents-3.9.2-1.1.el5
> >> > centos5.6
> >> > Servers are using two LANs(eth0, eth1) and not using serial cable.
> >> >
> >> > I checked /usr/lib/ocf/resource.d/linbit/drbd,
> >> > and inserted some debug code.
> >> > At drbd_stop(), in while loop,
> >> > only when "Unconfigured", break and call maybe_outdate_self().
> >> > But sometimes, $OCF_RESKEY_CRM_meta_notify_master_uname or
> >> > $OCF_RESKEY_CRM_meta_notify_promote_uname are not null.
> >> > So, at maybe_outdate_self(), it is going to set "outdate".
> >> > And, it always shows the warning messages below. But, "outdated" flag is set.
> >> > "State change failed: Disk state is lower than outdated"
> >> > " state = { cs:StandAlone ro:Secondary/Unknown ds:Diskless/DUnknown r---
> >> > }"
> >> > "wanted = { cs:StandAlone ro:Secondary/Unknown ds:Outdated/DUnknown r---
> >> > }"
> >
> > those are expected and harmless, even though I admit they are annoying.
> >
> >> > I do not want the outdated flag to be set when shutting down both of them.
> >> > I want to know what program sets the $OCF_RESKEY_CRM_* variables,
> >> > under what conditions these variables are set,
> >> > and when these variables are set.
> >>
> >> you need a newer OCF resource agent, at least from DRBD 8.3.9. There was
> >> the new parameter "stop_outdates_secondary" (defaults to true)
> >> introduced ... set this to false to change the behavior of your setup
> >> and be warned: this increases the chance to come up with old (outdated)
> >> data.
> >
> > BTW, that default has changed to false,
> > because of a bug in some version of pacemaker,
> > which got the environment for stop operations wrong.
> > pacemaker 1.0.11 is ok again, iirc.
> >
> > Anyways, if you simply go to DRBD 8.3.11, you should be good.
> > If you want only the agent script, grab it there:
> > http://git.drbd.org/drbd-8.3.git/?a=blob_plain;f=scripts/drbd.ocf
>
> Thanks,
> Nickey

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
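The fencing policy and the fence-peer handler mentioned in this reply are both set in the DRBD resource configuration. As a rough sketch, the script below embeds an illustrative fragment and checks a given file for both settings. The resource name r0 and the handler paths under /usr/lib/drbd are assumptions based on a typical DRBD 8.3 install, and check_drbd_fencing is a hypothetical helper, not part of DRBD.

```shell
#!/bin/sh
# Report whether a DRBD config file sets a fencing policy and a fence-peer handler.
check_drbd_fencing() {
    conf=$1
    if grep -Eq '^[[:space:]]*fencing[[:space:]]+resource-(only|and-stonith)' "$conf"; then
        echo "fencing policy: set"
    else
        echo "fencing policy: missing"
    fi
    if grep -Eq '^[[:space:]]*fence-peer[[:space:]]' "$conf"; then
        echo "fence-peer handler: set"
    else
        echo "fence-peer handler: missing"
    fi
}

# Illustrative fragment only; resource name and script paths are assumptions.
sample=$(mktemp)
cat > "$sample" <<'EOF'
resource r0 {
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
EOF
check_drbd_fencing "$sample"
rm -f "$sample"
```

Pointing the helper at your real drbd.conf (or at the output of "drbdadm dump" saved to a file) shows quickly whether the handler is configured at all before testing how it behaves on shutdown.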
Re: [Pacemaker] [DRBD-user] examples of dual primary DRBD
On 10/11/11 04:35, Andrew Beekhof wrote:
On Mon, Oct 10, 2011 at 9:12 PM, Florian Haas wrote:
On 2011-10-08 15:55, Bart Coninckx wrote:
On 10/08/11 00:25, Lars Ellenberg wrote:
On Fri, Oct 07, 2011 at 10:21:08PM +0200, Bart Coninckx wrote:
On 10/06/11 22:03, Florian Haas wrote:
On 2011-10-06 21:43, Bart Coninckx wrote:

Hi all,

would you mind sending me examples of your crm config for a dual primary DRBD resource? I used the one on http://www.drbd.org/users-guide/s-ocfs2-pacemaker.html and on http://www.clusterlabs.org/wiki/Dual_Primary_DRBD_%2B_OCFS2 and they both result in split brain, except for when I start drbd manually first.

They clearly should not. Rather than soliciting other people's configurations and then trying to adapt yours based on that, why don't you upload _your_ CIB (not just a "crm configure dump", but a full "cibadmin -Q") and your DRBD configuration to your pastebin/pastie/fpaste and let people tell you where your problem is?

OK, I posted the drbd.conf on http://pastebin.com/SQe9YxhY
cibadmin -Q is on http://pastebin.com/gTZqsACq
The split brain logging is on http://pastebin.com/7unKKkdi .

I somehow think you added some "--force" or "--overwrite-data-of-peer" to some drbdadm/drbdsetup primary invocation?

Could this be some sort of timing issue? Manually things are fine, but there are some seconds in between the primary promotions.

OK, seems to be some sort of timing issue. I "fixed" this by adding a "sleep 1" in the RA right before the "do_drbdadm primary $DRBD_RESOURCE" line. I'm surprised though that I'm the first one to run into this.

Er, wait. I'm cross-posting this to the Pacemaker list on a hunch. Andrew, in Boston last year you mentioned you were planning to implement a change to Master/Slave sets in which, iirc, startup and promotion would happen in one fell swoop (I believe the NTT folks made a compelling case for this). Has that change ever been implemented?

Alas no. I still have intentions of doing so, but I was consumed with Matahari for most of this year and have been playing catch-up ever since. If you were inclined, you could (re)create a bug for this in http://bugs.clusterlabs.org

And if so, at which Pacemaker version? Is there a configuration option to revert back to the old behavior where the resource would be started first, and then promotion would occur some time after that?

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now

___
drbd-user mailing list
drbd-u...@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user

Florian,

Does this mean you thought this problem could have been the result of changes done by Andrew to the DRBD RA? But since he hasn't done them yet, it isn't?

thx,

B.