Re: [Pacemaker] Live demo of Pacemaker Cloud on Fedora: Friday August 5th at 8am PST
Steven, are you planning on recording/taping it in case I want to watch it later? Thanks, Bob

From: Steven Dake sd...@redhat.com
To: pcmk-cl...@oss.clusterlabs.org
Cc: aeolus-de...@lists.fedorahosted.org; Fedora Cloud SIG cl...@lists.fedoraproject.org; open...@lists.linux-foundation.org; The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Sent: Wednesday, August 3, 2011 9:42 AM
Subject: [Pacemaker] Live demo of Pacemaker Cloud on Fedora: Friday August 5th at 8am PST

Extending a general invitation to the high availability communities and other cloud community contributors to participate in a live demo I am giving on Friday August 5th at 8am PST (GMT-7). The demo portion of the session is 15 minutes and comes first, followed by more detail on our approach to high availability. I will use Elluminate to show the demo on my desktop machine. To make Elluminate work, you will need icedtea-web installed on your system, which is not typically installed by default. You will also need a conference number and bridge code; please contact me off-list with your location and I'll provide a hopefully toll-free conference number and bridge code.

Elluminate link: https://sas.elluminate.com/m.jnlp?sid=819&password=M.13AB020AEBE358D265FD925A07335F
Bridge code: please contact me off-list with your location and I'll respond with dial-in information.

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
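For anyone joining from Fedora, the icedtea-web plugin the announcement mentions is a normal package install; a one-line sketch (assumes a yum-based Fedora system, run as root):

yum install icedtea-web

Once it is installed, opening the Elluminate .jnlp link above should launch the session via Java Web Start.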
Re: [Pacemaker] Fw: Configuration for FS over DRBD over LVM
One correction: I removed the location constraint and simply went with this: colocation coloc-rule-w-master inf: glance-repos ms_drbd:Master glance-repos-fs-group order glance-order-fs-after-drbd inf: glance-repos:start ms_drbd:promote glance-repos-fs-group:start order glance-order-fs-after-drbd2 inf: glance-repos-fs-group:stop ms_drbd:demote ms_drbd:stop glance-repos:stop I called out the stop of DRBD before the stop of LVM. The syslog attached previously is for this configuration. Thanks, Bob From: Bob Schatz bsch...@yahoo.com To: pacemaker@oss.clusterlabs.org pacemaker@oss.clusterlabs.org Sent: Wednesday, July 20, 2011 11:32 AM Subject: [Pacemaker] Fw: Configuration for FS over DRBD over LVM I tried another test based on this thread: http://www.gossamer-threads.com/lists/linuxha/pacemaker/65928?search_string=lvm%20drbd;#65928 I removed the location constraint and simply went with this: colocation coloc-rule-w-master inf: glance-repos ms_drbd:Master glance-repos-fs-group order glance-order-fs-after-drbd inf: glance-repos:start ms_drbd:promote glance-repos-fs-group:start order glance-order-fs-after-drbd2 inf: glance-repos-fs-group:stop ms_drbd:demote glance-repos:stop The stop actions were called in this order: stop file system demote DRBD stop LVM * stop DRBD * instead of: stop file system demote DRBD stop DRBD ** stop LVM ** I see these messages in the log which I believe are debug messages based on reading other threads: pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-0-start-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-0-start-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-0-stop-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-0-stop-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-1-promote-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-1-promote-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-1-demote-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-1-demote-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-2-start-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-2-start-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-2-stop-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-2-stop-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd2-0-stop-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd2-0-stop-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd2-0-start-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd2-0-start-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd2-1-demote-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd2-1-demote-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd2-1-promote-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd2-1-promote-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd2-2-stop-begin pengine: [21021]: debug: text2task: Unsupported action: 
glance-order-fs-after-drbd2-2-stop-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd2-2-start-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd2-2-start-end I have attached a syslog-pacemaker log of the /etc/init.d/corosync start through /etc/init.d/corosync stop sequence. Thanks, Bob - Forwarded Message - From: Bob Schatz bsch...@yahoo.com To: pacemaker@oss.clusterlabs.org pacemaker@oss.clusterlabs.org Sent: Tuesday, July 19, 2011 4:38 PM Subject: [Pacemaker] Configuration for FS over DRBD over LVM Hi, I am trying to configure an FS running on top of DRBD on top of LVM or: FS | DRBD | LVM I am using Pacemaker 1.0.8, Ubuntu 10.04 and DRBD 8.3.7. Reviewing all the manuals (Pacemaker Explained 1.0, DRBD 8.4 User Guide, etc) I came up with this Pacemaker configuration: node cnode-1-3-5 node cnode-1-3-6 primitive glance-drbd ocf:linbit:drbd \ params drbd_resource=glance-repos-drbd \ op start interval=0 timeout=240 \ op stop interval=0 timeout=100 \ op monitor
[Pacemaker] Fw: Fw: Configuration for FS over DRBD over LVM
Okay, this configuration works on one node (I am waiting for a hardware problem to be fixed before testing with second node): node cnode-1-3-5 node cnode-1-3-6 primitive glance-drbd ocf:linbit:drbd \ params drbd_resource=glance-repos-drbd \ op start interval=0 timeout=240 \ op stop interval=0 timeout=100 \ op monitor interval=59s role=Master timeout=30s \ op monitor interval=61s role=Slave timeout=30s primitive glance-fs ocf:heartbeat:Filesystem \ params device=/dev/drbd1 directory=/glance-mount fstype=ext4 \ op start interval=0 timeout=60 \ op monitor interval=60 timeout=60 OCF_CHECK_LEVEL=20 \ op stop interval=0 timeout=120 primitive glance-ip ocf:heartbeat:IPaddr2 \ params ip=10.4.0.25 nic=br100:1 \ op monitor interval=5s primitive glance-repos ocf:heartbeat:LVM \ params volgrpname=glance-repos exclusive=true \ op start interval=0 timeout=30 \ op stop interval=0 timeout=30 group glance-repos-fs-group glance-fs glance-ip \ meta target-role=Started ms ms_drbd glance-drbd \ meta master-node-max=1 clone-max=2 clone-node-max=1 globally-unique=false notify=true target-role=Master colocation coloc-rule-w-master inf: ms_drbd:Master glance-repos-fs-group colocation coloc-rule-w-master2 inf: glance-repos ms_drbd:Master order glance-order-fs-after-drbd inf: glance-repos:start ms_drbd:start order glance-order-fs-after-drbd-stop inf: glance-repos-fs-group:stop ms_drbd:demote order glance-order-fs-after-drbd-stop2 inf: ms_drbd:demote ms_drbd:stop order glance-order-fs-after-drbd-stop3 inf: ms_drbd:stop glance-repos:stop order glance-order-fs-after-drbd2 inf: ms_drbd:start ms_drbd:promote order glance-order-fs-after-drbd3 inf: ms_drbd:promote glance-repos-fs-group:start property $id=cib-bootstrap-options \ dc-version=1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd \ cluster-infrastructure=openais \ expected-quorum-votes=1 \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1310768814 I will let everyone know how testing goes. Thanks, Bob - Forwarded Message - From: Bob Schatz bsch...@yahoo.com To: pacemaker@oss.clusterlabs.org pacemaker@oss.clusterlabs.org Sent: Wednesday, July 20, 2011 1:38 PM Subject: [Pacemaker] Fw: Configuration for FS over DRBD over LVM One correction: I removed the location constraint and simply went with this: colocation coloc-rule-w-master inf: glance-repos ms_drbd:Master glance-repos-fs-group order glance-order-fs-after-drbd inf: glance-repos:start ms_drbd:promote glance-repos-fs-group:start order glance-order-fs-after-drbd2 inf: glance-repos-fs-group:stop ms_drbd:demote ms_drbd:stop glance-repos:stop I called out the stop of DRBD before the stop of LVM. The syslog attached previously is for this configuration. 
Thanks, Bob From: Bob Schatz bsch...@yahoo.com To: pacemaker@oss.clusterlabs.org pacemaker@oss.clusterlabs.org Sent: Wednesday, July 20, 2011 11:32 AM Subject: [Pacemaker] Fw: Configuration for FS over DRBD over LVM I tried another test based on this thread: http://www.gossamer-threads.com/lists/linuxha/pacemaker/65928?search_string=lvm%20drbd;#65928 I removed the location constraint and simply went with this: colocation coloc-rule-w-master inf: glance-repos ms_drbd:Master glance-repos-fs-group order glance-order-fs-after-drbd inf: glance-repos:start ms_drbd:promote glance-repos-fs-group:start order glance-order-fs-after-drbd2 inf: glance-repos-fs-group:stop ms_drbd:demote glance-repos:stop The stop actions were called in this order: stop file system demote DRBD stop LVM * stop DRBD * instead of: stop file system demote DRBD stop DRBD ** stop LVM ** I see these messages in the log which I believe are debug messages based on reading other threads: pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-0-start-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-0-start-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-0-stop-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-0-stop-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-1-promote-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-1-promote-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-1-demote-begin pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-1-demote-end pengine: [21021]: debug: text2task: Unsupported action: glance-order-fs-after-drbd-2-start-begin pengine: [21021]: debug: text2task: Unsupported action: glance
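One way to sanity-check a stop/demote ordering like the one in the working configuration at the top of this message, before actually stopping corosync, is to let the policy engine compute the transition offline. A sketch using tools that ship with Pacemaker 1.0 (exact verbosity flags may vary by build):

# dump the live CIB, then ask the policy engine what it would schedule
cibadmin --query > /tmp/cib.xml
ptest -x /tmp/cib.xml -VVV

With enough verbosity ptest logs the stop/demote/start actions it would order, which makes it easy to confirm that ms_drbd:stop is scheduled before glance-repos:stop.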
[Pacemaker] Configuration for FS over DRBD over LVM
Hi, I am trying to configure an FS running on top of DRBD on top of LVM or: FS | DRBD | LVM I am using Pacemaker 1.0.8, Ubuntu 10.04 and DRBD 8.3.7. Reviewing all the manuals (Pacemaker Explained 1.0, DRBD 8.4 User Guide, etc) I came up with this Pacemaker configuration: node cnode-1-3-5 node cnode-1-3-6 primitive glance-drbd ocf:linbit:drbd \ params drbd_resource=glance-repos-drbd \ op start interval=0 timeout=240 \ op stop interval=0 timeout=100 \ op monitor interval=59s role=Master timeout=30s \ op monitor interval=61s role=Slave timeout=30s primitive glance-fs ocf:heartbeat:Filesystem \ params device=/dev/drbd1 directory=/glance-mount fstype=ext4 \ op start interval=0 timeout=60 \ op monitor interval=60 timeout=60 OCF_CHECK_LEVEL=20 \ op stop interval=0 timeout=120 primitive glance-repos ocf:heartbeat:LVM \ params volgrpname=glance-repos exclusive=true \ op start interval=0 timeout=30 \ op stop interval=0 timeout=30 group glance-repos-fs-group glance-fs ms ms_drbd glance-drbd \ meta master-node-max=1 clone-max=2 clone-node-max=1 globally-unique=false notify=true target-role=Master location drbd_on_node1 ms_drbd \ rule $id=drbd_on_node1-rule $role=Master 100: #uname eq cnode-1-3-5 colocation coloc-rule-w-master inf: glance-repos ms_drbd:Master order glance-order-fs-after-drbd inf: glance-repos:start ms_drbd:promote glance-repos-fs-group:start property $id=cib-bootstrap-options \ dc-version=1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd \ cluster-infrastructure=openais \ expected-quorum-votes=1 \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1310768814 On one node, things come up cleanly. In fact, debug messages in the agent show that the start() functions for the agent are called and exited in order (LVM start, DRBD start and Filesystem start). The problem occurs when I do /etc/init.d/corosync stop on a single node. What happens is that the stop() functions are called in this order: 1. LVM stop 2. Filesystem stop 3. DRBD stop What I have tried: 1. I tried setting the score of the order to 500 assuming that this would mean the colocation rule would hit first. Still the same problem. 2. I tried leaving off the :start and :promote options on the order line. The stop order was still LVM, Filesystem, and DRBD 3. I tried adding another colocation rule colocation coloc-rule-w-master2 inf: ms_drbd:Master glance-repos-fs-group to tie glance-repos-fs-group to the same node as DRBD. Stop still had the same issue. I assume that I will still need this rule when I add a second node to the test. Any suggestions would be appreciated. A side note, the reason I have a group for the file system is that I would like to add an application and IP address to the group once I get this working. Also, the reason I have LVM under DRBD is that I want to be able to grow the LVM volume as needed and then expand the DRBD volume. Thanks in advance, Bob___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
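For comparison with the problem described above: the configuration that was eventually reported to work (earlier in this digest) makes each step of the shutdown sequence an explicit constraint instead of relying on one combined order statement. A minimal sketch using the resource names from this post (the constraint IDs are made up for illustration):

order fs-group-stop-before-demote inf: glance-repos-fs-group:stop ms_drbd:demote
order drbd-demote-before-drbd-stop inf: ms_drbd:demote ms_drbd:stop
order drbd-stop-before-lvm-stop inf: ms_drbd:stop glance-repos:stop

Spelled out this way, the LVM stop is ordered after the DRBD stop, which is exactly the sequence that was missing in the behaviour reported above.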
Re: [Pacemaker] Question regarding starting of master/slave resources and ELECTIONs
Andrew, Comments at end with BS From: Andrew Beekhof and...@beekhof.net To: Bob Schatz bsch...@yahoo.com Cc: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Fri, April 15, 2011 4:28:52 AM Subject: Re: [Pacemaker] Question regarding starting of master/slave resources and ELECTIONs On Fri, Apr 15, 2011 at 5:58 AM, Bob Schatz bsch...@yahoo.com wrote: Andrew, Thanks for the help Comments inline with BS From: Andrew Beekhof and...@beekhof.net To: Bob Schatz bsch...@yahoo.com Cc: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thu, April 14, 2011 2:14:40 AM Subject: Re: [Pacemaker] Question regarding starting of master/slave resources and ELECTIONs On Thu, Apr 14, 2011 at 10:49 AM, Andrew Beekhof and...@beekhof.net wrote: I noticed that 4 of the master/slave resources will start right away but the 5 master/slave resource seems to take a minute or so and I am only running with one node. Is this expected? Probably, if the other 4 take around a minute each to start. There is an lrmd config variable that controls how much parallelism it allows (but i forget the name). Bob It's max-children and I set it to 40 for this test to see if it would change the behavior. (/sbin/lrmadmin -p max-children 40) Thats surprising. I'll have a look at the logs. Looking at the logs, I see a couple of things: This is very bad: Apr 12 19:33:42 mgraid-S30311-1 crmd: [17529]: WARN: get_uuid: Could not calculate UUID for mgraid-s30311-0 Apr 12 19:33:42 mgraid-S30311-1 crmd: [17529]: WARN: populate_cib_nodes_ha: Node mgraid-s30311-0: no uuid found For some reason pacemaker cant get the node's uuid from heartbeat. BS I create the uuid when the node comes up. Heartbeat should have already created it before pacemaker even got started though. 
So we start a few things: Apr 12 19:33:41 mgraid-S30311-1 crmd: [17529]: info: do_lrm_rsc_op: Performing key=23:3:0:48aac631-8177-4cda-94ea-48dfa9b1a90f op=SSS30311:0_start_0 ) Apr 12 19:33:41 mgraid-S30311-1 crmd: [17529]: info: do_lrm_rsc_op: Performing key=49:3:0:48aac631-8177-4cda-94ea-48dfa9b1a90f op=SSJ30312:0_start_0 ) Apr 12 19:33:41 mgraid-S30311-1 crmd: [17529]: info: do_lrm_rsc_op: Performing key=75:3:0:48aac631-8177-4cda-94ea-48dfa9b1a90f op=SSJ30313:0_start_0 ) Apr 12 19:33:41 mgraid-S30311-1 crmd: [17529]: info: do_lrm_rsc_op: Performing key=101:3:0:48aac631-8177-4cda-94ea-48dfa9b1a90f op=SSJ30314:0_start_0 ) But then another change comes in: Apr 12 19:33:41 mgraid-S30311-1 crmd: [17529]: info: abort_transition_graph: need_abort:59 - Triggered transition abort (complete=0) : Non-status change Normally we'd recompute and keep going, but it was a(nother) replace operation, so: Apr 12 19:33:42 mgraid-S30311-1 crmd: [17529]: info: do_state_transition: State transition S_TRANSITION_ENGINE - S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=do_cib_replaced ] All the time goes here: Apr 12 19:35:31 mgraid-S30311-1 crmd: [17529]: WARN: action_timer_callback: Timer popped (timeout=2, abort_level=100, complete=true) Apr 12 19:35:31 mgraid-S30311-1 crmd: [17529]: WARN: action_timer_callback: Ignoring timeout while not in transition Apr 12 19:35:31 mgraid-S30311-1 crmd: [17529]: WARN: action_timer_callback: Timer popped (timeout=2, abort_level=100, complete=true) Apr 12 19:35:31 mgraid-S30311-1 crmd: [17529]: WARN: action_timer_callback: Ignoring timeout while not in transition Apr 12 19:35:31 mgraid-S30311-1 crmd: [17529]: WARN: action_timer_callback: Timer popped (timeout=2, abort_level=100, complete=true) Apr 12 19:35:31 mgraid-S30311-1 crmd: [17529]: WARN: action_timer_callback: Ignoring timeout while not in transition Apr 12 19:35:31 mgraid-S30311-1 crmd: [17529]: WARN: action_timer_callback: Timer popped (timeout=2, abort_level=100, complete=true) Apr 12 19:35:31 mgraid-S30311-1 crmd: [17529]: WARN: action_timer_callback: Ignoring timeout while not in transition Apr 12 19:37:00 mgraid-S30311-1 crmd: [17529]: ERROR: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped! but its not at all clear to me why - although certainly avoiding the election would help. Is there any chance to load all the changes at once? BS Yes. That worked. I created the configuration in a file and then did a crm configure load update filename to avoid the election Possibly the delay related to the UUID issue above, possibly it might be related to one of these two patches that went in after 1.0.9 andrew (stable-1.0)High: crmd: Make sure we always poke the FSA after a transition to clear any TE_HALT actions CS: 9187c0506fd3 On: 2010-07-07 andrew (stable-1.0)High: crmd: Reschedule the PE_START action if its not already running when we try
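The workaround Bob describes above, creating the whole configuration in a file and committing it in one step, looks roughly like this with the crm shell (the file name is illustrative):

# write all primitives, ms resources and constraints into one file, then:
crm configure load update /tmp/mgraid-config.crm

Loading the file produces a single CIB update instead of one per crm command, which avoids the repeated CIB replacements that were triggering the elections discussed in this thread.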
Re: [Pacemaker] Question regarding starting of master/slave resources and ELECTIONs
Andrew, Thanks for responding. Comments inline with Bob From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Cc: Bob Schatz bsch...@yahoo.com Sent: Tue, April 12, 2011 11:23:14 PM Subject: Re: [Pacemaker] Question regarding starting of master/slave resources and ELECTIONs On Wed, Apr 13, 2011 at 4:54 AM, Bob Schatz bsch...@yahoo.com wrote: Hi, I am running Pacemaker 1.0.9 with Heartbeat 3.0.3. I create 5 master/slave resources in /etc/ha.d/resource.d/startstop during post-start. I had no idea this was possible. Why would you do this? Bob We and I know of a couple of other companies, bundle LinuxHA/Pacemaker into an appliance. For me, when the appliance boots, it creates HA resources based on the hardware it discovers. I assumed that once POST-START was called in the startstop script and we have a DC then the cluster is up and running. I then use crm commands to create the configuration, etc. I further assumed that since we have one DC in the cluster then all crm commands which modify the configuration would be ordered even if the DC fails over to a different node. Is this incorrect? I noticed that 4 of the master/slave resources will start right away but the 5 master/slave resource seems to take a minute or so and I am only running with one node. Is this expected? Probably, if the other 4 take around a minute each to start. There is an lrmd config variable that controls how much parallelism it allows (but i forget the name). Bob It's max-children and I set it to 40 for this test to see if it would change the behavior. (/sbin/lrmadmin -p max-children 40) My configuration is below and I have also attached ha-debug. Also, what triggers a crmd election? Node up/down events and whenever someone replaces the cib (which the shell used to do a lot). Bob For my test, I only started one node so that I could avoid node up/down events. I believe the log shows the cib being replaced. Since I am using crm then I assume it must be due to crm. Do the crm_resource, etc commands also replace the cib? Would that avoid elections as a result of cibs being replaced? Thanks, Bob I seemed to have a lot of elections in the attached log. I was assuming that on a single node I would only run the election once in the beginning and then there would not be another one until a new node joined. 
Thanks, Bob My configuration is: node $id=856c1f72-7cd1-4906-8183-8be87eef96f2 mgraid-s30311-1 primitive SSJ30312 ocf:omneon:ss \ params ss_resource=SSJ30312 ssconf=/var/omneon/config/config.J30312 \ op monitor interval=3s role=Master timeout=7s \ op monitor interval=10s role=Slave timeout=7 \ op stop interval=0 timeout=20 \ op start interval=0 timeout=300 primitive SSJ30313 ocf:omneon:ss \ params ss_resource=SSJ30313 ssconf=/var/omneon/config/config.J30313 \ op monitor interval=3s role=Master timeout=7s \ op monitor interval=10s role=Slave timeout=7 \ op stop interval=0 timeout=20 \ op start interval=0 timeout=300 primitive SSJ30314 ocf:omneon:ss \ params ss_resource=SSJ30314 ssconf=/var/omneon/config/config.J30314 \ op monitor interval=3s role=Master timeout=7s \ op monitor interval=10s role=Slave timeout=7 \ op stop interval=0 timeout=20 \ op start interval=0 timeout=300 primitive SSJ30315 ocf:omneon:ss \ params ss_resource=SSJ30315 ssconf=/var/omneon/config/config.J30315 \ op monitor interval=3s role=Master timeout=7s \ op monitor interval=10s role=Slave timeout=7 \ op stop interval=0 timeout=20 \ op start interval=0 timeout=300 primitive SSS30311 ocf:omneon:ss \ params ss_resource=SSS30311 ssconf=/var/omneon/config/config.S30311 \ op monitor interval=3s role=Master timeout=7s \ op monitor interval=10s role=Slave timeout=7 \ op stop interval=0 timeout=20 \ op start interval=0 timeout=300 primitive icms lsb:S53icms \ op monitor interval=5s timeout=7 \ op start interval=0 timeout=5 primitive mgraid-stonith stonith:external/mgpstonith \ params hostlist=mgraid-canister \ op monitor interval=0 timeout=20s primitive omserver lsb:S49omserver \ op monitor interval=5s timeout=7 \ op start interval=0 timeout=5 ms ms-SSJ30312 SSJ30312 \ meta clone-max=2 notify=true globally-unique=false target-role=Started ms ms-SSJ30313 SSJ30313 \ meta clone-max=2 notify=true globally-unique=false target-role=Started ms ms-SSJ30314 SSJ30314 \ meta clone-max=2 notify=true globally-unique=false target-role=Started ms ms-SSJ30315 SSJ30315 \ meta clone-max=2 notify=true globally-unique=false
[Pacemaker] Clearing a resource which returned not installed from START
I am running Pacemaker 1.0.9 and Heartbeat 3.0.3. I started a resource and the agent start method returned OCF_ERR_INSTALLED. I have fixed the problem and I would like to restart the resource and I cannot get it to restart. Any ideas? Thanks, Bob The failcounts are 0 as shown below and with the crm_resource command: # crm_mon -1 -f Last updated: Wed Mar 30 19:55:39 2011 Stack: Heartbeat Current DC: mgraid-sd6661-0 (f4e5e15c-d06b-4e37-89b9-4621af05128f) - partition with quorum Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677 2 Nodes configured, unknown expected votes 5 Resources configured. Online: [ mgraid-sd6661-1 mgraid-sd6661-0 ] Clone Set: Fencing Started: [ mgraid-sd6661-1 mgraid-sd6661-0 ] Clone Set: cloneIcms Started: [ mgraid-sd6661-1 mgraid-sd6661-0 ] Clone Set: cloneOmserver Started: [ mgraid-sd6661-1 mgraid-sd6661-0 ] Master/Slave Set: ms-SSSD6661 Masters: [ mgraid-sd6661-0 ] Slaves: [ mgraid-sd6661-1 ] Master/Slave Set: ms-SSJD6662 Masters: [ mgraid-sd6661-0 ] Stopped: [ SSJD6662:0 ] Migration summary: * Node mgraid-sd6661-0: * Node mgraid-sd6661-1: Failed actions: SSJD6662:0_start_0 (node=mgraid-sd6661-1, call=27, rc=5, status=complete): not installed I have also tried to cleanup the resource with these commands: # crm_resource --resource SSJD6662:0 --cleanup --node mgraid-sd6661-1 # crm_resource --resource SSJD6662:1 --cleanup --node mgraid-sd6661-1 # crm_resource --resource SSJD6662:0 --cleanup --node mgraid-sd6661-0 # crm_resource --resource SSJD6662:1 --cleanup --node mgraid-sd6661-0 # crm_resource --resource ms-SSJD6662 --cleanup --node mgraid-sd6661-1 # crm resource start SSJD6662:0 My configuration is: node $id=856c1f72-7cd1-4906-8183-8be87eef96f2 mgraid-sd6661-1 node $id=f4e5e15c-d06b-4e37-89b9-4621af05128f mgraid-sd6661-0 primitive SSJD6662 ocf:omneon:ss \ params ss_resource=SSJD6662 ssconf=/var/omneon/config/config.JD6662 \ op monitor interval=3s role=Master timeout=7s \ op monitor interval=10s role=Slave timeout=7 \ op stop interval=0 timeout=20 \ op start interval=0 timeout=300 primitive SSSD6661 ocf:omneon:ss \ params ss_resource=SSSD6661 ssconf=/var/omneon/config/config.SD6661 \ op monitor interval=3s role=Master timeout=7s \ op monitor interval=10s role=Slave timeout=7 \ op stop interval=0 timeout=20 \ op start interval=0 timeout=300 primitive icms lsb:S53icms \ op monitor interval=5s timeout=7 \ op start interval=0 timeout=5 primitive mgraid-stonith stonith:external/mgpstonith \ params hostlist=mgraid-canister \ op monitor interval=0 timeout=20s primitive omserver lsb:S49omserver \ op monitor interval=5s timeout=7 \ op start interval=0 timeout=5 ms ms-SSJD6662 SSJD6662 \ meta clone-max=2 notify=true globally-unique=false target-role=Started ms ms-SSSD6661 SSSD6661 \ meta clone-max=2 notify=true globally-unique=false target-role=Started clone Fencing mgraid-stonith clone cloneIcms icms clone cloneOmserver omserver location ms-SSJD6662-master-w1 ms-SSJD6662 \ rule $id=ms-SSJD6662-master-w1-rule $role=master 100: #uname eq mgraid-sd6661-1 location ms-SSSD6661-master-w1 ms-SSSD6661 \ rule $id=ms-SSSD6661-master-w1-rule $role=master 100: #uname eq mgraid-sd6661-0 order orderms-SSJD6662 0: cloneIcms ms-SSJD6662 order orderms-SSSD6661 0: cloneIcms ms-SSSD6661 property $id=cib-bootstrap-options \ dc-version=1.0.9-89bd754939df5150de7cd76835f98fe90851b677 \ cluster-infrastructure=Heartbeat \ dc-deadtime=5s \ stonith-enabled=true \ last-lrm-refresh=1301536426 ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
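No resolution is quoted in this thread, but the usual first step for a failed start like the rc=5 "not installed" entry above is to clean up the master/slave parent rather than the individual :0/:1 instances; a sketch with the resource names from this configuration (crm shell as shipped with Pacemaker 1.0 assumed):

# clear the failed operation (and its fail count) for the whole clone
crm resource cleanup ms-SSJD6662
# then ask the cluster to run it again via the parent, not SSJD6662:0
crm resource start ms-SSJD6662

A "not installed" result is treated as a hard failure, so until the failed operation is cleaned up on the node that reported it, that node stays blocked for the resource even though the fail counts show 0.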
Re: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg
A few more thoughts that occurred after I hit return 1. This problem sees to only occur when /etc/init.d/heartbeat start is executed on two nodes at the same time. If I only do one at a time it does not seem to occur. (this may be related to the creation of master/slave resources in /etc/ha.d/resource.d/startstop when heartbeat starts) 2. This problem seemed to occur most frequently when I went from 4 master/slave resources to 6 master/slave resources. Thanks, Bob - Original Message From: Bob Schatz bsch...@yahoo.com To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Fri, March 25, 2011 4:22:39 PM Subject: Re: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg After reading more threads, I noticed that I needed to include the PE outputs. Therefore, I have rerun the tests and included the PE outputs, the configuration file and the logs for both nodes. The test was rerun with max-children of 20. Thanks, Bob - Original Message From: Bob Schatz bsch...@yahoo.com To: pacemaker@oss.clusterlabs.org Sent: Thu, March 24, 2011 7:35:54 PM Subject: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg I am getting these messages in the log: 2011-03-24 18:53:12| warning |crmd: [27913]: WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg 2011-03-24 18:53:12| info |crmd: [27913]: info: msg_to_op: Message follows: 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 16 fields 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [lrm_t=op] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : [lrm_rid=SSJE02A2:0] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [lrm_op=start] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : [lrm_timeout=30] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [lrm_interval=0] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [lrm_delay=0] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : [lrm_copyparams=1] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [lrm_t_run=0] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : [lrm_t_rcchange=0] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [lrm_exec_time=0] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : [lrm_queue_time=0] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : [lrm_targetrc=-1] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [lrm_app=crmd] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : [lrm_userdata=91:3:0:dc9ad1c7-1d74-4418-a002-34426b34b576] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : [(2)lrm_param=0x64c230(938 1098)] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 27 fields 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [CRM_meta_clone=0] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : [CRM_meta_notify_slave_resource= ] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [CRM_meta_notify_active_resource= ] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : [CRM_meta_notify_demote_uname= ] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [CRM_meta_notify_inactive_resource=SSJE02A2:0 SSJE02A2:1 ] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [ssconf=/var/omneon/config/config.JE02A2] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : [CRM_meta_master_node_max=1] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [CRM_meta_notify_stop_resource= ] 
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : [CRM_meta_notify_master_resource= ] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [CRM_meta_clone_node_max=1] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : [CRM_meta_clone_max=2] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : [CRM_meta_notify=true] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [CRM_meta_notify_start_resource=SSJE02A2:0 SSJE02A2:1 ] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : [CRM_meta_notify_stop_uname= ] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : [crm_feature_set=3.0.1] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : [CRM_meta_notify_master_uname= ] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[16] : [CRM_meta_master_max=1] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[17] : [CRM_meta_globally_unique=false] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[18] : [CRM_meta_notify_promote_resource=SSJE02A2:0 ] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[19] : [CRM_meta_notify_promote_uname=mgraid-se02a1-0 ] 2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[20] : [CRM_meta_notify_active_uname= ] 2011-03-24 18:53:12| info
Re: [Pacemaker] Return value from promote function
Thanks Andrew! This works. - Original Message From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Thu, February 10, 2011 1:37:52 AM Subject: Re: [Pacemaker] Return value from promote function On Tue, Feb 8, 2011 at 3:42 AM, Bob Schatz bsch...@yahoo.com wrote: I am running Pacemaker 1.0.9.1 and Heartbeat 3.0.3. I have a master/slave resource with an agent. When the resource hangs while doing a promote, the resource returns OCF_ERR_GENERIC. However, all this does is call demote on the resource, restart the resource on the same node and then retry the promote again on the same node. Is there anyway I can have the CRM promote the resource on the peer node instead? Have the agent set a different promotion score with crm_master. My configuration is: node $id=856c1f72-7cd1-4906-8183-8be87eef96f2 mgraid-mkp9010repk-1 node $id=f4e5e15c-d06b-4e37-89b9-4621af05128f mgraid-mkp9010repk-0 primitive SSMKP9010REPK ocf:omneon:ss \ params ss_resource=SSMKP9010REPK ssconf=/var/omneon/config/config.MKP9010REPK \ op monitor interval=3s role=Master timeout=7s \ op monitor interval=10s role=Slave timeout=7 \ op stop interval=0 timeout=120 \ op start interval=0 timeout=600 primitive icms lsb:S53icms \ op monitor interval=5s timeout=7 \ op start interval=0 timeout=5 primitive mgraid-stonith stonith:external/mgpstonith \ params hostlist=mgraid-canister \ op monitor interval=0 timeout=20s primitive omserver lsb:S49omserver \ op monitor interval=5s timeout=7 \ op start interval=0 timeout=5 ms ms-SSMKP9010REPK SSMKP9010REPK \ meta clone-max=2 notify=true globally-unique=false target-role=Master clone Fencing mgraid-stonith clone cloneIcms icms clone cloneOmserver omserver location ms-SSMKP9010REPK-master-w1 ms-SSMKP9010REPK \ rule $id=ms-SSMKP9010REPK-master-w1-rule $role=master 100: #uname eq mgraid-mkp9010repk-0 order orderms-SSMKP9010REPK 0: cloneIcms ms-SSMKP9010REPK property $id=cib-bootstrap-options \ dc-version=1.0.9-89bd754939df5150de7cd76835f98fe90851b677 \ cluster-infrastructure=Heartbeat \ dc-deadtime=5s \ stonith-enabled=true Thanks, Bob ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
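Andrew's suggestion in command form: the agent adjusts its own node's promotion score through crm_master, so that after a failed promote the peer node becomes the preferred master. A minimal sketch (the score values are illustrative, not taken from the ss agent):

# in the agent, when this node should not be promoted:
crm_master -Q -l reboot -v -INFINITY
# when the node is healthy again, restore a normal preference (or delete the attribute with -D):
crm_master -Q -l reboot -v 100

The same call, with a different value, is what the agent's start action already does according to a later message in this digest (crm_master -Q -l reboot -v -10).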
[Pacemaker] Return value from promote function
I am running Pacemaker 1.0.9.1 and Heartbeat 3.0.3. I have a master/slave resource with an agent. When the resource hangs while doing a promote, the resource returns OCF_ERR_GENERIC. However, all this does is call demote on the resource, restart the resource on the same node and then retry the promote again on the same node. Is there anyway I can have the CRM promote the resource on the peer node instead? My configuration is: node $id=856c1f72-7cd1-4906-8183-8be87eef96f2 mgraid-mkp9010repk-1 node $id=f4e5e15c-d06b-4e37-89b9-4621af05128f mgraid-mkp9010repk-0 primitive SSMKP9010REPK ocf:omneon:ss \ params ss_resource=SSMKP9010REPK ssconf=/var/omneon/config/config.MKP9010REPK \ op monitor interval=3s role=Master timeout=7s \ op monitor interval=10s role=Slave timeout=7 \ op stop interval=0 timeout=120 \ op start interval=0 timeout=600 primitive icms lsb:S53icms \ op monitor interval=5s timeout=7 \ op start interval=0 timeout=5 primitive mgraid-stonith stonith:external/mgpstonith \ params hostlist=mgraid-canister \ op monitor interval=0 timeout=20s primitive omserver lsb:S49omserver \ op monitor interval=5s timeout=7 \ op start interval=0 timeout=5 ms ms-SSMKP9010REPK SSMKP9010REPK \ meta clone-max=2 notify=true globally-unique=false target-role=Master clone Fencing mgraid-stonith clone cloneIcms icms clone cloneOmserver omserver location ms-SSMKP9010REPK-master-w1 ms-SSMKP9010REPK \ rule $id=ms-SSMKP9010REPK-master-w1-rule $role=master 100: #uname eq mgraid-mkp9010repk-0 order orderms-SSMKP9010REPK 0: cloneIcms ms-SSMKP9010REPK property $id=cib-bootstrap-options \ dc-version=1.0.9-89bd754939df5150de7cd76835f98fe90851b677 \ cluster-infrastructure=Heartbeat \ dc-deadtime=5s \ stonith-enabled=true Thanks, Bob ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] OCF RA dev guide: final heads up
Florian, Comments below with [BS] Thanks, Bob

- Original Message From: Florian Haas florian.h...@linbit.com
To: pacemaker@oss.clusterlabs.org
Sent: Mon, December 6, 2010 7:25:28 AM
Subject: Re: [Pacemaker] OCF RA dev guide: final heads up

Hello Bob, On 2010-12-03 20:12, Bob Schatz wrote:

Florian, Thanks for writing this! I already found one or two errors related to return codes in my agent based on your document. :) I have not read the entire document but I do have these comments:

1. Does this document apply to all versions of the agent framework or only certain versions (hopefully all in one place)? I think the document should have a section which specifies which versions are covered. Also, if certain areas only apply to a certain version then a Note should be mentioned in the section.

2. In Section 3.8 OCF_NOT_RUNNING, how can a monitor return OCF_FAILED_MASTER? Is there an environment variable passed to the monitor action which says "I think you are a master - tell me whether you are or are not"?

No, the very purpose of monitor is to _find out_ the status of the resource. If the resource can query its own master/slave status, it should do so, and then if it is both a master and failed, it should return OCF_FAILED_MASTER.

[BS] Okay. That makes sense now.

3. In Section 5.3 monitor action, it would be nice if you showed how OCF_FAILED_MASTER is returned.

Hm. Let me defer that for a little bit.

[BS] Sounds good

4. Sections 5.8 migrate_to action and 5.9 migrate_from action, do these apply to master/slave resources also or only to primitive resources?

Good question, and indeed I don't know. It's conceivable that a clone set (remember, m/s are just clones with a little extra) has a clone-max that is less than the number of nodes in the cluster, and supports migration, and therefore a clone instance should be able to live-migrate to a different node. I have no clue whether it's indeed implemented that way, though. Andrew, maybe you can shed some extra light on this?

5. Section 5.10 notify action, I think you want to add a note/reference to the Pacemaker Configuration Explained section 10.3.3.9 Proper Interpretation of Notification Environment Variables. (Section name may be different as I was looking at 1.0 from about a year ago).

Good idea. I'll put that on my to-do list.

6. Section 8.4 Specifying a master preference, starting in at least Pacemaker 1.0.9.1 it is possible to specify a negative master score. I think it would be good to add this to the example as well as a note about which version has this functionality since it was broken in 1.0.6.

Don't you think this would just royally confuse people?

[BS] You are probably right. I guess you don't want to document bugs and workarounds from past releases in the current manual. That makes sense.

Florian

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
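To make questions 2 and 3 above concrete, a monitor that reports a failed master could look roughly like the sketch below. ss_is_running, ss_is_master and ss_is_healthy are hypothetical helpers standing in for whatever status query the managed application provides; the OCF_* return codes come from the ocf-shellfuncs file that agents normally source.

ss_monitor() {
    # Not running at all: the only correct answer is OCF_NOT_RUNNING.
    ss_is_running || return $OCF_NOT_RUNNING
    if ss_is_master; then
        # Promoted: report either a healthy master or a failed master.
        if ss_is_healthy; then
            return $OCF_RUNNING_MASTER
        fi
        return $OCF_FAILED_MASTER
    fi
    # Running but not promoted: plain success means "running as slave".
    return $OCF_SUCCESS
}

As Florian notes above, there is no environment variable telling the monitor which role it should find; the agent has to ask the application itself.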
Re: [Pacemaker] OCF RA dev guide: final heads up
Florian, Thanks for writing this! I already found one or two errors related to return codes in my agent based on your document. :) I have not read the entire document but I do have these comments:

1. Does this document apply to all versions of the agent framework or only certain versions (hopefully all in one place)? I think the document should have a section which specifies which versions are covered. Also, if certain areas only apply to a certain version then a Note should be mentioned in the section.

2. In Section 3.8 OCF_NOT_RUNNING, how can a monitor return OCF_FAILED_MASTER? Is there an environment variable passed to the monitor action which says "I think you are a master - tell me whether you are or are not"?

3. In Section 5.3 monitor action, it would be nice if you showed how OCF_FAILED_MASTER is returned.

4. Sections 5.8 migrate_to action and 5.9 migrate_from action, do these apply to master/slave resources also or only to primitive resources?

5. Section 5.10 notify action, I think you want to add a note/reference to the Pacemaker Configuration Explained section 10.3.3.9 Proper Interpretation of Notification Environment Variables. (Section name may be different as I was looking at 1.0 from about a year ago).

6. Section 8.4 Specifying a master preference, starting in at least Pacemaker 1.0.9.1 it is possible to specify a negative master score. I think it would be good to add this to the example as well as a note about which version has this functionality since it was broken in 1.0.6.

Thanks, Bob

- Original Message From: Florian Haas florian.h...@linbit.com
To: High-Availability Linux Development List linux-ha-...@lists.linux-ha.org; The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org; cluster-de...@redhat.com
Sent: Fri, December 3, 2010 1:46:29 AM
Subject: [Pacemaker] OCF RA dev guide: final heads up

Folks, I've heard a few positive and zero negative reviews about the current OCF resource agent dev guide draft, so I intend to publish a first released version early next week. It's going to go up on the linux-ha.org web site initially, and will stay there until it finds a better home. If anyone has objections, please let me know. The current draft is here:

http://people.linbit.com/~florian/ra-dev-guide/ (HTML)
http://people.linbit.com/~florian/ra-dev-guide.pdf (PDF)

Cheers, Florian

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
[Pacemaker] (no subject)
Lunch this week?

Sent from Yahoo! Mail on Android

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
[Pacemaker] Question about fix for bug 2477
I am using 1.0.9.1 of Pacemaker. I have applied the fix for bug 2477 and it is not working for me. I started with this:

# crm_mon -n -1
Last updated: Mon Nov 8 09:49:07 2010
Stack: Heartbeat
Current DC: mgraid-mkp9010repk-0 (f4e5e15c-d06b-4e37-89b9-4621af05128f) - partition with quorum
Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
2 Nodes configured, unknown expected votes
4 Resources configured.

Node mgraid-mkp9010repk-0 (f4e5e15c-d06b-4e37-89b9-4621af05128f): online
  SSMKP9010REPK:0 (ocf::omneon:ss) Master
  icms:0 (lsb:S53icms) Started
  mgraid-stonith:0 (stonith:external/mgpstonith) Started
  omserver:0 (lsb:S49omserver) Started
Node mgraid-mkp9010repk-1 (856c1f72-7cd1-4906-8183-8be87eef96f2): online
  omserver:1 (lsb:S49omserver) Started
  SSMKP9010REPK:1 (ocf::omneon:ss) Slave
  icms:1 (lsb:S53icms) Started
  mgraid-stonith:1 (stonith:external/mgpstonith) Started

This is the output I received:

# ./crm_resource -r ms-SSMKP9010REPK -W
resource ms-SSMKP9010REPK is running on: mgraid-mkp9010repk-0
resource ms-SSMKP9010REPK is running on: mgraid-mkp9010repk-1

The bug fix adds this check:

    if((the_rsc->variant == pe_native) && (the_rsc->role == RSC_ROLE_MASTER)) {
        state = "Master";
    }
    fprintf(stdout, "resource %s is running on: %s %s\n", rsc, node->details->uname, state);

When I dump the_rsc with the debugger I see that the_rsc->variant is pe_master and not pe_native. Also, the_rsc->role is RSC_ROLE_STOPPED. This is even if I use the original crm_resource.c. The complete dump of the the_rsc structure is:

(gdb) print *the_rsc
$2 = {id = 0x64d260 "ms-SSMKP9010REPK", clone_name = 0x0, long_name = 0x64d280 "ms-SSMKP9010REPK", xml = 0x634ca0, ops_xml = 0x0, parent = 0x0, variant_opaque = 0x64d6a0, variant = pe_master, fns = 0x7f8496b67f00, cmds = 0x0, recovery_type = recovery_stop_start, restart_type = pe_restart_ignore, priority = 0, stickiness = 0, sort_index = 0, failure_timeout = 0, effective_priority = 0, migration_threshold = 100, flags = 262418, rsc_cons_lhs = 0x0, rsc_cons = 0x0, rsc_location = 0x0, actions = 0x0, allocated_to = 0x0, running_on = 0x658060, known_on = 0x0, allowed_nodes = 0x60e2c0, role = RSC_ROLE_STOPPED, next_role = RSC_ROLE_MASTER, meta = 0x648990, parameters = 0x648940, children = 0x610280}

Any idea why this can happen? Is there another fix I need for 1.0.9.1 to make this change work?

Thanks, Bob

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Best way to find master node
Thanks - filed as 2477 Thanks, Bob - Original Message From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Wed, August 25, 2010 11:22:24 PM Subject: Re: [Pacemaker] Best way to find master node On Wed, Aug 25, 2010 at 6:39 PM, Bob Schatz bsch...@yahoo.com wrote: Yes it does. Ok. Could you create a bugzilla for this please? I'll make sure it gets fixed. Here is output from a different cluster which is at the same 1.0.9.1 Pacemaker version. # crm_mon -n -1 Last updated: Wed Aug 25 09:35:51 2010 Stack: Heartbeat Current DC: mg-wd-wcaw30021216-0 (f4e5e15c-d06b-4e37-89b9-4621af05128f) - partition with quorum Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677 2 Nodes configured, unknown expected votes 2 Resources configured. Node mg-wd-wcaw30021216-0 (f4e5e15c-d06b-4e37-89b9-4621af05128f): online SSWD-WCAW30021216:0 (ocf::omneon:ss) Master SSWD-WCAW30021767:0 (ocf::omneon:ss) Slave Node mg-wd-wcaw30021216-1 (856c1f72-7cd1-4906-8183-8be87eef96f2): online SSWD-WCAW30021767:1 (ocf::omneon:ss) Master SSWD-WCAW30021216:1 (ocf::omneon:ss) Slave [r...@mg-wd-wcaw30021216-0 ~]# crm resource show INFO: building help index Master/Slave Set: ms-SSWD-WCAW30021216 Masters: [ mg-wd-wcaw30021216-0 ] Slaves: [ mg-wd-wcaw30021216-1 ] Master/Slave Set: ms-SSWD-WCAW30021767 Masters: [ mg-wd-wcaw30021216-1 ] Slaves: [ mg-wd-wcaw30021216-0 ] [r...@mg-wd-wcaw30021216-0 ~]# crm resource status ms-SSWD-WCAW30021216 resource ms-SSWD-WCAW30021216 is running on: mg-wd-wcaw30021216-0 resource ms-SSWD-WCAW30021216 is running on: mg-wd-wcaw30021216-1 [r...@mg-wd-wcaw30021216-0 ~]# crm resource status ms-SSWD-WCAW30021767 resource ms-SSWD-WCAW30021767 is running on: mg-wd-wcaw30021216-0 resource ms-SSWD-WCAW30021767 is running on: mg-wd-wcaw30021216-1 [r...@mg-wd-wcaw30021216-0 ~]# Thanks, Bob - Original Message From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Tue, August 24, 2010 11:38:25 PM Subject: Re: [Pacemaker] Best way to find master node On Tue, Aug 24, 2010 at 12:37 AM, Bob Schatz bsch...@yahoo.com wrote: I would like to find the master node for a resource. On 1.0.9.1, when I do: # crm resource status ms-SSWD-WCAW30019072 resource ms-SSWD-WCAW30019072 is running on: box-0 resource ms-SSWD-WCAW30019072 is running on: box-1 This does not tell me if it is master or slave. Does crm_mon indicate that either have been promoted to master? I found this thread: http://www.gossamer-threads.com/lists/linuxha/pacemaker/60434?search_string=crm_resource%20master%20;#60434 4 but I could not find a bug filed. Can I file a bug for this? Would it be on crm_resource? Is there any workaround for this? 
Thanks, Bob

___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Best way to find master node
Yes it does. Here is output from a different cluster which is at the same 1.0.9.1 Pacemaker version. # crm_mon -n -1 Last updated: Wed Aug 25 09:35:51 2010 Stack: Heartbeat Current DC: mg-wd-wcaw30021216-0 (f4e5e15c-d06b-4e37-89b9-4621af05128f) - partition with quorum Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677 2 Nodes configured, unknown expected votes 2 Resources configured. Node mg-wd-wcaw30021216-0 (f4e5e15c-d06b-4e37-89b9-4621af05128f): online SSWD-WCAW30021216:0 (ocf::omneon:ss) Master SSWD-WCAW30021767:0 (ocf::omneon:ss) Slave Node mg-wd-wcaw30021216-1 (856c1f72-7cd1-4906-8183-8be87eef96f2): online SSWD-WCAW30021767:1 (ocf::omneon:ss) Master SSWD-WCAW30021216:1 (ocf::omneon:ss) Slave [r...@mg-wd-wcaw30021216-0 ~]# crm resource show INFO: building help index Master/Slave Set: ms-SSWD-WCAW30021216 Masters: [ mg-wd-wcaw30021216-0 ] Slaves: [ mg-wd-wcaw30021216-1 ] Master/Slave Set: ms-SSWD-WCAW30021767 Masters: [ mg-wd-wcaw30021216-1 ] Slaves: [ mg-wd-wcaw30021216-0 ] [r...@mg-wd-wcaw30021216-0 ~]# crm resource status ms-SSWD-WCAW30021216 resource ms-SSWD-WCAW30021216 is running on: mg-wd-wcaw30021216-0 resource ms-SSWD-WCAW30021216 is running on: mg-wd-wcaw30021216-1 [r...@mg-wd-wcaw30021216-0 ~]# crm resource status ms-SSWD-WCAW30021767 resource ms-SSWD-WCAW30021767 is running on: mg-wd-wcaw30021216-0 resource ms-SSWD-WCAW30021767 is running on: mg-wd-wcaw30021216-1 [r...@mg-wd-wcaw30021216-0 ~]# Thanks, Bob - Original Message From: Andrew Beekhof and...@beekhof.net To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org Sent: Tue, August 24, 2010 11:38:25 PM Subject: Re: [Pacemaker] Best way to find master node On Tue, Aug 24, 2010 at 12:37 AM, Bob Schatz bsch...@yahoo.com wrote: I would like to find the master node for a resource. On 1.0.9.1, when I do: # crm resource status ms-SSWD-WCAW30019072 resource ms-SSWD-WCAW30019072 is running on: box-0 resource ms-SSWD-WCAW30019072 is running on: box-1 This does not tell me if it is master or slave. Does crm_mon indicate that either have been promoted to master? I found this thread: http://www.gossamer-threads.com/lists/linuxha/pacemaker/60434?search_string=crm_resource%20master%20;#60434 4 but I could not find a bug filed. Can I file a bug for this? Would it be on crm_resource? Is there any workaround for this? Thanks, Bob ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
[Pacemaker] Best way to find master node
I would like to find the master node for a resource. On 1.0.9.1, when I do:

# crm resource status ms-SSWD-WCAW30019072
resource ms-SSWD-WCAW30019072 is running on: box-0
resource ms-SSWD-WCAW30019072 is running on: box-1

This does not tell me if it is master or slave.

I found this thread:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/60434?search_string=crm_resource%20master%20;#60434
but I could not find a bug filed. Can I file a bug for this? Would it be on crm_resource?

Is there any workaround for this?

Thanks, Bob
Re: [Pacemaker] Pacemaker 1.0.8 and -INFINITY master score
Dejan,

Thanks for the quick response! Comments below with [BS].

- Original Message -
From: Dejan Muhamedagic deja...@fastmail.fm
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Sent: Fri, August 13, 2010 3:03:49 AM
Subject: Re: [Pacemaker] Pacemaker 1.0.8 and -INFINITY master score

Hi,

On Thu, Aug 12, 2010 at 12:54:10PM -0700, Bob Schatz wrote:
> I upgraded to Pacemaker 1.0.8 since my application consists of Master/Slave
> resources and I wanted to pick up the fix for setting negative master scores.

Why not to 1.0.9.1?

[BS] Because when I checked the link http://www.clusterlabs.org/wiki/Get_Pacemaker
it seemed to indicate that 1.0.8 was the last one fully tested with the rest of the
stack (heartbeat, etc). Are there fixes in 1.0.9 in this area? I generally am
conservative about upgrading HA software until it has been out a while. :)

> I am now able to set negative master scores when a resource starts and is SLAVE
> but can't be promoted. (The reason I want this is that the process needs an
> administrative override and has to be up and running for the administrative
> override.) However, when I test this on a one-node cluster I see that the
> resource loops through the cycle (attempt promote, timeout, stop the resource,
> start the resource, ...). I would have thought that a master score of -INFINITY
> would have prevented the promotion.

Yes, sounds like it. Where is the score set? Wouldn't "resource demote" do the
right thing?

[BS] I used the ~October 2009 DRBD agent as a reference. The crm_master -Q -l
reboot -v -10 call is made at the end of the start() entry point in the agent.
I am not sure what you mean by "resource demote".

> My configuration is:
>
> node $id=f4e5e15c-d06b-4e37-89b9-4621af05128f mgraid-bob1-0
> primitive SSJ5AMKP9010REPK ocf:omneon:ss \
>         params ss_resource=SSJ5AMKP9010REPK ssconf=/var/omneon/config/config.J5AMKP9010REPK \
>         op monitor interval=3s role=Master timeout=7s \
>         op monitor interval=10s role=Slave timeout=7 \
>         op stop interval=0 timeout=100 \
>         op start interval=0 timeout=120 \
>         meta id=SSJ5AMKP9010REPK-meta_attributes
> ms ms-SSJ5AMKP9010REPK SSJ5AMKP9010REPK \
>         meta clone-max=2 notify=true globaally-unique=false

You have a typo here.

[BS] AH!!! I owe you at least one drink for that!!!

Thanks, Bob

Thanks,
Dejan

>         target-role=Master
> location ms-SSJ5AMKP9010REPK-master-w1 ms-SSJ5AMKP9010REPK \
>         rule $id=ms-SSJ5AMKP9010REPK-master-w1-rule $role=master 100: #uname eq mgraid-BOB1-0
> property $id=cib-bootstrap-options \
>         dc-version=1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7 \
>         cluster-infrastructure=Heartbeat \
>         stonith-enabled=false
>
> Is this the expected behavior?
>
> Thanks, Bob
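For context, a minimal sketch of how such an agent might set the negative master preference at the end of start(); only the crm_master invocation is taken from the thread, while the function name, the promotion_blocked check and the positive score are hypothetical, and ocf-shellfuncs is assumed to be sourced:

    ss_start() {
        # ... start the service itself here ...

        if promotion_blocked; then
            # Negative master preference: the instance may run as Slave but should not
            # be promoted. Lifetime "reboot" means the value is forgotten on restart.
            crm_master -Q -l reboot -v -10
        else
            crm_master -Q -l reboot -v 100   # normal positive preference
        fi
        return $OCF_SUCCESS
    }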
[Pacemaker] crm and primitive meta id - 1.0.8 vs 1.0.9
On 1.0.6 and 1.0.8 I used to do this to create a primitive:

crm configure primitive SS1 ocf:omneon:ss params ss_resource=SS1 \
    ssconf=${CONFIG_FILE} op monitor interval=3s role=Master \
    timeout=7s op monitor interval=10s role=Slave timeout=7 \
    op stop timeout=100 op start timeout=120 \
    meta id=SS1-meta_attributes

However, when I do this with 1.0.9, I get this error:

crm configure primitive SS1 ocf:omneon:ss params ss_resource=SS1 ssconf= op monitor interval=3s role=Master timeout=7s op monitor interval=10s role=Slave timeout=7 op stop timeout=100 op start timeout=120 meta id=SS1-meta_attributes
ERROR: SS1: attribute id does not exist

I have to admit that I do not know what meta id=SS1-meta_attributes does. I assume I read about it somewhere but I cannot find the document any longer.

Thanks, Bob
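One thing worth trying on 1.0.9 (a guess at intent, not a confirmed fix) is to drop the trailing meta id=... clause and let the crm shell generate the meta_attributes id itself:

    crm configure primitive SS1 ocf:omneon:ss \
        params ss_resource=SS1 ssconf=${CONFIG_FILE} \
        op monitor interval=3s role=Master timeout=7s \
        op monitor interval=10s role=Slave timeout=7 \
        op stop timeout=100 op start timeout=120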
Re: [Pacemaker] Pacemaker 1.0.8 and -INFINITY master score
Dejan,

I tested this with 1.0.9.1. If a negative master score is set then it does not call promote. Thanks!

However, I still see notification messages, as shown below, but it does not appear that the actual notification entry point is called in the agent. If you agree, I will file a bug.

Aug 13 17:02:20 mgraid-MGG90106M6T-0 pengine: [4463]: info: master_color: ms-SSJ5AMGG90106M6T: Promoted 0 instances of a possible 1 to master
Aug 13 17:03:34 mgraid-MGG90106M6T-0 crmd: [1102]: info: te_rsc_command: Initiating action 44: notify SSJ5AMGG90106M6T:0_pre_notify_promote_0 on mgraid-mgg90106m6t-1
Aug 13 17:02:20 mgraid-MGG90106M6T-0 pengine: [4463]: ERROR: create_notification_boundaries: Creating boundaries for ms-SSJ5AMGG90106M6T
Aug 13 17:02:20 mgraid-MGG90106M6T-0 pengine: [4463]: ERROR: create_notification_boundaries: Creating boundaries for ms-SSJ5AMGG90106M6T
Aug 13 17:02:20 mgraid-MGG90106M6T-0 pengine: [4463]: ERROR: create_notification_boundaries: Creating boundaries for ms-SSJ5AMGG90106M6T
Aug 13 17:02:20 mgraid-MGG90106M6T-0 pengine: [4463]: ERROR: create_notification_boundaries: Creating boundaries for ms-SSJ5AMGG90106M6T
Aug 13 17:03:38 mgraid-MGG90106M6T-0 crmd: [1102]: info: te_rsc_command: Initiating action 8: promote SSJ5AMGG90106M6T:0_promote_0 on mgraid-mgg90106m6t-1
Aug 13 17:03:40 mgraid-MGG90106M6T-0 crmd: [1102]: info: match_graph_event: Action SSJ5AMGG90106M6T:0_promote_0 (8) confirmed on mgraid-mgg90106m6t-1 (rc=0)
Aug 13 17:03:41 mgraid-MGG90106M6T-0 crmd: [1102]: info: te_rsc_command: Initiating action 45: notify SSJ5AMGG90106M6T:0_post_notify_promote_0 on mgraid-mgg90106m6t-1
Aug 13 17:03:43 mgraid-MGG90106M6T-0 crmd: [1102]: info: match_graph_event: Action SSJ5AMGG90106M6T:0_post_notify_promote_0 (45) confirmed on mgraid-mgg90106m6t-1 (rc=0)

Thanks, Bob
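For reference, a minimal sketch of the notify entry point a master/slave agent would implement, so "the actual notification entry point" above has a concrete shape (generic OCF conventions; the function name is hypothetical and this is not the real omneon:ss agent):

    ss_notify() {
        # Pacemaker describes the notification in environment variables.
        local type_op="${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}"
        case "$type_op" in
            pre-promote)
                ocf_log info "pre-promote: ${OCF_RESKEY_CRM_meta_notify_promote_uname} is about to be promoted"
                ;;
            post-promote)
                ocf_log info "post-promote: promotion completed"
                ;;
        esac
        return $OCF_SUCCESS
    }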
[Pacemaker] Pacemaker 1.0.8 and -INFINITY master score
I upgraded to Pacemaker 1.0.8 since my application consists of Master/Slave resources and I wanted to pick up the fix for setting negative master scores.

I am now able to set negative master scores when a resource starts and is SLAVE but can't be promoted. (The reason I want this is that the process needs an administrative override and has to be up and running for the administrative override.) However, when I test this on a one-node cluster I see that the resource loops through the cycle (attempt promote, timeout, stop the resource, start the resource, ...). I would have thought that a master score of -INFINITY would have prevented the promotion.

My configuration is:

node $id=f4e5e15c-d06b-4e37-89b9-4621af05128f mgraid-bob1-0
primitive SSJ5AMKP9010REPK ocf:omneon:ss \
        params ss_resource=SSJ5AMKP9010REPK ssconf=/var/omneon/config/config.J5AMKP9010REPK \
        op monitor interval=3s role=Master timeout=7s \
        op monitor interval=10s role=Slave timeout=7 \
        op stop interval=0 timeout=100 \
        op start interval=0 timeout=120 \
        meta id=SSJ5AMKP9010REPK-meta_attributes
ms ms-SSJ5AMKP9010REPK SSJ5AMKP9010REPK \
        meta clone-max=2 notify=true globaally-unique=false target-role=Master
location ms-SSJ5AMKP9010REPK-master-w1 ms-SSJ5AMKP9010REPK \
        rule $id=ms-SSJ5AMKP9010REPK-master-w1-rule $role=master 100: #uname eq mgraid-BOB1-0
property $id=cib-bootstrap-options \
        dc-version=1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7 \
        cluster-infrastructure=Heartbeat \
        stonith-enabled=false

Is this the expected behavior?

Thanks, Bob
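As Dejan points out in the reply earlier in this thread, the ms line contains a typo: the intended attribute name is globally-unique. For reference, the corrected stanza would read:

    ms ms-SSJ5AMKP9010REPK SSJ5AMKP9010REPK \
            meta clone-max=2 notify=true globally-unique=false target-role=Master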