[Pacemaker] [Question] About the stop order at the time of the Probe error.
Hi All,

We found a problem with the stop order at the time of a probe error. The resource configuration is the following simple one:

Last updated: Wed Aug 22 15:19:50 2012
Stack: Heartbeat
Current DC: drbd1 (6081ac99-d941-40b9-a4a3-9f996ff291c0) - partition with quorum
Version: 1.0.12-c6770b8
1 Nodes configured, unknown expected votes
1 Resources configured.

Online: [ drbd1 ]

 Resource Group: grpTest
     resource1  (ocf::pacemaker:Dummy): Started drbd1
     resource2  (ocf::pacemaker:Dummy): Started drbd1
     resource3  (ocf::pacemaker:Dummy): Started drbd1
     resource4  (ocf::pacemaker:Dummy): Started drbd1

Node Attributes:
* Node drbd1:

Migration summary:
* Node drbd1:

Depending on which resource the probe error occurs for, the resources are not stopped in reverse order. I confirmed it with the following procedure.

Step 1) Make resource2 and resource4 look already started.

 [root@drbd1 ~]# touch /var/run/Dummy-resource2.state
 [root@drbd1 ~]# touch /var/run/Dummy-resource4.state

Step 2) Start the node and load the cib.

Step 3) resource2 and resource4 are stopped, but not in reverse order.
(snip)
Aug 22 15:19:47 drbd1 pengine: [32722]: notice: group_print:  Resource Group: grpTest
Aug 22 15:19:47 drbd1 pengine: [32722]: notice: native_print:     resource1#011(ocf::pacemaker:Dummy):#011Stopped
Aug 22 15:19:47 drbd1 pengine: [32722]: notice: native_print:     resource2#011(ocf::pacemaker:Dummy):#011Started drbd1
Aug 22 15:19:47 drbd1 pengine: [32722]: notice: native_print:     resource3#011(ocf::pacemaker:Dummy):#011Stopped
Aug 22 15:19:47 drbd1 pengine: [32722]: notice: native_print:     resource4#011(ocf::pacemaker:Dummy):#011Started drbd1
(snip)
Aug 22 15:19:47 drbd1 crmd: [32719]: info: te_rsc_command: Initiating action 6: stop resource2_stop_0 on drbd1 (local)
Aug 22 15:19:47 drbd1 crmd: [32719]: info: do_lrm_rsc_op: Performing key=6:2:0:5c924067-0d20-48fd-9772-88e530661270 op=resource2_stop_0 )
Aug 22 15:19:47 drbd1 lrmd: [32716]: info: rsc:resource2 stop[6] (pid 32745)
Aug 22 15:19:47 drbd1 crmd: [32719]: info: te_rsc_command: Initiating action 11: stop resource4_stop_0 on drbd1 (local)
Aug 22 15:19:47 drbd1 crmd: [32719]: info: do_lrm_rsc_op: Performing key=11:2:0:5c924067-0d20-48fd-9772-88e530661270 op=resource4_stop_0 )
Aug 22 15:19:47 drbd1 lrmd: [32716]: info: rsc:resource4 stop[7] (pid 32746)
Aug 22 15:19:47 drbd1 lrmd: [32716]: info: operation stop[6] on resource2 for client 32719: pid 32745 exited with return code 0
(snip)

I know that the cause of this stop order is the implicit ordering within the group. In this case, our user definitely wants the resources stopped in reverse order:

 * resource4_stop -> resource2_stop

The stop order is important for our user's resources, so I have the following questions.

Question 1) Is there a setting in cib.xml that avoids this problem?

Question 2) Does this problem also occur in Pacemaker 1.1?

Question 3) I added the following order constraints:

 <rsc_order id="order-2" first="resource1" then="resource3"/>
 <rsc_order id="order-3" first="resource1" then="resource4"/>
 <rsc_order id="order-5" first="resource2" then="resource4"/>

Adding these constraints seems to solve the problem.
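For reference (a sketch added here, not part of the original mail), the same three constraints expressed in crm shell syntax, assuming the default mandatory (INFINITY) score, would be:

```
order order-2 inf: resource1 resource3
order order-3 inf: resource1 resource4
order order-5 inf: resource2 resource4
```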
Is adding these order constraints also a valid way to solve the problem?

Best Regards,
Hideo Yamauchi.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Unexpected resource restarts when node comes online
I can't see this happening, but I'll have a go playing with these values tomorrow. I'll let you know how it goes.

Thanks

Gareth

On 21/08/2012 16:56, Jake Smith jsm...@argotec.com wrote:

----- Original Message -----
From: Gareth Davis gareth.da...@ipaccess.com
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Sent: Tuesday, August 21, 2012 11:28:53 AM
Subject: Re: [Pacemaker] Unexpected resource restarts when node comes online

[Gareth Davis:] From the documentation it seems that the default is actually interleave=true... which I think is the desired setting, i.e. only wait for the local instance rather than all the clones. I've tried with interleave=true and false... neither seems to be the cause of the problem. I'll continue with interleave=true on all clones.

I've been playing around with ptest, and I think the fs1_group is being restarted, which in turn restarts NOSFileSystemCluster etc.

[Jake Smith:] I know it's pretty obvious, but the location of your DRBD masters doesn't change between standby and online, does it? I was thinking of a score problem between stickiness and placement/advisory location, maybe...

Jake

[Gareth Davis:] Gareth

On 21/08/2012 15:40, David Vossel dvos...@redhat.com wrote:

----- Original Message -----
From: Gareth Davis gareth.da...@ipaccess.com
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Sent: Tuesday, August 21, 2012 9:01:39 AM
Subject: [Pacemaker] Unexpected resource restarts when node comes online

[Gareth Davis:] Hi,

A quick bit of background: I recently updated from Pacemaker 1.0 to 1.1.5 because of an issue where cloned resources were restarted unexpectedly when any of the nodes went into standby or failed (https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2153); 1.1.5 certainly fixes that issue. But now that I've got it all up and running, I've noticed that returning a node from standby to online triggers a restart of my application server.

[David Vossel:] I took a quick look at your config. My guess is that the following order constraint is causing the restart of NOSServiceManager0 when the node comes back on:

order order_NOSServiceManager0_after_NOSFileSystemCluster inf: NOSFileSystemCluster NOSServiceManager0

I'm thinking the interleave clone resource option might help with this.

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch10s02s02.html

-- Vossel

[Gareth Davis, original message continues:] I'm afraid the config is complex, involving a couple of DRBD pairs, four clones, and a glassfish application server (NOSServiceManager0). Output of crm configure show:

https://dl.dropbox.com/u/5427964/config.txt

There are 2 nodes in the cluster (oamdev-vm11, oamdev-vm12); all the non-cloned resources are running on oamdev-vm12. Putting oamdev-vm11 into standby causes nothing unexpected, but bringing it back online causes NOSServiceManager0 to be stopped and started.

crm_report output (the time span should include the standby and online events):

https://dl.dropbox.com/u/5427964/pcmk-Tue-21-Aug-2012.tar.bz2

I'm at a bit of a loss as to how to debug this. I suspect I've messed up the ordering in some way; any pointers?

Gareth Davis

This message contains confidential information and may be privileged. If you are not the intended recipient, please notify the sender and delete the message immediately.
ip.access Ltd, registration number 3400157, Building 2020, Cambourne Business Park, Cambourne, Cambridge CB23 6DW, United Kingdom
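As an aside on the interleave option discussed in this thread: it is set as clone metadata. A minimal sketch in crm shell syntax (the resource names and parameters here are illustrative, not taken from the actual config):

```
primitive p_fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd0 directory=/mnt/data fstype=ext4
clone cl_fs p_fs \
        meta interleave=true
```

With interleave=true, a resource ordered after the clone on a given node waits only for the local clone instance to start, not for every instance cluster-wide.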
[Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
Hello,

I have a 3 node Pacemaker + Heartbeat cluster (two real nodes and 1 quorum node that cannot run resources) running on Ubuntu 12.04 Server amd64. This cluster has a DRBD resource that it mounts and then runs a KVM virtual machine from. I have configured the cluster to use ocf:pacemaker:ping with two other devices on the network (192.168.0.128, 192.168.0.129), and set constraints to move the resources to the most well-connected node (whichever node can see more of these two devices):

primitive p_ping ocf:pacemaker:ping \
        params name=p_ping host_list="192.168.0.128 192.168.0.129" multiplier=1000 attempts=8 debug=true \
        op start interval=0 timeout=60 \
        op monitor interval=10s timeout=60
...
clone cl_ping p_ping \
        meta interleave=true
...
location loc_run_on_most_connected g_vm \
        rule $id=loc_run_on_most_connected-rule p_ping: defined p_ping

Today, 192.168.0.128's network cable was unplugged for a few seconds and then plugged back in. During this time, Pacemaker recognized that it could not ping 192.168.0.128 and restarted all of the resources, but left them on the same node.

My understanding was that since neither node could ping 192.168.0.128 during this period, Pacemaker would do nothing with the resources (leave them running). It would only migrate or restart the resources if, for example, node2 could ping 192.168.0.128 but node1 could not (move the resources to where things are better-connected). Is this understanding incorrect? If so, is there a way I can change my configuration so that it will only restart/migrate resources if one node is found to be better connected? Can you tell me why these resources were restarted?

I have attached the syslog as well as my full CIB configuration.
Thanks,

Andrew Martin

Aug 22 10:40:31 node1 ping[1668]: [1823]: WARNING: 192.168.0.128 is inactive: PING 192.168.0.128 (192.168.0.128) 56(84) bytes of data.#012#012--- 192.168.0.128 ping statistics ---#0128 packets transmitted, 0 received, 100% packet loss, time 7055ms
Aug 22 10:40:38 node1 attrd_updater: [1860]: info: Invoked: attrd_updater -n p_ping -v 1000 -d 5s
Aug 22 10:40:43 node1 attrd: [4402]: notice: attrd_trigger_update: Sending flush op to all hosts for: p_ping (1000)
Aug 22 10:40:44 node1 attrd: [4402]: notice: attrd_perform_update: Sent update 265: p_ping=1000
Aug 22 10:40:44 node1 crmd: [4403]: info: abort_transition_graph: te_update_diff:164 - Triggered transition abort (complete=1, tag=nvpair, id=status-1ab0690c-5aa0-4d9c-ae4e-b662e0ca54e5-p_ping, name=p_ping, value=1000, magic=NA, cib=0.121.49) : Transient attribute: update
Aug 22 10:40:44 node1 crmd: [4403]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 22 10:40:44 node1 crmd: [4403]: info: do_state_transition: All 3 cluster nodes are eligible to run resources.
Aug 22 10:40:44 node1 crmd: [4403]: info: do_pe_invoke: Query 1023: Requesting the current CIB: S_POLICY_ENGINE
Aug 22 10:40:44 node1 crmd: [4403]: info: do_pe_invoke_callback: Invoking the PE: query=1023, ref=pe_calc-dc-1345650044-1095, seq=130, quorate=1
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Hard error - p_drbd_mount1:0_last_failure_0 failed with rc=5: Preventing ms_drbd_tools from re-starting on quorum
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Hard error - p_drbd_vmstore:0_last_failure_0 failed with rc=5: Preventing ms_drbd_vmstore from re-starting on quorum
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Hard error - p_vm_myvm_last_failure_0 failed with rc=5: Preventing p_vm_myvm from re-starting on quorum
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Hard error - p_drbd_mount2:0_last_failure_0 failed with rc=5: Preventing ms_drbd_crm from re-starting on quorum
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Operation p_drbd_vmstore:0_last_failure_0 found resource p_drbd_vmstore:0 active on node1
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Operation p_drbd_mount2:0_last_failure_0 found resource p_drbd_mount2:0 active on node1
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node1
Aug 22 10:40:44 node1 pengine: [13079]: notice: RecurringOp:  Start recurring monitor (20s) for p_drbd_mount2:0 on node1
Aug 22 10:40:44 node1 pengine: [13079]: notice: RecurringOp:  Start recurring monitor (10s) for p_drbd_mount2:1 on node2
Aug 22 10:40:44 node1 pengine: [13079]: notice: RecurringOp:  Start recurring monitor (20s) for p_drbd_mount2:0 on node1
Aug 22 10:40:44 node1 pengine: [13079]: notice: RecurringOp:  Start recurring monitor (10s) for p_drbd_mount2:1 on node2
Aug 22 10:40:44 node1 pengine: [13079]: notice: RecurringOp:  Start recurring monitor (20s) for p_drbd_mount1:0 on node1
Aug 22 10:40:44 node1 pengine: [13079]: notice: RecurringOp:  Start recurring
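On the question of only moving resources when one node is better connected: a pattern that may help (a hedged sketch in crm shell syntax, untested against this particular configuration) is to ban only nodes with no reachable ping targets, instead of using the raw attribute value as the score:

```
# Ban a node only if its ping attribute is undefined or zero.
# A target that disappears from BOTH nodes' view lowers both
# nodes' attributes equally, so nothing is forced to move or restart.
location loc_run_on_most_connected g_vm \
        rule -inf: not_defined p_ping or p_ping lte 0
```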
[Pacemaker] Issues with HA cluster for mysqld
Hello,

I'm trying to set up a 2-node, active-passive HA cluster for MySQL using heartbeat and Pacemaker. The operating system is Debian Linux 6.0.5 64-bit, and I am using the heartbeat packages installed via apt-get. The servers involved are the SQL nodes of a running MySQL cluster, so the only service I need HA for is the MySQL daemon (mysqld). What I would like to do is have a single virtual IP address which clients use to query MySQL, and have the IP and mysqld fail over to the passive node in the event of a failure on the active node.

I have read through a lot of the heartbeat and Pacemaker documentation, and here are the resources I have configured for the cluster:

* A custom LSB script for mysqld (compliant with Pacemaker's requirements as outlined in the documentation)
* An iLO2-based STONITH device using riloe (both servers are HP ProLiant DL380 G5)
* A virtual IP address for mysqld using IPaddr2

I believe I have configured everything correctly, but I'm not positive. Anyway, when I start heartbeat and Pacemaker (/etc/init.d/heartbeat start), everything seems to be OK. However, the virtual IP never comes up, and the output of crm_resource -LV indicates that something is wrong:

root@ha1:~# crm_resource -LV
crm_resource[28988]: 2012/08/22_14:41:23 WARN: unpack_rsc_op: Processing failed op stonith_start_0 on ha1: unknown error (1)
 stonith        (stonith:external/riloe) Started
 MysqlIP        (ocf::heartbeat:IPaddr2) Stopped
 mysqld (lsb:mysqld) Started

When I attempt to stop heartbeat and Pacemaker (/etc/init.d/heartbeat stop), it says "Stopping High-Availability services:" and then hangs for about 5 minutes before finally stopping the services.

So, I'm left with a couple of questions. Is there something wrong with my configuration? Is there a reason why the HA services can't shut down in a timely manner? Is there something else I need to do to get the virtual IP working?

Thanks in advance for any help!

- Dave

P.S.
My full config as reported by cibadmin --query is as follows (iLO2 password removed):

<cib validate-with="pacemaker-1.0" crm_feature_set="3.0.1" have-quorum="1" admin_epoch="0" epoch="26" num_updates="8" cib-last-written="Wed Aug 22 11:16:59 2012" dc-uuid="1b48f410-44d1-4e89-8b52-ff23b32db1bc">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="Heartbeat"/>
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="true"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="1b48f410-44d1-4e89-8b52-ff23b32db1bc" uname="ha1" type="normal"/>
      <node id="9790fe6e-67b2-4817-abf4-966b5aa6948c" uname="ha2" type="normal"/>
    </nodes>
    <resources>
      <primitive class="stonith" id="stonith" type="external/riloe">
        <instance_attributes id="stonith-instance_attributes">
          <nvpair id="stonith-instance_attributes-hostlist" name="hostlist" value="ha2"/>
          <nvpair id="stonith-instance_attributes-ilo_hostname" name="ilo_hostname" value="10.0.1.112"/>
          <nvpair id="stonith-instance_attributes-ilo_user" name="ilo_user" value="Administrator"/>
          <nvpair id="stonith-instance_attributes-ilo_password" name="ilo_password" value=""/>
          <nvpair id="stonith-instance_attributes-ilo_can_reset" name="ilo_can_reset" value="1"/>
          <nvpair id="stonith-instance_attributes-ilo_protocol" name="ilo_protocol" value="2"/>
          <nvpair id="stonith-instance_attributes-ilo_powerdown_method" name="ilo_powerdown_method" value="button"/>
        </instance_attributes>
      </primitive>
      <primitive class="ocf" id="MysqlIP" provider="heartbeat" type="IPaddr2">
        <instance_attributes id="MysqlIP-instance_attributes">
          <nvpair id="MysqlIP-instance_attributes-ip" name="ip" value="192.168.25.9"/>
          <nvpair id="MysqlIP-instance_attributes-cidr_netmask" name="cidr_netmask" value="32"/>
        </instance_attributes>
        <operations>
          <op id="MysqlIP-monitor-30s" interval="30s" name="monitor"/>
        </operations>
      </primitive>
      <primitive id="mysqld" class="lsb" type="mysqld"/>
    </resources>
    <constraints/>
    <rsc_defaults/>
    <op_defaults/>
  </configuration>
  <status>
    <node_state id="1b48f410-44d1-4e89-8b52-ff23b32db1bc" uname="ha1" ha="active" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="do_update_resource" shutdown="0">
      <lrm id="1b48f410-44d1-4e89-8b52-ff23b32db1bc">
        <lrm_resources>
          <lrm_resource id="stonith" type="external/riloe" class="stonith">
            <lrm_rsc_op id="stonith_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-key="4:0:7:c09f049e-ed06-4d25-bc48-143a70b97e44" transition-magic="0:7;4:0:7:c09f049e-ed06-4d25-bc48-143a70b97e44" call-id="2" rc-code="7" op-status="0" interval="0" last-run="1345660607" last-rc-change="1345660607" exec-time="0" queue-time="0" op-digest="c9a588fa10b441aa64c0a83229e8f3e1"/>
            <lrm_rsc_op id="stonith_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-key="4:2:0:c09f049e-ed06-4d25-bc48-143a70b97e44" transition-magic="0:1;4:2:0:c09f049e-ed06-4d25-bc48-143a70b97e44
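One detail that stands out in the config above is the empty constraints section: nothing ties MysqlIP to mysqld, so Pacemaker is free to place them independently. A minimal sketch of what could go there (hedged; the ids are illustrative, and the failed stonith_start_0 shown in the crm_resource output would still need investigating separately):

```xml
<constraints>
  <!-- keep the virtual IP on whichever node runs mysqld -->
  <rsc_colocation id="col-MysqlIP-with-mysqld" score="INFINITY"
                  rsc="MysqlIP" with-rsc="mysqld"/>
  <!-- start mysqld before bringing up the IP -->
  <rsc_order id="ord-mysqld-before-MysqlIP" score="INFINITY"
             first="mysqld" then="MysqlIP"/>
</constraints>
```

Equivalently, the two primitives could be placed in a group, which implies the same colocation and ordering.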
Re: [Pacemaker] stdout reserved word
Can you show us the gcc command line and the errors that are produced?

On Wed, Aug 22, 2012 at 3:20 AM, Grüninger, Andreas (LGL Extern) andreas.gruenin...@lgl.bwl.de wrote:

Hello

I compiled Pacemaker 1.1.7 with gcc 4.5 on Solaris 11 and with gcc 4.6 on OpenIndiana 151a6. I had to change the following:

perl -pi -e 's#stdout#stdoutx#' include/crm/stonith-ng-internal.h
perl -pi -e 's#stdout#stdoutx#' lib/fencing/st_client.c
perl -pi -e 's#stdout#stdoutx#' fencing/commands.c

Am I missing a compiler flag that would accept stdout as a variable name?

In tools/crm_mon.c I added the line marked with '+', because sighandler_t is not defined:

 #if CURSES_ENABLED
+typedef void (*sighandler_t)(int);
 static sighandler_t ncurses_winch_handler;

Also, the test for a compatible printw function fails with an error message like "your ncurses is too old, we need 5.4". The version of the installed ncurses library is 5.7. Between autogen.sh and gmake I appended this to include/config.h, which overrides the result of the erroneously failing test:

echo '#undef HAVE_INCOMPATIBLE_PRINTW' >> include/config.h

crm and crm_mon work without any problem. Should I send patches via the list or report a bug?

Andreas