[Pacemaker] [Question] About the stop order at the time of the Probe error.

2012-08-22 Thread renayama19661014
Hi All,

We found a problem that occurs at the time of a probe error.

We use the following simple resource configuration.


Last updated: Wed Aug 22 15:19:50 2012
Stack: Heartbeat
Current DC: drbd1 (6081ac99-d941-40b9-a4a3-9f996ff291c0) - partition with quorum
Version: 1.0.12-c6770b8
1 Nodes configured, unknown expected votes
1 Resources configured.


Online: [ drbd1 ]

 Resource Group: grpTest
 resource1  (ocf::pacemaker:Dummy): Started drbd1
 resource2  (ocf::pacemaker:Dummy): Started drbd1
 resource3  (ocf::pacemaker:Dummy): Started drbd1
 resource4  (ocf::pacemaker:Dummy): Started drbd1

Node Attributes:
* Node drbd1:

Migration summary:
* Node drbd1: 


Depending on which resource the probe error occurs on, the resources are not 
stopped in reverse order.

I confirmed it with the following procedure.

Step 1) Put resource2 and resource4 into a started state (create the Dummy state files so that the initial probes find them running).

[root@drbd1 ~]# touch /var/run/Dummy-resource2.state
[root@drbd1 ~]# touch /var/run/Dummy-resource4.state

Step 2) Start a node and send cib.

Step 3) resource2 and resource4 are stopped, but not in reverse order.

(snip)
Aug 22 15:19:47 drbd1 pengine: [32722]: notice: group_print:  Resource Group: 
grpTest
Aug 22 15:19:47 drbd1 pengine: [32722]: notice: native_print:  
resource1#011(ocf::pacemaker:Dummy):#011Stopped 
Aug 22 15:19:47 drbd1 pengine: [32722]: notice: native_print:  
resource2#011(ocf::pacemaker:Dummy):#011Started drbd1
Aug 22 15:19:47 drbd1 pengine: [32722]: notice: native_print:  
resource3#011(ocf::pacemaker:Dummy):#011Stopped 
Aug 22 15:19:47 drbd1 pengine: [32722]: notice: native_print:  
resource4#011(ocf::pacemaker:Dummy):#011Started drbd1
(snip)
Aug 22 15:19:47 drbd1 crmd: [32719]: info: te_rsc_command: Initiating action 6: 
stop resource2_stop_0 on drbd1 (local)
Aug 22 15:19:47 drbd1 crmd: [32719]: info: do_lrm_rsc_op: Performing 
key=6:2:0:5c924067-0d20-48fd-9772-88e530661270 op=resource2_stop_0 )
Aug 22 15:19:47 drbd1 lrmd: [32716]: info: rsc:resource2 stop[6] (pid 32745)
Aug 22 15:19:47 drbd1 crmd: [32719]: info: te_rsc_command: Initiating action 
11: stop resource4_stop_0 on drbd1 (local)
Aug 22 15:19:47 drbd1 crmd: [32719]: info: do_lrm_rsc_op: Performing 
key=11:2:0:5c924067-0d20-48fd-9772-88e530661270 op=resource4_stop_0 )
Aug 22 15:19:47 drbd1 lrmd: [32716]: info: rsc:resource4 stop[7] (pid 32746)
Aug 22 15:19:47 drbd1 lrmd: [32716]: info: operation stop[6] on resource2 for 
client 32719: pid 32745 exited with return code 0
(snip)


I understand that this stop order comes from the implicit ordering within the group.

In this case, our user needs the resources to be stopped strictly in reverse order:

 * resource4_stop -> resource2_stop (resource4 must be stopped before resource2)

The stop order is important for our user's resources.


I have the following questions.

Question 1) Is there a setting in cib.xml that avoids this problem?

Question 2) Does this problem also occur in Pacemaker 1.1?

Question 3) I added the following order constraints:


<rsc_order id="order-2" first="resource1" then="resource3"/>
<rsc_order id="order-3" first="resource1" then="resource4"/>
<rsc_order id="order-5" first="resource2" then="resource4"/>

Adding these order constraints seems to solve the problem.
Is adding such order constraints also a correct way to solve it?
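
For comparison, here is a fully explicit chain covering every adjacent pair of
group members, written in crm shell syntax only as a sketch of the same idea
(not something I have confirmed as the proper fix; that is exactly what
Question 3 asks):

# Sketch: explicit pairwise ordering for the whole group. With the default
# symmetrical=true, starts run resource1 -> resource4 and stops run in the
# reverse order, resource4 -> resource1.
order o_1_before_2 inf: resource1 resource2
order o_2_before_3 inf: resource2 resource3
order o_3_before_4 inf: resource3 resource4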


Best Regards,
Hideo Yamauchi.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Unexpected resource restarts when node comes online

2012-08-22 Thread Gareth Davis
I can't see this happening, but I'll have a go at playing with these values
tomorrow.

I'll let you know how it goes.

Thanks
Gareth

On 21/08/2012 16:56, Jake Smith jsm...@argotec.com wrote:




- Original Message -
 From: Gareth Davis gareth.da...@ipaccess.com
 To: The Pacemaker cluster resource manager
pacemaker@oss.clusterlabs.org
 Sent: Tuesday, August 21, 2012 11:28:53 AM
 Subject: Re: [Pacemaker] Unexpected resource restarts when node comes
online
 
 From the documentation it seems that the default is actually
 interleave=true, which is I think the desired setting, i.e. only wait for
 the local instance rather than all the clones. I've tried with
 interleave=true and false; it doesn't seem to be the cause of the problem.
 
 I'll continue with interleave=true on all clones.
 
 I've been playing around with ptest and I think the fs1_group is being
 restarted, which in turn restarts NOSFileSystemCluster etc.
 

I know it's pretty obvious, but the location of your DRBD masters doesn't
change between standby and online, does it?

I was thinking of a score problem between stickiness and a placement/advisory
location constraint, maybe...

Jake

 Gareth
 
 On 21/08/2012 15:40, David Vossel dvos...@redhat.com wrote:
 
 - Original Message -
  From: Gareth Davis gareth.da...@ipaccess.com
  To: The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
  Sent: Tuesday, August 21, 2012 9:01:39 AM
  Subject: [Pacemaker] Unexpected resource restarts when node comes
  online
  
  Hi,
  
  Quick bit of background: I've recently updated from Pacemaker 1.0 to
  1.1.5 because of an issue where cloned resources would be restarted
  unexpectedly when any of the nodes went into standby or failed
  (https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2153);
  1.1.5 certainly fixes this issue.
  
  But now that I've got it all up and running, I've noticed that returning
  a node from standby to online triggers a restart of my application
  server.
 
 I took a quick look at your config.  My guess is that the following order
 constraint is causing the restart of NOSServiceManager0 when the node
 comes back online.
 
 order order_NOSServiceManager0_after_NOSFileSystemCluster inf:
 NOSFileSystemCluster NOSServiceManager0
 
 I'm thinking the interleave clone resource option might help with
 this.
 
 http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch10s02s02.html
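
 For reference, a crm shell sketch of what setting interleave on one of the
 clones might look like (the clone name cl_NOSFileSystemCluster here is only
 an assumption based on the resource names in this thread; substitute your
 actual clone definitions):

 # Sketch only: with interleave=true a dependent resource waits only for the
 # clone instance on its own node, not for every instance in the cluster.
 clone cl_NOSFileSystemCluster NOSFileSystemCluster \
     meta interleave="true"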
 
 -- Vossel
 
  I'm afraid the config is complex, involving a couple of DRBD pairs, four
  clones, and a GlassFish application server, NOSServiceManager0.
  
  Output of crm configure show:
  https://dl.dropbox.com/u/5427964/config.txt
  
  
  There are 2 nodes in the cluster (oamdev-vm11 and oamdev-vm12); all the
  non-cloned resources are running on oamdev-vm12.
  
  On putting oamdev-vm11 into standby nothing unexpected happens, but
  bringing it back online causes NOSServiceManager0 to be stopped and
  started.
  
  crm_report output, the time span should include the standby and
  online
  events.
  https://dl.dropbox.com/u/5427964/pcmk-Tue-21-Aug-2012.tar.bz2
  
  I'm at a bit of a loss as to how to debug this; I suspect I've messed up
  the ordering in some way. Any pointers?
  
  Gareth Davis
  
  
  
  
  
  
 
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org

[Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

2012-08-22 Thread Andrew Martin
Hello, 


I have a 3 node Pacemaker + Heartbeat cluster (two real nodes and 1 quorum node 
that cannot run resources) running on Ubuntu 12.04 Server amd64. This cluster 
has a DRBD resource that it mounts and then runs a KVM virtual machine from. I 
have configured the cluster to use ocf:pacemaker:ping with two other devices on 
the network (192.168.0.128, 192.168.0.129), and set constraints to move the 
resources to the most well-connected node (whichever node can see more of these 
two devices): 

primitive p_ping ocf:pacemaker:ping \
    params name="p_ping" host_list="192.168.0.128 192.168.0.129" \
           multiplier="1000" attempts="8" debug="true" \
    op start interval="0" timeout="60" \
    op monitor interval="10s" timeout="60"
...

clone cl_ping p_ping \
    meta interleave="true"

...
location loc_run_on_most_connected g_vm \
    rule $id="loc_run_on_most_connected-rule" p_ping: defined p_ping


Today, 192.168.0.128's network cable was unplugged for a few seconds and then 
plugged back in. During this time, pacemaker recognized that it could not ping 
192.168.0.128 and restarted all of the resources, but left them on the same 
node. My understanding was that since neither node could ping 192.168.0.128 
during this period, pacemaker would do nothing with the resources (leave them 
running). It would only migrate or restart the resources if for example node2 
could ping 192.168.0.128 but node1 could not (move the resources to where 
things are better-connected). Is this understanding incorrect? If so, is there 
a way I can change my configuration so that it will only restart/migrate 
resources if one node is found to be better connected? 
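
As an aside, the "-d 5s" in the attrd_updater call in the attached syslog
appears to be the ping agent's dampen setting (how long attribute changes are
held before they are flushed). Purely as an illustration, not as something I
have confirmed to help here, the same primitive with a longer dampening window
would look like:

# Sketch only: identical to the primitive above except for dampen="30s".
primitive p_ping ocf:pacemaker:ping \
    params name="p_ping" host_list="192.168.0.128 192.168.0.129" \
           multiplier="1000" dampen="30s" attempts="8" debug="true" \
    op start interval="0" timeout="60" \
    op monitor interval="10s" timeout="60"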

Can you tell me why these resources were restarted? I have attached the syslog 
as well as my full CIB configuration. 

Thanks, 

Andrew Martin

Aug 22 10:40:31 node1 ping[1668]: [1823]: WARNING: 192.168.0.128 is inactive: PING 192.168.0.128 (192.168.0.128) 56(84) bytes of data.#012#012--- 192.168.0.128 ping statistics ---#0128 packets transmitted, 0 received, 100% packet loss, time 7055ms
Aug 22 10:40:38 node1 attrd_updater: [1860]: info: Invoked: attrd_updater -n p_ping -v 1000 -d 5s 
Aug 22 10:40:43 node1 attrd: [4402]: notice: attrd_trigger_update: Sending flush op to all hosts for: p_ping (1000)
Aug 22 10:40:44 node1 attrd: [4402]: notice: attrd_perform_update: Sent update 265: p_ping=1000
Aug 22 10:40:44 node1 crmd: [4403]: info: abort_transition_graph: te_update_diff:164 - Triggered transition abort (complete=1, tag=nvpair, id=status-1ab0690c-5aa0-4d9c-ae4e-b662e0ca54e5-p_ping, name=p_ping, value=1000, magic=NA, cib=0.121.49) : Transient attribute: update
Aug 22 10:40:44 node1 crmd: [4403]: info: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Aug 22 10:40:44 node1 crmd: [4403]: info: do_state_transition: All 3 cluster nodes are eligible to run resources.
Aug 22 10:40:44 node1 crmd: [4403]: info: do_pe_invoke: Query 1023: Requesting the current CIB: S_POLICY_ENGINE
Aug 22 10:40:44 node1 crmd: [4403]: info: do_pe_invoke_callback: Invoking the PE: query=1023, ref=pe_calc-dc-1345650044-1095, seq=130, quorate=1
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Hard error - p_drbd_mount1:0_last_failure_0 failed with rc=5: Preventing ms_drbd_tools from re-starting on quorum
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Hard error - p_drbd_vmstore:0_last_failure_0 failed with rc=5: Preventing ms_drbd_vmstore from re-starting on quorum
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Hard error - p_vm_myvm_last_failure_0 failed with rc=5: Preventing p_vm_myvm from re-starting on quorum
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Hard error - p_drbd_mount2:0_last_failure_0 failed with rc=5: Preventing ms_drbd_crm from re-starting on quorum
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Operation p_drbd_vmstore:0_last_failure_0 found resource p_drbd_vmstore:0 active on node1
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Operation p_drbd_mount2:0_last_failure_0 found resource p_drbd_mount2:0 active on node1
Aug 22 10:40:44 node1 pengine: [13079]: notice: unpack_rsc_op: Operation p_drbd_mount1:0_last_failure_0 found resource p_drbd_mount1:0 active on node1
Aug 22 10:40:44 node1 pengine: [13079]: notice: RecurringOp:  Start recurring monitor (20s) for p_drbd_mount2:0 on node1
Aug 22 10:40:44 node1 pengine: [13079]: notice: RecurringOp:  Start recurring monitor (10s) for p_drbd_mount2:1 on node2
Aug 22 10:40:44 node1 pengine: [13079]: notice: RecurringOp:  Start recurring monitor (20s) for p_drbd_mount2:0 on node1
Aug 22 10:40:44 node1 pengine: [13079]: notice: RecurringOp:  Start recurring monitor (10s) for p_drbd_mount2:1 on node2
Aug 22 10:40:44 node1 pengine: [13079]: notice: RecurringOp:  Start recurring monitor (20s) for p_drbd_mount1:0 on node1
Aug 22 10:40:44 node1 pengine: [13079]: notice: RecurringOp:  Start recurring 

[Pacemaker] Issues with HA cluster for mysqld

2012-08-22 Thread David Parker

Hello,

I'm trying to set up a 2-node, active-passive HA cluster for MySQL using 
heartbeat and Pacemaker.  The operating system is Debian Linux 6.0.5 
64-bit, and I am using the heartbeat packages installed via apt-get.  
The servers involved are the SQL nodes of a running MySQL cluster, so 
the only service I need HA for is the MySQL daemon (mysqld).


What I would like to do is have a single virtual IP address which 
clients use to query MySQL, and have the IP and mysqld fail over to the 
passive node in the event of a failure on the active node.  I have read 
through a lot of the heartbeat and Pacemaker documentation, and here are 
the resources I have configured for the cluster:


* A custom LSB script for mysqld (compliant with Pacemaker's 
requirements as outlined in the documentation)
* An iLO2-based STONITH device using riloe (both servers are HP Proliant 
DL380 G5)

* A virtual IP address for mysqld using IPaddr2
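
(For illustration only: a common crm shell shape for making a VIP and a daemon
fail over together as a unit is a group, as sketched below. The group name
grp_mysql is hypothetical and is not part of my current configuration, which is
included at the end of this mail.)

# Sketch: group members are started in the listed order (IP first, then
# mysqld), stopped in reverse order, and kept on the same node.
group grp_mysql MysqlIP mysqld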

I believe I have configured everything correctly, but I'm not positive.  
Anyway, when I start heartbeat and pacemaker (/etc/init.d/heartbeat 
start), everything seems to be ok.  However, the virtual IP never comes 
up, and the output of crm_resource -LV indicates that something is wrong:


root@ha1:~# crm_resource -LV
crm_resource[28988]: 2012/08/22_14:41:23 WARN: unpack_rsc_op: Processing 
failed op stonith_start_0 on ha1: unknown error (1)

 stonith(stonith:external/riloe) Started
 MysqlIP(ocf::heartbeat:IPaddr2) Stopped
 mysqld (lsb:mysqld) Started

When I attempt to stop heartbeat and Pacemaker (/etc/init.d/heartbeat 
stop) it says "Stopping High-Availability services:" and then hangs for 
about 5 minutes before finally stopping the services.


So, I'm left with a couple of questions.  Is there something wrong with 
my configuration?  Is there a reason why the HA services can't shut down 
in a timely manner?  Is there something else I need to do to get the 
virtual IP working?  Thanks in advance for any help!


- Dave

P.S.  My full config as reported by cibadmin --query is as follows 
(iLO2 password removed):


<cib validate-with="pacemaker-1.0" crm_feature_set="3.0.1" have-quorum="1" admin_epoch="0"
     epoch="26" num_updates="8" cib-last-written="Wed Aug 22 11:16:59 2012"
     dc-uuid="1b48f410-44d1-4e89-8b52-ff23b32db1bc">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
                value="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure"
                name="cluster-infrastructure" value="Heartbeat"/>
        <nvpair id="cib-bootstrap-options-stonith-enabled"
                name="stonith-enabled" value="true"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="1b48f410-44d1-4e89-8b52-ff23b32db1bc" uname="ha1" type="normal"/>
      <node id="9790fe6e-67b2-4817-abf4-966b5aa6948c" uname="ha2" type="normal"/>
    </nodes>
    <resources>
      <primitive class="stonith" id="stonith" type="external/riloe">
        <instance_attributes id="stonith-instance_attributes">
          <nvpair id="stonith-instance_attributes-hostlist" name="hostlist" value="ha2"/>
          <nvpair id="stonith-instance_attributes-ilo_hostname" name="ilo_hostname" value="10.0.1.112"/>
          <nvpair id="stonith-instance_attributes-ilo_user" name="ilo_user" value="Administrator"/>
          <nvpair id="stonith-instance_attributes-ilo_password" name="ilo_password" value=""/>
          <nvpair id="stonith-instance_attributes-ilo_can_reset" name="ilo_can_reset" value="1"/>
          <nvpair id="stonith-instance_attributes-ilo_protocol" name="ilo_protocol" value="2"/>
          <nvpair id="stonith-instance_attributes-ilo_powerdown_method" name="ilo_powerdown_method" value="button"/>
        </instance_attributes>
      </primitive>
      <primitive class="ocf" id="MysqlIP" provider="heartbeat" type="IPaddr2">
        <instance_attributes id="MysqlIP-instance_attributes">
          <nvpair id="MysqlIP-instance_attributes-ip" name="ip" value="192.168.25.9"/>
          <nvpair id="MysqlIP-instance_attributes-cidr_netmask" name="cidr_netmask" value="32"/>
        </instance_attributes>
        <operations>
          <op id="MysqlIP-monitor-30s" interval="30s" name="monitor"/>
        </operations>
      </primitive>
      <primitive id="mysqld" class="lsb" type="mysqld">
      </primitive>
    </resources>
    <constraints/>
    <rsc_defaults/>
    <op_defaults/>
  </configuration>
  <status>
    <node_state id="1b48f410-44d1-4e89-8b52-ff23b32db1bc" uname="ha1" ha="active"
                in_ccm="true" crmd="online" join="member" expected="member"
                crm-debug-origin="do_update_resource" shutdown="0">
      <lrm id="1b48f410-44d1-4e89-8b52-ff23b32db1bc">
        <lrm_resources>
          <lrm_resource id="stonith" type="external/riloe" class="stonith">
            <lrm_rsc_op id="stonith_monitor_0" operation="monitor"
                        crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
                        transition-key="4:0:7:c09f049e-ed06-4d25-bc48-143a70b97e44"
                        transition-magic="0:7;4:0:7:c09f049e-ed06-4d25-bc48-143a70b97e44"
                        call-id="2" rc-code="7" op-status="0" interval="0" last-run="1345660607"
                        last-rc-change="1345660607" exec-time="0" queue-time="0"
                        op-digest="c9a588fa10b441aa64c0a83229e8f3e1"/>
            <lrm_rsc_op id="stonith_start_0" operation="start"
                        crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
                        transition-key="4:2:0:c09f049e-ed06-4d25-bc48-143a70b97e44"
                        transition-magic="0:1;4:2:0:c09f049e-ed06-4d25-bc48-143a70b97e44"

Re: [Pacemaker] stdout reserved word

2012-08-22 Thread Andrew Beekhof
Can you show us the gcc command line and the errors that are produced?

On Wed, Aug 22, 2012 at 3:20 AM, Grüninger, Andreas (LGL Extern)
andreas.gruenin...@lgl.bwl.de wrote:
 Hello

 I compiled Pacemaker 1.1.7 with gcc 4.5 in Solaris 11 and with gcc 4.6 in 
 OpenIndiana 151a6.
 I had to change the following:
 perl -pi -e 's#stdout#stdoutx#'   include/crm/stonith-ng-internal.h
 perl -pi -e 's#stdout#stdoutx#' lib/fencing/st_client.c
 perl -pi -e 's#stdout#stdoutx#' fencing/commands.c

 Am I missing a compiler flag to accept stdout as a variable name?

 In tools/crm_mon.c I added the line with the '+' because sighandler_t is not 
 defined.
 #if CURSES_ENABLED
 + typedef void (*sighandler_t)(int);
 static sighandler_t ncurses_winch_handler;

 Also, the test for a compatible printw function fails with an error message 
 like "your ncurses is too old, we need 5.4".
 The version of the installed ncurses library is 5.7.
 Between running autogen.sh and gmake I appended the following to 
 include/config.h, which invalidates the result of the erroneously failing test:
 echo '#undef HAVE_INCOMPATIBLE_PRINTW' >> include/config.h

 crm and crm_mon work without any problem.

 Should I send patches via the list or report a bug?

 Andreas



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org