Re: [Pacemaker] mysql resource can't start

2013-05-09 Thread Li, Chen
In addition,
I have already tried to start the resource in the crm shell.
Nothing happened, and there is no log in mysql.

Thanks.
-Chen



On 2013-5-9, at 17:10, Li, Chen chen...@intel.com wrote:

Hi list,

I'm a new user to pacemaker.
I'm trying to use pacemaker to set up HA for mysql in active/slave mode.
My configuration is:

node api01
node api02
primitive p_drbd_mysql ocf:linbit:drbd \
params drbd_resource=mysql \
op start interval=0 timeout=90s \
op stop interval=0 timeout=180s \
op promote interval=0 timeout=180s \
op demote interval=0 timeout=180s \
op monitor interval=30s role=Slave \
op monitor interval=29s role=Master
primitive p_fs_mysql ocf:heartbeat:Filesystem \
params device=/dev/drbd/by-res/mysql directory=/var/lib/mysql fstype=ext4 options=relatime \
op start interval=0 timeout=60s \
op stop interval=0 timeout=180s \
op monitor interval=60s timeout=60s
primitive p_ip_mysql ocf:heartbeat:IPaddr2 \
params ip=192.168.11.13 cidr_netmask=16 \
op monitor interval=30s \
meta is-managed=true
primitive p_mysql ocf:heartbeat:mysql \
params additional_parameters=--bind-address=192.168.11.13 config=/etc/mysql/my.cnf pid=/var/run/mysqld/mysqld.pid socket=/var/run/mysqld/mysqld.sock log=/var/log/mysql/mysqld.log \
op monitor interval=20s timeout=10s \
op start interval=0 timeout=120s \
op stop interval=0 timeout=120s \
meta is-managed=true
group g_mysql p_ip_mysql p_fs_mysql p_mysql \
meta target-role=Started is-managed=true
ms ms_drbd_mysql p_drbd_mysql \
meta notify=true clone-max=2
colocation c_mysql_on_drbd inf: g_mysql ms_drbd_mysql:Master
order o_drbd_before_mysql inf: ms_drbd_mysql:promote g_mysql:start
property $id=cib-bootstrap-options \
dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
cluster-infrastructure=openais \
expected-quorum-votes=2 \
stonith-enabled=false \
no-quorum-policy=ignore \
maintenance-mode=true \
last-lrm-refresh=1368118902 \
stop-all-resources=false



But the resource p_mysql never starts, and all resources have an "unmanaged" 
mark in the status:

crm resource status
Resource Group: g_mysql
 p_ip_mysql (ocf::heartbeat:IPaddr2) Started  (unmanaged)
 p_fs_mysql (ocf::heartbeat:Filesystem) Started  (unmanaged)
 p_mysql(ocf::heartbeat:mysql) Stopped  (unmanaged)
Master/Slave Set: ms_drbd_mysql [p_drbd_mysql] (unmanaged)
 p_drbd_mysql:0 (ocf::linbit:drbd) Master  (unmanaged)
 p_drbd_mysql:1 (ocf::linbit:drbd) Slave  (unmanaged)
Can anyone help me?


Thanks.
-chen




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] mysql resource can't start

2013-05-09 Thread emmanuel segura
Hello Li

Maybe all your resources are in the unmanaged state because you set
maintenance-mode=true.
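
If that is the cause, something along these lines (crm shell syntax; the
resource name is the one from your config) should hand control of the
resources back to the cluster:

  crm configure property maintenance-mode=false
  crm resource cleanup p_mysql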


2013/5/9 Li, Chen chen...@intel.com

 [original message quoted in full; trimmed as a duplicate of the message at the top of this thread]




-- 
this is my life and I live it for as long as God wills
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Move Resources

2013-05-09 Thread Jake Smith

- Original Message -
 From: Fusaro Marcelo Damian mfus...@cnc.gov.ar
 To: pacemaker@oss.clusterlabs.org
 Sent: Thursday, May 9, 2013 10:02:47 AM
 Subject: [Pacemaker]  Move Resources
 
 
 
 
 
 I have configured a cluster with 2 nodes and 2 resources in
 active-passive mode. All is working fine, but I want to arrange that when
 a resource fails, all resources move to the passive node. Is this
 possible?

Yes - look at colocation, resource stickiness, and migration threshold.
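
A minimal sketch of that idea, assuming the crm shell and two hypothetical
resources r_ip and r_app (substitute your own names): group the resources so
they always move together, fail over after a single failure, and add
stickiness so they don't migrate back on their own:

  crm configure group g_all r_ip r_app
  crm configure rsc_defaults migration-threshold=1
  crm configure rsc_defaults resource-stickiness=100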

HTH

Jake

 
 
 
 Thanks
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] resource starts but then fails right away

2013-05-09 Thread Brian J. Murrell
Using Pacemaker 1.1.7 on EL6.3, I'm getting an intermittent recurrence
of a situation where I add a resource and start it and it seems to
start but then right away fail.  i.e.

# clean up resource before trying to start, just to make sure we start with a clean slate
# crm resource cleanup testfs-resource1
Cleaning up testfs-resource1 on node1

Waiting for 2 replies from the CRMd.. OK

# now try to start it
# crm_resource -r testfs-resource1 -p target-role -m -v Started

# monitor the startup for success
# crm resource status testfs-resource1

resource testfs-resource1 is NOT running

# crm resource status testfs-resource1

resource testfs-resource1 is NOT running

# crm resource status testfs-resource1

resource testfs-resource1 is NOT running

...

# crm resource status testfs-resource1

resource testfs-resource1 is NOT running

# crm resource status testfs-resource1

resource testfs-resource1 is NOT running

# crm resource status testfs-resource1
resource testfs-resource1 is running on: node1

# it started.  check once more:

# crm status

Last updated: Tue May  7 02:37:34 2013
Last change: Tue May  7 02:36:17 2013 via crm_resource on node1
Stack: openais
Current DC: node1 - partition WITHOUT quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
1 Nodes configured, 2 expected votes
3 Resources configured.


Online: [ node1 ]

 st-fencing (stonith:fence_foo):Started node1
 resource2  (ocf::foo:Target):  Started node1
 testfs-resource1   (ocf::foo:Target):  Started node1 FAILED

Failed actions:
testfs-resource1_monitor_0 (node=node1, call=-1, rc=1, status=Timed Out): 
unknown error

# but lo and behold, it failed, with a monitor operation failing.

# stop it
# crm_resource -r testfs-resource1 -p target-role -m -v Stopped: 0

The syslog for this whole operation, starting with adding the resource
is as follows:

May  7 02:36:12 node1 cib[16831]: info: cib:diff: - <cib admin_epoch="0" epoch="15" num_updates="4" />
May  7 02:36:12 node1 crmd[16836]: info: abort_transition_graph: te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.16.1) : Non-status change
May  7 02:36:12 node1 crmd[16836]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
May  7 02:36:12 node1 cib[16831]: info: cib:diff: + <cib epoch="16" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" update-origin="node1" update-client="crm_resource" cib-last-written="Tue May  7 02:35:56 2013" have-quorum="0" dc-uuid="node1">
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +   <configuration>
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +     <resources>
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +       <primitive class="ocf" provider="foo" type="Target" id="testfs-resource1" __crm_diff_marker__="added:top">
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +         <meta_attributes id="testfs-resource1-meta_attributes">
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +           <nvpair name="target-role" id="testfs-resource1-meta_attributes-target-role" value="Stopped"/>
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +         </meta_attributes>
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +         <operations id="testfs-resource1-operations">
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +           <op id="testfs-resource1-monitor-5" interval="5" name="monitor" timeout="60"/>
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +           <op id="testfs-resource1-start-0" interval="0" name="start" timeout="300"/>
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +           <op id="testfs-resource1-stop-0" interval="0" name="stop" timeout="300"/>
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +         </operations>
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +         <instance_attributes id="testfs-resource1-instance_attributes">
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +           <nvpair id="testfs-resource1-instance_attributes-target" name="target" value="364cfbf8-26dc-44c9-98ad-f8f9d0fafd9a"/>
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +         </instance_attributes>
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +       </primitive>
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +     </resources>
May  7 02:36:12 node1 cib[16831]: info: cib:diff: +   </configuration>
May  7 02:36:12 node1 cib[16831]: info: cib:diff: + </cib>
May  7 02:36:12 node1 cib[16831]: info: cib_process_request: Operation complete: op cib_create for section resources (origin=local/cibadmin/2, version=0.16.1): ok (rc=0)
May  7 02:36:12 node1 pengine[16835]:   notice: unpack_config: On loss of CCM Quorum: Ignore
May  7 02:36:12 node1 crmd[16836]:   notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS

[Pacemaker] ClusterMon Resource starting multiple instances of crm_mon

2013-05-09 Thread Steven Bambling
I'm having some issues getting cluster monitoring set up and configured on a 
3-node multi-state cluster. I'm using Florian's blog as an example: 
http://floriancrouzat.net/2013/01/monitor-a-pacemaker-cluster-with-ocfpacemakerclustermon-andor-external-agent/.

When I create the primitive resource it starts on one of my nodes but spawns 
multiple instances of crm_mon. I don't see any reason that would cause it to 
spawn multiple instances; it's very odd behavior.

I was also looking for some clarification on what this resource provides. It 
looks to me like it kicks off crm_mon in daemon mode, which updates a .html 
file, and with -E it will run an external script. But the resource itself 
doesn't trigger anything if another resource changes state - only if the crm_mon 
process (monitored by PID) fails and has to be restarted. If this is correct, 
what is the best practice for monitoring additional resource states?

v/r

STEVE


Below are some additional data points. 


Creating the Resource

[root@pgdb2 tmp]# crm configure primitive SNMPMon ocf:pacemaker:ClusterMon \
 params user=root update=30 extra_options="-E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net" \
 op monitor on-fail=restart interval=60


Manual crm_mon output

Last updated: Thu May  9 10:24:30 2013
Last change: Thu May  9 10:20:49 2013 via cibadmin on pgdb2.example.com
Stack: cman
Current DC: pgdb1.example.com - partition with quorum
Version: 1.1.8-7.el6-394e906
3 Nodes configured, unknown expected votes
6 Resources configured.


Node pgdb1.example.com: standby
Online: [ pgdb2.example.com pgdb3.example.com ]

 PG_REP_VIP (ocf::heartbeat:IPaddr2):   Started pgdb2.example.com
 PG_CLI_VIP (ocf::heartbeat:IPaddr2):   Started pgdb2.example.com
 Master/Slave Set: msPGSQL [PGSQL]
 Masters: [ pgdb2.example.com ]
 Slaves: [ pgdb3.example.com ]
 Stopped: [ PGSQL:2 ]
 SNMPMon(ocf::pacemaker:ClusterMon):Started pgdb3.example.com

PS to check for process on pgdb3

[root@pgdb3 tmp]# ps aux | grep crm_mon
root 16097  0.0  0.0  82624  2784 ?S10:20   0:00 
/usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E 
/usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h 
/tmp/ClusterMon_SNMPMon.html
root 16099  0.0  0.0  82624  2660 ?S10:20   0:00 
/usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E 
/usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h 
/tmp/ClusterMon_SNMPMon.html
root 16104  0.0  0.0  82624  2448 ?S10:20   0:00 
/usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E 
/usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h 
/tmp/ClusterMon_SNMPMon.html
root 16515  0.0  0.0 103244   852 pts/0S+   10:21   0:00 grep crm_mon

Output from corosync.log

May 09 10:20:51 [3100] pgdb3.cha.arin.net   lrmd: info: 
process_lrmd_get_rsc_info:  Resource 'SNMPMon' not found (3 active 
resources)
May 09 10:20:51 [3100] pgdb3.cha.arin.net   lrmd: info: 
process_lrmd_rsc_register:  Added 'SNMPMon' to the rsc list (4 active 
resources)
May 09 10:20:52 [3103] pgdb3.cha.arin.net   crmd: info: 
services_os_action_execute: Managed ClusterMon_meta-data_0 process 16010 
exited with rc=0
May 09 10:20:52 [3103] pgdb3.cha.arin.net   crmd:   notice: 
process_lrm_event:  LRM operation SNMPMon_monitor_0 (call=61, rc=7, 
cib-update=28, confirmed=true) not running
May 09 10:20:52 [3103] pgdb3.cha.arin.net   crmd:   notice: 
process_lrm_event:  LRM operation SNMPMon_start_0 (call=64, rc=0, 
cib-update=29, confirmed=true) ok
May 09 10:20:52 [3103] pgdb3.cha.arin.net   crmd:   notice: 
process_lrm_event:  LRM operation SNMPMon_monitor_6 (call=67, rc=0, 
cib-update=30, confirmed=false) ok

signature.asc
Description: Message signed with OpenPGP using GPGMail
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crm_mon failed with upgrade failed message

2013-05-09 Thread Michal Fiala
On 05/08/2013 01:17 AM, Andrew Beekhof wrote:
 
 On 07/05/2013, at 11:42 PM, Michal Fiala fi...@mobil.cz wrote:
 
 Hello,

 I have updated a corosync/pacemaker cluster (versions below). The cluster
 is working fine, but when I change the configuration via crm configure edit,
 crm_mon exits with this error message:

 Your current configuration could only be upgraded to null... the
 minimum requirement is pacemaker-1.0. Connection to the CIB terminated.
 Reconnecting...Upgrade failed: Update does not conform to the configured
 schema.
 Screenshot is in attachment, crm_mon.png.

 I have done these steps, problem still exists.

 drbd0 ~ # crm_verify -L || echo failed
 drbd0 ~ #

 drbd0 ~ # cibadmin -u --force || echo failed
 drbd0 ~ #

 My configuration is in the attachment cib.xml (cibadmin -Q).

 Update from:
 sys-cluster/corosync-1.4.4
 sys-cluster/pacemaker-1.1.6.1
 to:
 sys-cluster/corosync-1.4.5
 sys-cluster/pacemaker-1.1.8-r2
 sys-cluster/crmsh-1.2.5

 How can I fix this problem, please?
 
 You'll need to update I'm afraid.
 
 You're hitting the condition addressed by 
 https://github.com/beekhof/pacemaker/commit/70b292b
 

I have patched to 1.1.9-138556c (it is called
sys-cluster/pacemaker-1.1.10_rc1 in Gentoo) and this problem was fixed.
But all my comments in the pacemaker configuration are gone. I have tried
to add new comments (crm configure edit + add comment), but after
committing the configuration, the comments are removed. Are comments no longer supported?

Thanks

 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] 1.1.8 not compatible with 1.1.7?

2013-05-09 Thread Rainer Brestan

Hi Andrew,

yes, this clarifies a lot.

Seems that it is really time to throw away the plugin.

The CMAN solution won't be able (at least from the documentation) to attach new nodes without reconfiguring and restarting CMAN on the existing nodes.

The alternative is corosync 2.x.

ClusterLabs has a quite long list of corosync versions from branches 2.0, 2.1, 2.2 and 2.3.

Besides the currently reported issue with version 2.3, which version does ClusterLabs use for its regression tests?

I found a note somewhere about 2.1.x; is this true?

Rainer



Sent: Thursday, 9 May 2013 at 04:31
From: Andrew Beekhof and...@beekhof.net
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] 1.1.8 not compatible with 1.1.7?


On 08/05/2013, at 4:53 PM, Andrew Beekhof and...@beekhof.net wrote:


 On 08/05/2013, at 4:08 PM, Andrew Beekhof and...@beekhof.net wrote:


 On 03/05/2013, at 8:46 PM, Rainer Brestan rainer.bres...@gmx.net wrote:

 Now I have all the logs for some combinations.

 Corosync: 1.4.1-7 for all the tests on all nodes
 Base is always a fresh installation of each node with all packages equal except the pacemaker version.
 int2node1 node id: 1743917066
 int2node2 node id: 1777471498

 Each ZIP file includes the logs from both nodes and the status output of crm_mon and cibadmin -Q.

 1.) 1.1.8-4 attaches to running 1.1.7-6 cluster
 https://www.dropbox.com/s/06oyrle4ny47uv9/attach_1.1.8-4_to_1.1.7-6.zip
 Result: join outstanding

 2.) 1.1.9-2 attaches to running 1.1.7-6 cluster
 https://www.dropbox.com/s/fv5kcm2yb5jz56z/attach_1.1.9-2_to_1.1.7-6.zip
 Result: join outstanding

 Neither side is seeing anything from the other, which is very unexpected.
 I notice you're using the plugin... which acts as a message router.

 So I suspect something in there has changed (though I'm at a loss to say what) and that cman-based clusters are unaffected.

 Confirmed, cman clusters are unaffected.
 I'm yet to work out what changed in the plugin.

I worked it out...

The Red Hat changelog for 1.1.8-2 originally contained

+- Cman is the only supported membership & quorum provider, do not ship the corosync plugin

When this decision was reversed (when I realised no-one was seeing the ERROR logs indicating it was going away), I neglected to re-instate the following distro-specific patch (which avoided conflicts between the ID used by CMAN and Pacemaker):

diff --git a/configure.ac b/configure.ac
index a3784d5..dafa9e2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1133,7 +1133,7 @@ AC_MSG_CHECKING(for native corosync)
COROSYNC_LIBS=
CS_USES_LIBQB=0

-PCMK_SERVICE_ID=9
+PCMK_SERVICE_ID=10
LCRSODIR=$libdir

if test $SUPPORT_CS = no; then


So Pacemaker < 6.4 is talking on slot 10, while Pacemaker == 6.4 is using slot 9.
This is why the two versions cannot see each other :-(
I'm very sorry.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] 1.1.8 not compatible with 1.1.7?

2013-05-09 Thread Andrew Beekhof

On 10/05/2013, at 6:05 AM, Rainer Brestan rainer.bres...@gmx.net wrote:

 Hi Andrew,
 yes, this clarifies a lot.
 Seems that it is really time to throw away the plugin.
  The CMAN solution won't be able (at least from the documentation) to attach 
  new nodes without reconfiguring and restarting CMAN on the existing nodes

That doesn't sound right to me.
CC'ing Fabio who should know more (or who does)

 .
 The alternative is corosync 2.x.

Not on RHEL6 - unless you're building things yourself of course.

  ClusterLabs has a quite long list of corosync versions from branches 2.0, 2.1, 
  2.2 and 2.3.
  Besides the currently reported issue with version 2.3, which version does 
  ClusterLabs use for its regression tests?
  I found a note somewhere about 2.1.x; is this true?

According to rpm, I've been using:

 Source RPM  : corosync-2.3.0-1.1.2c22.el7.src.rpm
and
 Source RPM  : corosync-2.3.0-1.fc18.src.rpm



 Rainer
  
  [Andrew's earlier message of 9 May 2013, quoted in full above, trimmed as a duplicate]


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] SmartOS / illumos

2013-05-09 Thread Dalho PARK
Hello,
I'm trying to compile pacemaker on SmartOS and am hitting an error during make.
Has anyone already compiled it successfully on SmartOS? Or
can someone help solve the problem I'm having now?
Thank you,

Glib version : glib2-2.34.3

GCC version:
[root@web01 ~/pacemaker]# gcc -v
Using built-in specs.
COLLECT_GCC=/opt/local/gcc47/bin/gcc
COLLECT_LTO_WRAPPER=/opt/local/gcc47/libexec/gcc/x86_64-sun-solaris2.11/4.7.2/lto-wrapper
Target: x86_64-sun-solaris2.11
Configured with: ../gcc-4.7.2/configure --enable-languages='c go fortran c++' 
--enable-shared --enable-long-long --with-local-prefix=/opt/local/gcc47 
--enable-libssp --enable-threads=posix --with-boot-ldflags='-static-libstdc++ 
-static-libgcc -Wl,-R/opt/local/lib ' --disable-nls --enable-__cxa_atexit 
--with-gxx-include-dir=/opt/local/gcc47/include/c++/ --without-gnu-ld 
--with-ld=/usr/bin/ld --with-gnu-as --with-as=/opt/local/bin/gas 
--prefix=/opt/local/gcc47 --build=x86_64-sun-solaris2.11 
--host=x86_64-sun-solaris2.11 --infodir=/opt/local/gcc47/info 
--mandir=/opt/local/gcc47/man
Thread model: posix
gcc version 4.7.2 (GCC)


Here is the error log of make
+++
gmake[2]: Entering directory `/root/pacemaker/lib/common'
  CC   ipc.lo
In file included from ../../include/crm_internal.h:26:0,
 from ipc.c:19:
../../include/portability.h:81:16: error: expected '=', ',', ';', 'asm' or 
'__attribute__' before '*' token
../../include/portability.h:86:1: error: expected identifier or '(' before '}' 
token
../../include/portability.h:86:1: error: useless type name in empty declaration 
[-Werror]
../../include/portability.h:98:1: error: static declaration of 
'g_hash_table_get_values' follows non-static declaration
In file included from /opt/local/include/glib/glib-2.0/glib.h:52:0,
 from ../../include/portability.h:79,
 from ../../include/crm_internal.h:26,
 from ipc.c:19:
/opt/local/include/glib/glib-2.0/glib/ghash.h:101:13: note: previous 
declaration of 'g_hash_table_get_values' was here
In file included from ../../include/crm_internal.h:26:0,
 from ipc.c:19:
../../include/portability.h: In function 'g_hash_table_nth_data':
../../include/portability.h:111:13: error: 'GHashTableIter' has no member named 
'lpc'
../../include/portability.h:111:28: error: 'GHashTableIter' has no member named 
'nth'
../../include/portability.h:112:13: error: 'GHashTableIter' has no member named 
'key'
../../include/portability.h:113:13: error: 'GHashTableIter' has no member named 
'value'
../../include/portability.h: At top level:
../../include/portability.h:121:1: error: static declaration of 
'g_hash_table_iter_init' follows non-static declaration
In file included from /opt/local/include/glib/glib-2.0/glib.h:52:0,
 from ../../include/portability.h:79,
 from ../../include/crm_internal.h:26,
 from ipc.c:19:
/opt/local/include/glib/glib-2.0/glib/ghash.h:103:13: note: previous 
declaration of 'g_hash_table_iter_init' was here
In file included from ../../include/crm_internal.h:26:0,
 from ipc.c:19:
../../include/portability.h: In function 'g_hash_table_iter_init':
../../include/portability.h:123:9: error: 'GHashTableIter' has no member named 
'hash'
../../include/portability.h:124:9: error: 'GHashTableIter' has no member named 
'nth'
../../include/portability.h:125:9: error: 'GHashTableIter' has no member named 
'lpc'
../../include/portability.h:126:9: error: 'GHashTableIter' has no member named 
'key'
../../include/portability.h:127:9: error: 'GHashTableIter' has no member named 
'value'
../../include/portability.h: At top level:
../../include/portability.h:131:1: error: static declaration of 
'g_hash_table_iter_next' follows non-static declaration
In file included from /opt/local/include/glib/glib-2.0/glib.h:52:0,
 from ../../include/portability.h:79,
 from ../../include/crm_internal.h:26,
 from ipc.c:19:
/opt/local/include/glib/glib-2.0/glib/ghash.h:105:13: note: previous 
declaration of 'g_hash_table_iter_next' was here
In file included from ../../include/crm_internal.h:26:0,
 from ipc.c:19:
../../include/portability.h: In function 'g_hash_table_iter_next':
../../include/portability.h:135:9: error: 'GHashTableIter' has no member named 
'lpc'
../../include/portability.h:136:9: error: 'GHashTableIter' has no member named 
'key'
../../include/portability.h:137:9: error: 'GHashTableIter' has no member named 
'value'
../../include/portability.h:138:13: error: 'GHashTableIter' has no member named 
'nth'
../../include/portability.h:138:43: error: 'GHashTableIter' has no member named 
'hash'
../../include/portability.h:139:42: error: 'GHashTableIter' has no member named 
'hash'
../../include/portability.h:140:13: error: 'GHashTableIter' has no member named 
'nth'
../../include/portability.h:143:20: error: 'GHashTableIter' has no member named 
'key'

Re: [Pacemaker] Using fence_sanlock with pacemaker 1.1.8-7.el6

2013-05-09 Thread John McCabe
On Thu, May 9, 2013 at 3:12 AM, Andrew Beekhof and...@beekhof.net wrote:


 On 08/05/2013, at 11:52 PM, John McCabe j...@johnmccabe.net wrote:

  Hi,
  I've been trying, unsuccessfully, to get fence_sanlock running as a
 fence device within pacemaker 1.1.8 on CentOS 6.4.
 
  I've set the pcmk_host_argument=host_id

 You mean the literal string host_id or the true value?
 Might be better to send us the actual config you're using along with log
 files.

 Also, what does fence_sanlock -o metadata say?


[root@fee ~]# fence_sanlock -o metadata
<?xml version="1.0" ?>
<resource-agent name="fence_sanlock" shortdesc="Fence agent for watchdog and shared storage">
<longdesc>
fence_sanlock is an i/o fencing agent that uses the watchdog device to
reset nodes.  Shared storage (block or file) is used by sanlock to ensure
that fenced nodes are reset, and to notify partitioned nodes that they
need to be reset.
</longdesc>
<vendor-url>http://www.redhat.com/</vendor-url>
<parameters>
        <parameter name="action" unique="0" required="1">
                <getopt mixed="-o &lt;action&gt;" />
                <content type="string" default="off" />
                <shortdesc lang="en">Fencing Action</shortdesc>
        </parameter>
        <parameter name="path" unique="0" required="1">
                <getopt mixed="-p &lt;action&gt;" />
                <content type="string" />
                <shortdesc lang="en">Path to sanlock shared storage</shortdesc>
        </parameter>
        <parameter name="host_id" unique="0" required="1">
                <getopt mixed="-i &lt;action&gt;" />
                <content type="string" />
                <shortdesc lang="en">Host id for sanlock (1-128)</shortdesc>
        </parameter>
</parameters>
<actions>
        <action name="on" />
        <action name="off" />
        <action name="status" />
        <action name="metadata" />
        <action name="sanlock_init" />
</actions>
</resource-agent>


I'd set the pcmk_host_argument to the literal string since the errors
thrown complain that the host_id param is missing, and I'd assumed that
with pcmk_host_map also set we'd end up passing the mapped id rather
than the hostname (attached an archive with /var/log/messages covering when
the stonith device is added with pcs):


  May 10 01:33:42 fee stonith-ng[10542]:  warning: log_operation:
st-sanlock:10725 [ host_id argument required ]

  pcs -f stonith_cfg_sanlock stonith create st-sanlock fence_sanlock
path=/dev/mapper/vg_shared-lv_sanlock pcmk_host_list="fee-1 fi-1"
pcmk_host_map="fee-1:1;fi-1:2" pcmk_host_argument=host_id


Taking a closer look at the fence_sanlock script itself (from
fence-sanlock-2.6-2.el6.x86_64), it doesn't appear to support a monitor
operation, which led me to suspect that it doesn't actually support being
used with pacemaker, at least not without updating the agent
script. Setting pcmk_monitor_action=status didn't help either, as it still
fails requesting that host_id be set.

I ended up getting sbd up and running as an interim solution - but I'd
really like to be able to stick with a fencing agent that's got a future in
RHEL where possible. Is the expectation/intention that all fencing agents
be compatible with pacemaker?

 along with pcmk_host_map, pcmk_host_list and path.
 
  But it later complains that its unable to process the monitor operation
 since no host_id is provided.. I'd have assumed that the pcmk_host_argument
 would have performed a mapping, but it seems not to in the case of the
 monitor operation. When pcs stonith list returned fence_sanlock in its list
 of agents I'd hoped it was going to be straightforward.
 
  Is fence_sanlock actually compatible with pacemaker, and has anyone had
 success using it with pacemaker rather than just directly within CMAN?
 
  Yours confused,
  John
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



messages.sanlock.gz
Description: GNU Zip compressed data
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] SmartOS / illumos

2013-05-09 Thread Andrew Beekhof
Looks like you need https://github.com/beekhof/pacemaker/commit/629aa36
I recall making that change once before but it got lost somehow.

On 10/05/2013, at 10:02 AM, Dalho PARK dp...@smart-trade.net wrote:

  [Dalho's original message and build log, quoted in full above, trimmed as a duplicate]

Re: [Pacemaker] Using fence_sanlock with pacemaker 1.1.8-7.el6

2013-05-09 Thread John McCabe
Forgot to include my cman config...

<?xml version="1.0"?>
<cluster config_version="1" name="ngwcluster">
  <logging debug="off"/>
  <clusternodes>
    <clusternode name="fee-1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="fee-1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fi-1" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="fi-1"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <cman two_node="1" expected_votes="1">
  </cman>
</cluster>
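
(Aside: on RHEL6 with cman, a cluster.conf like this can usually be
sanity-checked before restarting anything by running:

  ccs_config_validate

which validates /etc/cluster/cluster.conf against the cluster schema.)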


On Fri, May 10, 2013 at 2:18 AM, John McCabe j...@johnmccabe.net wrote:




 [John's previous message, quoted in full above, trimmed as a duplicate]

Re: [Pacemaker] Using fence_sanlock with pacemaker 1.1.8-7.el6

2013-05-09 Thread Andrew Beekhof

On 10/05/2013, at 11:18 AM, John McCabe j...@johnmccabe.net wrote:

 
 
 
 On Thu, May 9, 2013 at 3:12 AM, Andrew Beekhof and...@beekhof.net wrote:
 
 On 08/05/2013, at 11:52 PM, John McCabe j...@johnmccabe.net wrote:
 
  Hi,
   I've been trying, unsuccessfully, to get fence_sanlock running as a fence 
   device within pacemaker 1.1.8 on CentOS 6.4.
 
  I've set the pcmk_host_argument=host_id
 
 You mean the literal string host_id or the true value?
 Might be better to send us the actual config you're using along with log 
 files.
 
 Also, what does fence_sanlock -o metadata say? 
 
  [fence_sanlock -o metadata output, quoted in full above, trimmed as a duplicate]
  
 
 I'd set the pcmk_host_argument to the literal string since the errors thrown 
 complain that the host_id param is missing, and I'd assumed that with 
 pcmk_host_map also set we'd end up passing the mapped id rather than the 
 hostname (attached an archive with /var/log/messages covering when the 
 stonith device is added with pcs):
 
 
   May 10 01:33:42 fee stonith-ng[10542]:  warning: log_operation: 
 st-sanlock:10725 [ host_id argument required ]
 
   pcs -f stonith_cfg_sanlock stonith create st-sanlock fence_sanlock 
 path=/dev/mapper/vg_shared-lv_sanlock pcmk_host_list="fee-1 fi-1" 
 pcmk_host_map="fee-1:1;fi-1:2" pcmk_host_argument=host_id
 
 
 Taking a closer look at the fence_sanlock script itself (from 
 fence-sanlock-2.6-2.el6.x86_64), it doesn't appear to support a monitor 
 operation, which led me to suspect that it doesn't actually support being used 
 with pacemaker, at least not without updating the agent script.

Correct

 Setting pcmk_monitor_action=status didn't help either as it still fails 
 requesting the host_id be set.

Also correct, since we're testing the agent itself, not a specific host_id.

 
 I ended up getting sbd up and running as an interim solution - but I'd really 
 like to be able to stick with a fencing agent that's got a future in RHEL 
 where possible. Is the expectation/intention that all fencing agents be 
 compatible with pacemaker?

Yes.  There are a couple of ways in which this agent is not conforming to Red 
Hat's API, could you file a bug against it and CC me please?

 
  along with pcmk_host_map, pcmk_host_list and path.
 
  But it later complains that its unable to process the monitor operation 
  since no host_id is provided.. I'd have assumed that the pcmk_host_argument 
  would have performed a mapping, but it seems not to in the case of the 
  monitor operation. When pcs stonith list returned fence_sanlock in its list 
  of agents I'd hoped it was going to be straightforward.
 
  Is fence_sanlock actually compatible with pacemaker, and has anyone had 
  success using it with pacemaker rather than just directly within CMAN?
 
  Yours confused,
  John
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 messages.sanlock.gz___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: 

Re: [Pacemaker] resource starts but then fails right away

2013-05-09 Thread Andrew Beekhof

On 10/05/2013, at 12:26 AM, Brian J. Murrell br...@interlinx.bc.ca wrote:

 I do see the:
 
 May  7 02:37:32 node1 crmd[16836]:error: print_elem: Aborting transition, 
 action lost: [Action 5]: In-flight (id: testfs-resource1_monitor_0, loc: 
 node1, priority: 0)
 
 in the log.  Is that the root cause of the problem?  

Ordinarily I'd have said yes, but I also see:

May  7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing resource 
testfs-resource1 for 18002_crm_resource (internal) on node1
May  7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation 
monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running, 
flush delayed
May  7 02:36:16 node1 crmd[16836]: info: lrm_remove_deleted_op: Removing op 
testfs-resource1_monitor_0:8 for deleted resource testfs-resource1

So apparently a badly timed cleanup was run.  Did you do that or was it the crm 
shell?

 If so, what's that
 trying to tell me, exactly?  If not, what is the cause of the problem?
 
 It really can't be the RA timing out since I give the monitor operation
 a 60 second timeout and the status action of the RA only takes a few
 seconds at most to run and is not really an operation that can get
 blocked on anything.  It's effectively the grepping of a file.

If the machine is heavily loaded, or just very busy with file I/O, that can 
still take quite a long time.
I've seen IPaddr monitor actions take over a minute for example.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] failure handling on a cloned resource

2013-05-09 Thread Andrew Beekhof

On 07/05/2013, at 5:15 PM, Johan Huysmans johan.huysm...@inuits.be wrote:

 Hi,
 
  I only keep a couple of pe-input files, and that pe-input-1 version was 
  already overwritten.
  I redid my tests as described in my previous mails.
  
  At the end of the test it was again written to pe-input-1, which is included 
  as an attachment.

Perfect.
Basically the PE doesn't know how to correctly recognise that 
d_tomcat_monitor_15000 needs to be processed after d_tomcat_last_failure_0:

<lrm_rsc_op id="d_tomcat_monitor_15000" operation_key="d_tomcat_monitor_15000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" transition-magic="0:0;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" call-id="44" rc-code="0" op-status="0" interval="15000" last-rc-change="1367910303" exec-time="0" queue-time="0" op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
<lrm_rsc_op id="d_tomcat_last_failure_0" operation_key="d_tomcat_monitor_15000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" transition-magic="0:1;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" call-id="44" rc-code="1" op-status="0" interval="15000" last-rc-change="1367909258" exec-time="0" queue-time="0" op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>

which would allow it to recognise that the resource is healthy once again.

I'll see what I can do...

 
 gr.
 Johan
 
 On 2013-05-07 04:08, Andrew Beekhof wrote:
  I have a much clearer idea of the problem you're seeing now, thank you.
 
 Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ?
 
 On 03/05/2013, at 10:40 PM, Johan Huysmans johan.huysm...@inuits.be wrote:
 
 Hi,
 
  Below you can see my setup and my test; this shows that my cloned resource 
  with on-fail=block does not recover automatically.
 
 My Setup:
 
 # rpm -aq | grep -i pacemaker
 pacemaker-libs-1.1.9-1512.el6.i686
 pacemaker-cluster-libs-1.1.9-1512.el6.i686
 pacemaker-cli-1.1.9-1512.el6.i686
 pacemaker-1.1.9-1512.el6.i686
 
 # crm configure show
 node CSE-1
 node CSE-2
 primitive d_tomcat ocf:ntc:tomcat \
op monitor interval=15s timeout=510s on-fail=block \
op start interval=0 timeout=510s \
 params instance_name=NMS monitor_use_ssl=no monitor_urls=/cse/health monitor_timeout=120 \
meta migration-threshold=1
 primitive ip_11 ocf:heartbeat:IPaddr2 \
op monitor interval=10s \
 params broadcast=172.16.11.31 ip=172.16.11.31 nic=bond0.111 iflabel=ha \
meta migration-threshold=1 failure-timeout=10
 primitive ip_19 ocf:heartbeat:IPaddr2 \
op monitor interval=10s \
 params broadcast=172.18.19.31 ip=172.18.19.31 nic=bond0.119 iflabel=ha \
meta migration-threshold=1 failure-timeout=10
 group svc-cse ip_19 ip_11
 clone cl_tomcat d_tomcat
 colocation colo_tomcat inf: svc-cse cl_tomcat
 order order_tomcat inf: cl_tomcat svc-cse
 property $id=cib-bootstrap-options \
dc-version=1.1.9-1512.el6-2a917dd \
cluster-infrastructure=cman \
pe-warn-series-max=9 \
no-quorum-policy=ignore \
stonith-enabled=false \
pe-input-series-max=9 \
pe-error-series-max=9 \
last-lrm-refresh=1367582088
 
 Currently only 1 node is available, CSE-1.
 
 
 This is how I am currently testing my setup:
 
 = Starting point: Everything up and running
 
 # crm resource status
 Resource Group: svc-cse
 ip_19(ocf::heartbeat:IPaddr2):Started
 ip_11(ocf::heartbeat:IPaddr2):Started
 Clone Set: cl_tomcat [d_tomcat]
 Started: [ CSE-1 ]
 Stopped: [ d_tomcat:1 ]
 
 = Causing failure: Change system so tomcat is running but has a failure 
 (in attachment step_2.log)
 
 # crm resource status
 Resource Group: svc-cse
 ip_19(ocf::heartbeat:IPaddr2):Stopped
 ip_11(ocf::heartbeat:IPaddr2):Stopped
 Clone Set: cl_tomcat [d_tomcat]
 d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
 Stopped: [ d_tomcat:1 ]
 
 = Fixing failure: Revert system so tomcat is running without failure (in 
 attachment step_3.log)
 
 # crm resource status
 Resource Group: svc-cse
 ip_19(ocf::heartbeat:IPaddr2):Stopped
 ip_11(ocf::heartbeat:IPaddr2):Stopped
 Clone Set: cl_tomcat [d_tomcat]
 d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED
 Stopped: [ d_tomcat:1 ]
 
  As you can see in the logs, the OCF script doesn't return any failure. This 
  is noticed by pacemaker;
  however, it is not reflected in crm_mon, and the depending 
  resources are not started.
 
 Gr.
 Johan
 
 On 2013-05-03 03:04, Andrew Beekhof wrote:
 On 02/05/2013, at 5:45 PM, Johan Huysmans johan.huysm...@inuits.be wrote:
 
 On 2013-05-01 05:48, Andrew Beekhof wrote:
 On 17/04/2013, at 9:54 PM, Johan Huysmans johan.huysm...@inuits.be 
 wrote:
 
 Hi All,
 
  I'm trying to set up a specific configuration in our cluster, however 
  I'm struggling with my configuration.
 
 This is what I'm trying to achieve:
 On both nodes of the 

Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8

2013-05-09 Thread pavan tc
On Fri, May 10, 2013 at 6:21 AM, Andrew Beekhof and...@beekhof.net wrote:


 On 08/05/2013, at 9:16 PM, pavan tc pavan...@gmail.com wrote:


Hi Andrew,

Thanks much for looking into this. I have some queries inline.


  Hi,
 
  I have a two-node cluster with STONITH disabled.

  That's not a good idea.


Ok. I'll try and configure stonith.

 I am still running with the pcmk plugin as opposed to the recommended
 CMAN plugin.

 On rhel6?


Yes.



 
  With 1.1.8, I see some messages (appended to this mail) once in a while.
 I do not understand some keywords here - There is a Leave action. I am
 not sure what that is.

 It means the cluster is not going to change the state of the resource.


Why did the cluster execute the Leave action at this point? Is there some
other error that triggers this? Or is it a benign message?


  And, there is a CIB update failure that leads to a RECOVER action. There
 is a message that says the RECOVER action is not supported. Finally this
 leads to a stop and start of my resource.

 Well, and also Pacemaker's crmd process.
 My guess... the node is overloaded which is causing the cib queries to
 time out.


Is there a cib query timeout value that I can set? I was earlier getting
the TOTEM timeout.
So, I set the token to a larger value (5 seconds) in corosync.conf and
things were much better.
But now, I have started hitting this problem.
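 
For reference, the relevant fragment of my corosync.conf now looks roughly like
this (token is given in milliseconds; the other totem settings are omitted):
 
  totem {
      version: 2
      token: 5000
  }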

Thanks,
Pavan

 I can copy the crm configure show output, but nothing special there.
 
  Thanks much.
  Pavan
 
  PS: The resource vha-bcd94724-3ec0-4a8d-8951-9d27be3a6acb is stale. The
 underlying device that represents this resource has been removed. However,
 the resource is still part of the CIB. All errors related to that resource
 can be ignored. But can this cause a node to be stopped/fenced?

 Not if fencing is disabled.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8

2013-05-09 Thread pavan tc

 Is there a cib query timeout value that I can set? I was earlier getting
 the TOTEM timeout.
 So, I set the token to a larger value (5 seconds) in corosync.conf and
 things were much better.
 But now, I have started hitting this problem.


I'll experiment with the cibadmin -t (--timeout) option to see if it helps.
As far as I can see from the code, the default seems to be 30 ms.
Is there a widely used default for systems under high load, or is it found
out the hard way for each setup?

Pavan
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8

2013-05-09 Thread Andrew Beekhof

On 10/05/2013, at 1:44 PM, pavan tc pavan...@gmail.com wrote:

 
 
 
 On Fri, May 10, 2013 at 6:21 AM, Andrew Beekhof and...@beekhof.net wrote:
 
 On 08/05/2013, at 9:16 PM, pavan tc pavan...@gmail.com wrote:
 
 
 Hi Andrew,
 
 Thanks much for looking into this. I have some queries inline.
  
  Hi,
 
  I have a two-node cluster with STONITH disabled.
 
  That's not a good idea.
 
 Ok. I'll try and configure stonith.
 
  I am still running with the pcmk plugin as opposed to the recommended CMAN 
  plugin.
 
 On rhel6?
 
 Yes.
  
 
 
  With 1.1.8, I see some messages (appended to this mail) once in a while. I 
  do not understand some of the keywords here - there is a Leave action, and I 
  am not sure what that is.
 
 It means the cluster is not going to change the state of the resource.
 
 Why did the cluster execute the Leave action at this point?

There is no Leave action being executed.  We are simply logging that nothing 
is going to happen to that resource - it is in the state that we expect/want.
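
If you want to see what the policy engine is planning, you can run crm_simulate 
against the live CIB; my recollection is that extra verbosity is needed before 
the resources it will leave as they are show up in the output:

  # print cluster status and the planned transition for the live CIB
  crm_simulate -L
  # with more verbosity, the unchanged ("Leave") resources should be logged too
  crm_simulate -L -V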

 Is there some other error that triggers this? Or is it a benign message?
 
 
  And, there is a CIB update failure that leads to a RECOVER action. There is 
  a message that says the RECOVER action is not supported. Finally this leads 
  to a stop and start of my resource.
 
 Well, and also Pacemaker's crmd process.
 My guess... the node is overloaded which is causing the cib queries to time 
 out.
 
 
 Is there a cib query timeout value that I can set?

No.  You can set the batch-limit property though; this reduces the rate at 
which CIB operations are attempted.
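
For example (the value here is purely illustrative):

  # limit the number of jobs the cluster will execute in parallel
  crm configure property batch-limit=10
  # or with the lower-level tool:
  crm_attribute --type crm_config --name batch-limit --update 10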

 I was earlier getting the TOTEM timeout.
 So, I set the token to a larger value (5 seconds) in corosync.conf and things 
 were much better.
 But now, I have started hitting this problem.
 
 Thanks,
 Pavan
 
  I can copy the crm configure show output, but nothing special there.
 
  Thanks much.
  Pavan
 
  PS: The resource vha-bcd94724-3ec0-4a8d-8951-9d27be3a6acb is stale. The 
  underlying device that represents this resource has been removed. However, 
  the resource is still part of the CIB. All errors related to that resource 
  can be ignored. But can this cause a node to be stopped/fenced?
 
 Not if fencing is disabled.
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8

2013-05-09 Thread pavan tc


 I'll experiment with the cibadmin -t (--timeout) option to see if it helps.
 As far as I can see from the code, the default seems to be 30 ms.
 Is there a widely used default for systems under high load, or is it found
 out the hard way for each setup?


Easier said than done. Can someone help with how to use the --timeout
option in cibadmin?
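
For the record, the usage I am experimenting with, based on the help text
(the timeout apparently in seconds), is:

  # query the full CIB, waiting up to 60 seconds before declaring failure
  cibadmin --query --timeout 60

but I am not sure the value is being honoured.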

Pavan
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] failure handling on a cloned resource

2013-05-09 Thread Andrew Beekhof
Fixed!

  https://github.com/beekhof/pacemaker/commit/d87de1b

On 10/05/2013, at 11:59 AM, Andrew Beekhof and...@beekhof.net wrote:

 
 On 07/05/2013, at 5:15 PM, Johan Huysmans johan.huysm...@inuits.be wrote:
 
 Hi,
 
 I only keep a couple of pe-input files, and that pe-input-1 version was 
 already overwritten.
 I redid my tests as described in my previous mails.
 
 At the end of the test it was again written to pe-input-1, which is included 
 as an attachment.
 
 Perfect.
 Basically the PE doesn't know how to correctly recognise that 
 d_tomcat_monitor_15000 needs to be processed after d_tomcat_last_failure_0:
 
    <lrm_rsc_op id="d_tomcat_monitor_15000" operation_key="d_tomcat_monitor_15000"
        operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7"
        transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
        transition-magic="0:0;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
        call-id="44" rc-code="0" op-status="0" interval="15000"
        last-rc-change="1367910303" exec-time="0" queue-time="0"
        op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
    <lrm_rsc_op id="d_tomcat_last_failure_0" operation_key="d_tomcat_monitor_15000"
        operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7"
        transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
        transition-magic="0:1;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89"
        call-id="44" rc-code="1" op-status="0" interval="15000"
        last-rc-change="1367909258" exec-time="0" queue-time="0"
        op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
 
 which would allow it to recognise that the resource is healthy once again (the 
 two entries share call-id 44, but the later last-rc-change with rc-code 0 should 
 supersede the earlier failure).
 
 I'll see what I can do...
 
 
 gr.
 Johan
 
 On 2013-05-07 04:08, Andrew Beekhof wrote:
 I have a much clearer idea of the problem you're seeing now, thank you.
 
 Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ?
 
 On 03/05/2013, at 10:40 PM, Johan Huysmans johan.huysm...@inuits.be wrote:
 
 Hi,
 
 Below you can see my setup and my test; this shows that my cloned resource 
 with on-fail=block does not recover automatically.
 
 My Setup:
 
 # rpm -aq | grep -i pacemaker
 pacemaker-libs-1.1.9-1512.el6.i686
 pacemaker-cluster-libs-1.1.9-1512.el6.i686
 pacemaker-cli-1.1.9-1512.el6.i686
 pacemaker-1.1.9-1512.el6.i686
 
 # crm configure show
 node CSE-1
 node CSE-2
 primitive d_tomcat ocf:ntc:tomcat \
   op monitor interval=15s timeout=510s on-fail=block \
   op start interval=0 timeout=510s \
   params instance_name=NMS monitor_use_ssl=no 
 monitor_urls=/cse/health monitor_timeout=120 \
   meta migration-threshold=1
 primitive ip_11 ocf:heartbeat:IPaddr2 \
   op monitor interval=10s \
   params broadcast=172.16.11.31 ip=172.16.11.31 nic=bond0.111 
 iflabel=ha \
   meta migration-threshold=1 failure-timeout=10
 primitive ip_19 ocf:heartbeat:IPaddr2 \
   op monitor interval=10s \
   params broadcast=172.18.19.31 ip=172.18.19.31 nic=bond0.119 
 iflabel=ha \
   meta migration-threshold=1 failure-timeout=10
 group svc-cse ip_19 ip_11
 clone cl_tomcat d_tomcat
 colocation colo_tomcat inf: svc-cse cl_tomcat
 order order_tomcat inf: cl_tomcat svc-cse
 property $id=cib-bootstrap-options \
   dc-version=1.1.9-1512.el6-2a917dd \
   cluster-infrastructure=cman \
   pe-warn-series-max=9 \
   no-quorum-policy=ignore \
   stonith-enabled=false \
   pe-input-series-max=9 \
   pe-error-series-max=9 \
   last-lrm-refresh=1367582088
 
 Currently only 1 node is available, CSE-1.
 
 
 This is how I am currently testing my setup:
 
 = Starting point: Everything up and running
 
 # crm resource status
 Resource Group: svc-cse
 ip_19 (ocf::heartbeat:IPaddr2): Started
 ip_11 (ocf::heartbeat:IPaddr2): Started
 Clone Set: cl_tomcat [d_tomcat]
Started: [ CSE-1 ]
Stopped: [ d_tomcat:1 ]
 
 = Causing failure: Change system so tomcat is running but has a failure 
 (in attachment step_2.log)
 
 # crm resource status
 Resource Group: svc-cse
 ip_19 (ocf::heartbeat:IPaddr2): Stopped
 ip_11 (ocf::heartbeat:IPaddr2): Stopped
 Clone Set: cl_tomcat [d_tomcat]
 d_tomcat:0 (ocf::ntc:tomcat): Started (unmanaged) FAILED
Stopped: [ d_tomcat:1 ]
 
 = Fixing failure: Revert system so tomcat is running without failure (in 
 attachment step_3.log)
 
 # crm resource status
 Resource Group: svc-cse
 ip_19 (ocf::heartbeat:IPaddr2): Stopped
 ip_11 (ocf::heartbeat:IPaddr2): Stopped
 Clone Set: cl_tomcat [d_tomcat]
 d_tomcat:0 (ocf::ntc:tomcat): Started (unmanaged) FAILED
Stopped: [ d_tomcat:1 ]
 
 As you can see in the logs, the OCF script no longer returns any failure. 
 Pacemaker notices this, but the recovery isn't reflected in crm_mon and the 
 dependent resources aren't started.
 
 Gr.
 Johan
 
 On 2013-05-03 03:04, Andrew Beekhof wrote:
 On 02/05/2013, at 5:45 PM, Johan Huysmans johan.huysm...@inuits.be 
 wrote:
 
 On 2013-05-01 05:48, Andrew Beekhof wrote:
 On 17/04/2013, at 9:54 PM, Johan Huysmans johan.huysm...@inuits.be 
 wrote:
 
 Hi All,
 
 I'm trying to setup a specific 

Re: [Pacemaker] Behavior when crm_mon is a daemon

2013-05-09 Thread Yuichi SEINO
Hi,

2013/5/1 Andrew Beekhof and...@beekhof.net:

 On 19/04/2013, at 11:05 AM, Yuichi SEINO seino.clust...@gmail.com wrote:

 HI,

 2013/4/16 Andrew Beekhof and...@beekhof.net:

 On 15/04/2013, at 7:42 PM, Yuichi SEINO seino.clust...@gmail.com wrote:

 Hi All,

 I have been looking at how the existing tools daemonize, as I want to write a new daemon. So, I have a question.

 When an old pid file exists, crm_mon still starts as a daemon. However,
 crm_mon does not update this old pid file, and crm_mon then cannot be stopped.
 I would like to know if this behavior is correct.
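 
 A minimal way to reproduce what I mean, assuming crm_mon's usual daemon 
 options (the path is just an example):
 
   crm_mon -d -p /tmp/crm_mon.pid   # start as a daemon; the pid file is written
   kill -9 $(cat /tmp/crm_mon.pid)  # kill it, leaving a stale pid file behind
   crm_mon -d -p /tmp/crm_mon.pid   # a new daemon starts, but the stale pid file is not updated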

 Some of it is, but the part about crm_mon not updating the pid file (which 
 is probably also preventing it from stopping) is bad.
 I understand that this is undesirable behavior.
 Since we have found a problem, I think we should fix it.

 Done:

https://github.com/beekhof/pacemaker/commit/e549770

 Plus an extra bonus:

https://github.com/beekhof/pacemaker/commit/479c5cc


Thanks for the fix.
I have confirmed that it works.

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org