Re: [Pacemaker] mysql resource can't start
In addition: I have already tried to start the resource in the crm shell. Nothing happened, and there is no log from mysql.

Thanks.
-Chen

On 2013-5-9, at 17:10, Li, Chen <chen...@intel.com> wrote:

Hi list,

I'm a new user of pacemaker. I'm trying to use pacemaker to set up HA for mysql in master/slave mode. My configuration is:

node api01
node api02
primitive p_drbd_mysql ocf:linbit:drbd \
        params drbd_resource=mysql \
        op start interval=0 timeout=90s \
        op stop interval=0 timeout=180s \
        op promote interval=0 timeout=180s \
        op demote interval=0 timeout=180s \
        op monitor interval=30s role=Slave \
        op monitor interval=29s role=Master
primitive p_fs_mysql ocf:heartbeat:Filesystem \
        params device=/dev/drbd/by-res/mysql directory=/var/lib/mysql fstype=ext4 options=relatime \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=180s \
        op monitor interval=60s timeout=60s
primitive p_ip_mysql ocf:heartbeat:IPaddr2 \
        params ip=192.168.11.13 cidr_netmask=16 \
        op monitor interval=30s \
        meta is-managed=true
primitive p_mysql ocf:heartbeat:mysql \
        params additional_parameters=--bind-address=192.168.11.13 config=/etc/mysql/my.cnf pid=/var/run/mysqld/mysqld.pid socket=/var/run/mysqld/mysqld.sock log=/var/log/mysql/mysqld.log \
        op monitor interval=20s timeout=10s \
        op start interval=0 timeout=120s \
        op stop interval=0 timeout=120s \
        meta is-managed=true
group g_mysql p_ip_mysql p_fs_mysql p_mysql \
        meta target-role=Started is-managed=true
ms ms_drbd_mysql p_drbd_mysql \
        meta notify=true clone-max=2
colocation c_mysql_on_drbd inf: g_mysql ms_drbd_mysql:Master
order o_drbd_before_mysql inf: ms_drbd_mysql:promote g_mysql:start
property $id=cib-bootstrap-options \
        dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
        cluster-infrastructure=openais \
        expected-quorum-votes=2 \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        maintenance-mode=true \
        last-lrm-refresh=1368118902 \
        stop-all-resources=false

But the resource p_mysql never starts.
And all resources have an "unmanaged" mark in the status:

# crm resource status
 Resource Group: g_mysql
     p_ip_mysql   (ocf::heartbeat:IPaddr2)     Started (unmanaged)
     p_fs_mysql   (ocf::heartbeat:Filesystem)  Started (unmanaged)
     p_mysql      (ocf::heartbeat:mysql)       Stopped (unmanaged)
 Master/Slave Set: ms_drbd_mysql [p_drbd_mysql] (unmanaged)
     p_drbd_mysql:0  (ocf::linbit:drbd)  Master (unmanaged)
     p_drbd_mysql:1  (ocf::linbit:drbd)  Slave (unmanaged)

Can anyone help me?

Thanks.
-chen

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] mysql resource can't start
Hello Li,

Maybe all your resources are in the unmanaged state because you set maintenance-mode=true.
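As the reply notes, cluster-wide maintenance mode makes Pacemaker ignore every resource, which matches the "(unmanaged)" marks in the status output. A minimal sketch of clearing it with the crm shell used in the original post (verify afterwards that the marks disappear):

```shell
# Clear cluster-wide maintenance mode so Pacemaker manages resources again.
crm configure property maintenance-mode=false

# The "(unmanaged)" suffix should now be gone and p_mysql should be started
# by the cluster according to the configured order/colocation constraints.
crm_mon -1
```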
Re: [Pacemaker] Move Resources
----- Original Message -----
From: Fusaro Marcelo Damian <mfus...@cnc.gov.ar>
To: pacemaker@oss.clusterlabs.org
Sent: Thursday, May 9, 2013 10:02:47 AM
Subject: [Pacemaker] Move Resources

I have configured a cluster with 2 nodes and 2 resources in active-passive mode. Everything is working fine, but I want to arrange it so that when one resource fails, all resources move to the passive node. Is this possible?

Yes - look at colocation, resource stickiness, and migration threshold.

HTH
Jake

Thanks
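For illustration, the three suggested settings could be combined in crm shell syntax roughly as follows. This is a sketch with hypothetical resource names (p_ip, p_app), not the poster's actual configuration:

```shell
# migration-threshold=1: a single failure moves a resource off the node;
# resource-stickiness keeps resources where they are once healthy,
# so they do not bounce back as soon as the failed node recovers.
crm configure rsc_defaults resource-stickiness=100 migration-threshold=1

# Tie the second resource to wherever the first one runs, and order them,
# so that a failover of one drags the other along to the passive node.
crm configure colocation c_app_with_ip inf: p_app p_ip
crm configure order o_ip_before_app inf: p_ip p_app
```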
[Pacemaker] resource starts but then fails right away
Using Pacemaker 1.1.7 on EL6.3, I'm getting an intermittent recurrence of a situation where I add a resource and start it; it seems to start, but then fails right away:

# clean up the resource before trying to start, just to make sure we start with a clean slate
# crm resource cleanup testfs-resource1
Cleaning up testfs-resource1 on node1
Waiting for 2 replies from the CRMd.. OK

# now try to start it
# crm_resource -r testfs-resource1 -p target-role -m -v Started

# monitor the start-up for success
# crm resource status testfs-resource1
resource testfs-resource1 is NOT running
# crm resource status testfs-resource1
resource testfs-resource1 is NOT running
# crm resource status testfs-resource1
resource testfs-resource1 is NOT running
...
# crm resource status testfs-resource1
resource testfs-resource1 is NOT running
# crm resource status testfs-resource1
resource testfs-resource1 is running on: node1

# it started. check once more:
# crm status

Last updated: Tue May 7 02:37:34 2013
Last change: Tue May 7 02:36:17 2013 via crm_resource on node1
Stack: openais
Current DC: node1 - partition WITHOUT quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
1 Nodes configured, 2 expected votes
3 Resources configured.

Online: [ node1 ]

st-fencing (stonith:fence_foo): Started node1
resource2 (ocf::foo:Target): Started node1
testfs-resource1 (ocf::foo:Target): Started node1 FAILED

Failed actions:
    testfs-resource1_monitor_0 (node=node1, call=-1, rc=1, status=Timed Out): unknown error

# but lo and behold, it failed, with a monitor operation failing.
# stop it
# crm_resource -r testfs-resource1 -p target-role -m -v Stopped
0

The syslog for this whole operation, starting with adding the resource, is as follows:

May 7 02:36:12 node1 cib[16831]: info: cib:diff: - cib admin_epoch=0 epoch=15 num_updates=4 /
May 7 02:36:12 node1 crmd[16836]: info: abort_transition_graph: te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.16.1) : Non-status change
May 7 02:36:12 node1 crmd[16836]: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + cib epoch=16 num_updates=1 admin_epoch=0 validate-with=pacemaker-1.2 crm_feature_set=3.0.6 update-origin=node1 update-client=crm_resource cib-last-written=Tue May 7 02:35:56 2013 have-quorum=0 dc-uuid=node1
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + configuration
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + resources
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + primitive class=ocf provider=foo type=Target id=testfs-resource1 __crm_diff_marker__=added:top
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + meta_attributes id=testfs-resource1-meta_attributes
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + nvpair name=target-role id=testfs-resource1-meta_attributes-target-role value=Stopped /
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + /meta_attributes
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + operations id=testfs-resource1-operations
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + op id=testfs-resource1-monitor-5 interval=5 name=monitor timeout=60 /
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + op id=testfs-resource1-start-0 interval=0 name=start timeout=300 /
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + op id=testfs-resource1-stop-0 interval=0 name=stop timeout=300 /
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + /operations
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + instance_attributes id=testfs-resource1-instance_attributes
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + nvpair id=testfs-resource1-instance_attributes-target name=target value=364cfbf8-26dc-44c9-98ad-f8f9d0fafd9a /
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + /instance_attributes
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + /primitive
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + /resources
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + /configuration
May 7 02:36:12 node1 cib[16831]: info: cib:diff: + /cib
May 7 02:36:12 node1 cib[16831]: info: cib_process_request: Operation complete: op cib_create for section resources (origin=local/cibadmin/2, version=0.16.1): ok (rc=0)
May 7 02:36:12 node1 pengine[16835]: notice: unpack_config: On loss of CCM Quorum: Ignore
May 7 02:36:12 node1 crmd[16836]: notice: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
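The key datum in the status output above is the failed probe (monitor_0 with status=Timed Out). When scripting around intermittent failures like this, it can help to pull the rc and status fields out of a "Failed actions" line mechanically. A small sketch, with the regular expressions tailored only to the exact line format shown above:

```shell
# Extract rc= and status= from one "Failed actions" entry, as printed
# in the crm status output above.
line='testfs-resource1_monitor_0 (node=node1, call=-1, rc=1, status=Timed Out): unknown error'

rc=$(echo "$line" | sed -n 's/.*rc=\([0-9-]*\),.*/\1/p')
status=$(echo "$line" | sed -n 's/.*status=\([^)]*\)).*/\1/p')

echo "$rc $status"   # 1 Timed Out
```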
[Pacemaker] ClusterMon Resource starting multiple instances of crm_mon
I'm having some issues getting cluster monitoring set up and configured on a 3-node multi-state cluster. I'm using Florian's blog as an example: http://floriancrouzat.net/2013/01/monitor-a-pacemaker-cluster-with-ocfpacemakerclustermon-andor-external-agent/. When I create the primitive resource it starts on one of my nodes but spawns multiple instances of crm_mon. I don't see any reason that would cause it to spawn multiple instances; it's very odd behavior.

I was also looking for some clarification on what this resource provides. It looks to me like it kicks off crm_mon in daemon mode, which updates an .html file and, with -E, runs an external script. But the resource itself doesn't trigger anything when another resource changes state; it only reacts if the crm_mon process (monitored via its PID) fails and has to be restarted. If this is correct, what is the best practice for monitoring additional resource states?

v/r
STEVE

Below are some additional data points.

Creating the resource:

[root@pgdb2 tmp]# crm configure primitive SNMPMon ocf:pacemaker:ClusterMon \
        params user=root update=30 extra_options="-E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net" \
        op monitor on-fail=restart interval=60

Manual crm_mon output:

Last updated: Thu May 9 10:24:30 2013
Last change: Thu May 9 10:20:49 2013 via cibadmin on pgdb2.example.com
Stack: cman
Current DC: pgdb1.example.com - partition with quorum
Version: 1.1.8-7.el6-394e906
3 Nodes configured, unknown expected votes
6 Resources configured.

Node pgdb1.example.com: standby
Online: [ pgdb2.example.com pgdb3.example.com ]

PG_REP_VIP (ocf::heartbeat:IPaddr2): Started pgdb2.example.com
PG_CLI_VIP (ocf::heartbeat:IPaddr2): Started pgdb2.example.com
Master/Slave Set: msPGSQL [PGSQL]
    Masters: [ pgdb2.example.com ]
    Slaves: [ pgdb3.example.com ]
    Stopped: [ PGSQL:2 ]
SNMPMon (ocf::pacemaker:ClusterMon): Started pgdb3.example.com

ps to check for the process on pgdb3:

[root@pgdb3 tmp]# ps aux | grep crm_mon
root 16097 0.0 0.0 82624 2784 ? S 10:20 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
root 16099 0.0 0.0 82624 2660 ? S 10:20 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
root 16104 0.0 0.0 82624 2448 ? S 10:20 0:00 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0 -E /usr/local/bin/pcmk_snmp_helper.sh -e zen.arin.net -h /tmp/ClusterMon_SNMPMon.html
root 16515 0.0 0.0 103244 852 pts/0 S+ 10:21 0:00 grep crm_mon

Output from corosync.log:

May 09 10:20:51 [3100] pgdb3.cha.arin.net lrmd: info: process_lrmd_get_rsc_info: Resource 'SNMPMon' not found (3 active resources)
May 09 10:20:51 [3100] pgdb3.cha.arin.net lrmd: info: process_lrmd_rsc_register: Added 'SNMPMon' to the rsc list (4 active resources)
May 09 10:20:52 [3103] pgdb3.cha.arin.net crmd: info: services_os_action_execute: Managed ClusterMon_meta-data_0 process 16010 exited with rc=0
May 09 10:20:52 [3103] pgdb3.cha.arin.net crmd: notice: process_lrm_event: LRM operation SNMPMon_monitor_0 (call=61, rc=7, cib-update=28, confirmed=true) not running
May 09 10:20:52 [3103] pgdb3.cha.arin.net crmd: notice: process_lrm_event: LRM operation SNMPMon_start_0 (call=64, rc=0, cib-update=29, confirmed=true) ok
May 09 10:20:52 [3103] pgdb3.cha.arin.net crmd: notice: process_lrm_event: LRM operation SNMPMon_monitor_6 (call=67, rc=0, cib-update=30, confirmed=false) ok
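Since the symptom is several identical crm_mon daemons for one ClusterMon resource, a quick sanity check is to count processes per pidfile (the -p argument). A sketch fed from a captured copy of the ps listing above (abbreviated columns; on a live node you would pipe `ps aux` in directly):

```shell
# Count crm_mon daemons per pidfile; more than one per pidfile means
# the ClusterMon resource spawned duplicates.
ps_output='root 16097 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0
root 16099 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0
root 16104 /usr/sbin/crm_mon -p /tmp/ClusterMon_SNMPMon.pid -d -i 0
root 16515 grep crm_mon'

# Keep only real daemons (not the grep itself), print the word after -p,
# then tally duplicates.
echo "$ps_output" | grep '/usr/sbin/crm_mon' \
    | awk '{for (i = 1; i <= NF; i++) if ($i == "-p") print $(i + 1)}' \
    | sort | uniq -c
```

With the listing above this prints a count of 3 for /tmp/ClusterMon_SNMPMon.pid, confirming the duplication.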
Re: [Pacemaker] crm_mon failed with upgrade failed message
On 05/08/2013 01:17 AM, Andrew Beekhof wrote:
> On 07/05/2013, at 11:42 PM, Michal Fiala <fi...@mobil.cz> wrote:
>> Hello,
>> I have updated a corosync/pacemaker cluster (versions below). The cluster is working fine, but when I change the configuration via crm configure edit, crm_mon exits with this error message:
>>
>> Your current configuration could only be upgraded to null... the minimum requirement is pacemaker-1.0.
>> Connection to the CIB terminated. Reconnecting... Upgrade failed: Update does not conform to the configured schema.
>>
>> A screenshot is in the attachment, crm_mon.png. I have done these steps; the problem still exists:
>>
>> drbd0 ~ # crm_verify -L || echo failed
>> drbd0 ~ #
>> drbd0 ~ # cibadmin -u --force || echo failed
>> drbd0 ~ #
>>
>> For my configuration, see attachment cib.xml (cibadmin -Q).
>>
>> Update from:
>> sys-cluster/corosync-1.4.4
>> sys-cluster/pacemaker-1.1.6.1
>> to:
>> sys-cluster/corosync-1.4.5
>> sys-cluster/pacemaker-1.1.8-r2
>> sys-cluster/crmsh-1.2.5
>>
>> Please, how can I fix this problem?
>
> You'll need to update I'm afraid. You're hitting the condition addressed by https://github.com/beekhof/pacemaker/commit/70b292b

I have patched to 1.1.9-138556c (it is called sys-cluster/pacemaker-1.1.10_rc1 in gentoo) and this problem was fixed. But all my comments in the pacemaker configuration have gone. I have tried to add new comments (crm configure edit + add comment), but after the configuration commit the comments are cleaned. Are comments not supported now?

Thanks
Re: [Pacemaker] 1.1.8 not compatible with 1.1.7?
Hi Andrew,

Yes, this clarifies a lot. It seems it is really time to throw away the plugin. The CMAN solution won't be able (at least from the documentation) to attach new nodes without reconfiguring and restarting CMAN on the existing nodes. The alternative is corosync 2.x. ClusterLabs has a quite long list of corosync versions from the 2.0, 2.1, 2.2 and 2.3 branches. Besides the currently reported issue with version 2.3, which version does ClusterLabs use for its regression tests? I found a note somewhere about 2.1.x; is this true?

Rainer

Sent: Thursday, 09 May 2013 at 04:31
From: Andrew Beekhof <and...@beekhof.net>
To: The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org>
Subject: Re: [Pacemaker] 1.1.8 not compatible with 1.1.7?

On 08/05/2013, at 4:53 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> On 08/05/2013, at 4:08 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>> On 03/05/2013, at 8:46 PM, Rainer Brestan <rainer.bres...@gmx.net> wrote:
>>> Now I have all the logs for some combinations.
>>> Corosync: 1.4.1-7 for all the tests on all nodes.
>>> The base is always a fresh installation of each node with all packages equal except the pacemaker version.
>>> int2node1 node id: 1743917066
>>> int2node2 node id: 1777471498
>>> Each ZIP file includes the log from both nodes and the status output of crm_mon and cibadmin -Q.
>>> 1.) 1.1.8-4 attaches to a running 1.1.7-6 cluster
>>> https://www.dropbox.com/s/06oyrle4ny47uv9/attach_1.1.8-4_to_1.1.7-6.zip
>>> Result: join outstanding
>>> 2.) 1.1.9-2 attaches to a running 1.1.7-6 cluster
>>> https://www.dropbox.com/s/fv5kcm2yb5jz56z/attach_1.1.9-2_to_1.1.7-6.zip
>>> Result: join outstanding
>> Neither side is seeing anything from the other, which is very unexpected. I notice you're using the plugin... which acts as a message router. So I suspect something in there has changed (though I'm at a loss to say what) and that cman based clusters are unaffected.
> Confirmed, cman clusters are unaffected. I'm yet to work out what changed in the plugin.

I worked it out...

The Red Hat changelog for 1.1.8-2 originally contained:

+- Cman is the only supported membership quorum provider, do not ship the corosync plugin

When this decision was reversed (when I realised no-one was seeing the ERROR logs indicating it was going away), I neglected to re-instate the following distro specific patch (which avoided conflicts between the ID used by CMAN and Pacemaker):

diff --git a/configure.ac b/configure.ac
index a3784d5..dafa9e2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1133,7 +1133,7 @@ AC_MSG_CHECKING(for native corosync)
 COROSYNC_LIBS=""
 CS_USES_LIBQB=0
-PCMK_SERVICE_ID=9
+PCMK_SERVICE_ID=10
 LCRSODIR=$libdir
 if test $SUPPORT_CS = no; then

So Pacemaker < 6.4 is talking on slot 10, while Pacemaker == 6.4 is using slot 9. This is why the two versions cannot see each other :-(

I'm very sorry.
Re: [Pacemaker] 1.1.8 not compatible with 1.1.7?
On 10/05/2013, at 6:05 AM, Rainer Brestan <rainer.bres...@gmx.net> wrote:
> Hi Andrew, yes, this clarifies a lot. It seems it is really time to throw away the plugin. The CMAN solution won't be able (at least from the documentation) to attach new nodes without reconfiguring and restarting CMAN on the existing nodes.

That doesn't sound right to me. CC'ing Fabio who should know more (or who does).

> The alternative is corosync 2.x.

Not on RHEL6 - unless you're building things yourself of course.

> ClusterLabs has a quite long list of corosync versions from the 2.0, 2.1, 2.2 and 2.3 branches. Besides the currently reported issue with version 2.3, which version does ClusterLabs use for its regression tests? I found a note somewhere about 2.1.x; is this true?

According to rpm, I've been using:

Source RPM : corosync-2.3.0-1.1.2c22.el7.src.rpm

and

Source RPM : corosync-2.3.0-1.fc18.src.rpm
[Pacemaker] SmartOS / illumos
Hello,

I'm trying to compile pacemaker on SmartOS and getting errors during make. Has anyone already compiled it successfully on SmartOS? Or can someone help me solve the problem I'm having now?

Thank you,

Glib version: glib2-2.34.3

GCC version:

[root@web01 ~/pacemaker]# gcc -v
Using built-in specs.
COLLECT_GCC=/opt/local/gcc47/bin/gcc
COLLECT_LTO_WRAPPER=/opt/local/gcc47/libexec/gcc/x86_64-sun-solaris2.11/4.7.2/lto-wrapper
Target: x86_64-sun-solaris2.11
Configured with: ../gcc-4.7.2/configure --enable-languages='c go fortran c++' --enable-shared --enable-long-long --with-local-prefix=/opt/local/gcc47 --enable-libssp --enable-threads=posix --with-boot-ldflags='-static-libstdc++ -static-libgcc -Wl,-R/opt/local/lib ' --disable-nls --enable-__cxa_atexit --with-gxx-include-dir=/opt/local/gcc47/include/c++/ --without-gnu-ld --with-ld=/usr/bin/ld --with-gnu-as --with-as=/opt/local/bin/gas --prefix=/opt/local/gcc47 --build=x86_64-sun-solaris2.11 --host=x86_64-sun-solaris2.11 --infodir=/opt/local/gcc47/info --mandir=/opt/local/gcc47/man
Thread model: posix
gcc version 4.7.2 (GCC)

Here is the error log of make:

gmake[2]: Entering directory `/root/pacemaker/lib/common'
CC ipc.lo
In file included from ../../include/crm_internal.h:26:0,
                 from ipc.c:19:
../../include/portability.h:81:16: error: expected '=', ',', ';', 'asm' or '__attribute__' before '*' token
../../include/portability.h:86:1: error: expected identifier or '(' before '}' token
../../include/portability.h:86:1: error: useless type name in empty declaration [-Werror]
../../include/portability.h:98:1: error: static declaration of 'g_hash_table_get_values' follows non-static declaration
In file included from /opt/local/include/glib/glib-2.0/glib.h:52:0,
                 from ../../include/portability.h:79,
                 from ../../include/crm_internal.h:26,
                 from ipc.c:19:
/opt/local/include/glib/glib-2.0/glib/ghash.h:101:13: note: previous declaration of 'g_hash_table_get_values' was here
In file included from ../../include/crm_internal.h:26:0,
                 from ipc.c:19:
../../include/portability.h: In function 'g_hash_table_nth_data':
../../include/portability.h:111:13: error: 'GHashTableIter' has no member named 'lpc'
../../include/portability.h:111:28: error: 'GHashTableIter' has no member named 'nth'
../../include/portability.h:112:13: error: 'GHashTableIter' has no member named 'key'
../../include/portability.h:113:13: error: 'GHashTableIter' has no member named 'value'
../../include/portability.h: At top level:
../../include/portability.h:121:1: error: static declaration of 'g_hash_table_iter_init' follows non-static declaration
In file included from /opt/local/include/glib/glib-2.0/glib.h:52:0,
                 from ../../include/portability.h:79,
                 from ../../include/crm_internal.h:26,
                 from ipc.c:19:
/opt/local/include/glib/glib-2.0/glib/ghash.h:103:13: note: previous declaration of 'g_hash_table_iter_init' was here
In file included from ../../include/crm_internal.h:26:0,
                 from ipc.c:19:
../../include/portability.h: In function 'g_hash_table_iter_init':
../../include/portability.h:123:9: error: 'GHashTableIter' has no member named 'hash'
../../include/portability.h:124:9: error: 'GHashTableIter' has no member named 'nth'
../../include/portability.h:125:9: error: 'GHashTableIter' has no member named 'lpc'
../../include/portability.h:126:9: error: 'GHashTableIter' has no member named 'key'
../../include/portability.h:127:9: error: 'GHashTableIter' has no member named 'value'
../../include/portability.h: At top level:
../../include/portability.h:131:1: error: static declaration of 'g_hash_table_iter_next' follows non-static declaration
In file included from /opt/local/include/glib/glib-2.0/glib.h:52:0,
                 from ../../include/portability.h:79,
                 from ../../include/crm_internal.h:26,
                 from ipc.c:19:
/opt/local/include/glib/glib-2.0/glib/ghash.h:105:13: note: previous declaration of 'g_hash_table_iter_next' was here
In file included from ../../include/crm_internal.h:26:0,
                 from ipc.c:19:
../../include/portability.h: In function 'g_hash_table_iter_next':
../../include/portability.h:135:9: error: 'GHashTableIter' has no member named 'lpc'
../../include/portability.h:136:9: error: 'GHashTableIter' has no member named 'key'
../../include/portability.h:137:9: error: 'GHashTableIter' has no member named 'value'
../../include/portability.h:138:13: error: 'GHashTableIter' has no member named 'nth'
../../include/portability.h:138:43: error: 'GHashTableIter' has no member named 'hash'
../../include/portability.h:139:42: error: 'GHashTableIter' has no member named 'hash'
../../include/portability.h:140:13: error: 'GHashTableIter' has no member named 'nth'
../../include/portability.h:143:20: error: 'GHashTableIter' has no member named 'key'
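These errors show portability.h's fallback definitions of g_hash_table_iter_* colliding with a glib that already provides them natively (glib has shipped GHashTableIter since 2.16, and this system has 2.34.3). A quick way to confirm which glib the build is seeing is pkg-config; a diagnostic sketch, assuming pkg-config is installed and the pkgsrc paths shown above:

```shell
# If this reports >= 2.16, glib already provides g_hash_table_iter_init/next
# and the portability.h fallbacks must be compiled out, not redefined.
pkg-config --modversion glib-2.0
pkg-config --cflags glib-2.0   # should point at /opt/local/include/glib/glib-2.0
```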
Re: [Pacemaker] Using fence_sanlock with pacemaker 1.1.8-7.el6
On Thu, May 9, 2013 at 3:12 AM, Andrew Beekhof and...@beekhof.net wrote: On 08/05/2013, at 11:52 PM, John McCabe j...@johnmccabe.net wrote: Hi, I've been trying, unsuccessfully, to get fence_sanlock running as a fence device within pacemaker 1.1.8 in Centos64. I've set the pcmk_host_argument=host_id You mean the literal string host_id or the true value? Might be better to send us the actual config you're using along with log files. Also, what does fence_sanlock -o metadata say? [root@fee ~]# fence_sanlock -o metadata ?xml version=1.0 ? resource-agent name=fence_sanlock shortdesc=Fence agent for watchdog and shared storage longdesc fence_sanlock is an i/o fencing agent that uses the watchdog device to reset nodes. Shared storage (block or file) is used by sanlock to ensure that fenced nodes are reset, and to notify partitioned nodes that they need to be reset. /longdesc vendor-urlhttp://www.redhat.com//vendor-url parameters parameter name=action unique=0 required=1 getopt mixed=-o lt;actiongt; / content type=string default=off / shortdesc lang=enFencing Action/shortdesc /parameter parameter name=path unique=0 required=1 getopt mixed=-p lt;actiongt; / content type=string / shortdesc lang=enPath to sanlock shared storage/shortdesc /parameter parameter name=host_id unique=0 required=1 getopt mixed=-i lt;actiongt; / content type=string / shortdesc lang=enHost id for sanlock (1-128)/shortdesc /parameter /parameters actions action name=on / action name=off / action name=status / action name=metadata / action name=sanlock_init / /actions /resource-agent I'd set the pcmk_host_argument to the literal string since the errors thrown complain that the host_id param is missing and I'd assumed that the with the pcmk_host_map also set we'd end up passing the mapped id rather than the hostname (attached an archive with /var/log/messages covering when the stonith device is added with pcs): May 10 01:33:42 fee stonith-ng[10542]: warning: log_operation: st-sanlock:10725 [ host_id 
argument required ] pcs -f stonith_cfg_sanlock stonith create st-sanlock fence_sanlock path=/dev/mapper/vg_shared-lv_sanlock pcmk_host_list=fee-1 fi-1 pcmk_host_map=fee-1:1;fi-1:2 pcmk_host_argument=host_id Taking a closer look at the fence_sanlock script itself (from fence-sanlock-2.6-2.el6.x86_64) and it doesn't appear to support a monitor operation, which led me to suspect that it didn't actually support being used with pacemaker, at least without possibly having to update the agent script. Setting pcmk_monitor_action=status didn't help either as it still fails requesting the host_id be set. I ended up getting sbd up and running as an interim solution - but I'd really like to be able to stick with a fencing agent thats got a future in RHEL where possible. Is the expectation/intention that all fencing agents be compatible with pacemaker? along with pcmk_host_map, pcmk_host_list and path. But it later complains that its unable to process the monitor operation since no host_id is provided.. I'd have assumed that the pcmk_host_argument would have performed a mapping, but it seems not to in the case of the monitor operation. When pcs stonith list returned fence_sanlock in its list of agents I'd hoped it was going to be straightforward. Is fence_sanlock actually compatible with pacemaker, and has anyone had success using it with pacemaker rather than just directly within CMAN? 
Yours confused, John ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
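For readers following along: the pcmk_host_map string in the pcs command above, fee-1:1;fi-1:2, maps node names to the value that stonith-ng passes to the agent via the parameter named in pcmk_host_argument (here, host_id). The real translation happens inside stonith-ng; a minimal shell re-implementation of the lookup, for illustration only, using the map from the post:

```shell
#!/bin/sh
# Hypothetical re-implementation of the pcmk_host_map lookup, for
# illustration; the real mapping is done inside stonith-ng, not here.
host_map="fee-1:1;fi-1:2"    # format: node:value;node:value

map_lookup() {
    # $1 = node name; prints the mapped value (the sanlock host id here)
    echo "$host_map" | tr ';' '\n' | awk -F: -v n="$1" '$1 == n { print $2 }'
}

map_lookup fee-1    # prints 1
map_lookup fi-1     # prints 2
```

The important detail from the thread is that this translation is applied when fencing a specific host; it does not apply to the device-level monitor operation, which is why the agent still complained about a missing host_id.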
Re: [Pacemaker] SmartOS / illumos
Looks like you need https://github.com/beekhof/pacemaker/commit/629aa36 I recall making that change once before but it got lost somehow. On 10/05/2013, at 10:02 AM, Dalho PARK dp...@smart-trade.net wrote: Hello, I'm trying to compile pacemaker on SmartOS and hitting errors during make. Has anyone already compiled it successfully on SmartOS? Or can someone help solve the problem I'm having now? Thank you,
Glib version: glib2-2.34.3
GCC version:
[root@web01 ~/pacemaker]# gcc -v
Using built-in specs.
COLLECT_GCC=/opt/local/gcc47/bin/gcc
COLLECT_LTO_WRAPPER=/opt/local/gcc47/libexec/gcc/x86_64-sun-solaris2.11/4.7.2/lto-wrapper
Target: x86_64-sun-solaris2.11
Configured with: ../gcc-4.7.2/configure --enable-languages='c go fortran c++' --enable-shared --enable-long-long --with-local-prefix=/opt/local/gcc47 --enable-libssp --enable-threads=posix --with-boot-ldflags='-static-libstdc++ -static-libgcc -Wl,-R/opt/local/lib ' --disable-nls --enable-__cxa_atexit --with-gxx-include-dir=/opt/local/gcc47/include/c++/ --without-gnu-ld --with-ld=/usr/bin/ld --with-gnu-as --with-as=/opt/local/bin/gas --prefix=/opt/local/gcc47 --build=x86_64-sun-solaris2.11 --host=x86_64-sun-solaris2.11 --infodir=/opt/local/gcc47/info --mandir=/opt/local/gcc47/man
Thread model: posix
gcc version 4.7.2 (GCC)
Here is the error log of make:
gmake[2]: Entering directory `/root/pacemaker/lib/common'
CC ipc.lo
In file included from ../../include/crm_internal.h:26:0, from ipc.c:19:
../../include/portability.h:81:16: error: expected '=', ',', ';', 'asm' or '__attribute__' before '*' token
../../include/portability.h:86:1: error: expected identifier or '(' before '}' token
../../include/portability.h:86:1: error: useless type name in empty declaration [-Werror]
../../include/portability.h:98:1: error: static declaration of 'g_hash_table_get_values' follows non-static declaration
In file included from /opt/local/include/glib/glib-2.0/glib.h:52:0, from ../../include/portability.h:79, from ../../include/crm_internal.h:26, from ipc.c:19:
/opt/local/include/glib/glib-2.0/glib/ghash.h:101:13: note: previous declaration of 'g_hash_table_get_values' was here
In file included from ../../include/crm_internal.h:26:0, from ipc.c:19:
../../include/portability.h: In function 'g_hash_table_nth_data':
../../include/portability.h:111:13: error: 'GHashTableIter' has no member named 'lpc'
../../include/portability.h:111:28: error: 'GHashTableIter' has no member named 'nth'
../../include/portability.h:112:13: error: 'GHashTableIter' has no member named 'key'
../../include/portability.h:113:13: error: 'GHashTableIter' has no member named 'value'
../../include/portability.h: At top level:
../../include/portability.h:121:1: error: static declaration of 'g_hash_table_iter_init' follows non-static declaration
In file included from /opt/local/include/glib/glib-2.0/glib.h:52:0, from ../../include/portability.h:79, from ../../include/crm_internal.h:26, from ipc.c:19:
/opt/local/include/glib/glib-2.0/glib/ghash.h:103:13: note: previous declaration of 'g_hash_table_iter_init' was here
In file included from ../../include/crm_internal.h:26:0, from ipc.c:19:
../../include/portability.h: In function 'g_hash_table_iter_init':
../../include/portability.h:123:9: error: 'GHashTableIter' has no member named 'hash'
../../include/portability.h:124:9: error: 'GHashTableIter' has no member named 'nth'
../../include/portability.h:125:9: error: 'GHashTableIter' has no member named 'lpc'
../../include/portability.h:126:9: error: 'GHashTableIter' has no member named 'key'
../../include/portability.h:127:9: error: 'GHashTableIter' has no member named 'value'
../../include/portability.h: At top level:
../../include/portability.h:131:1: error: static declaration of 'g_hash_table_iter_next' follows non-static declaration
In file included from /opt/local/include/glib/glib-2.0/glib.h:52:0, from ../../include/portability.h:79, from ../../include/crm_internal.h:26, from ipc.c:19:
/opt/local/include/glib/glib-2.0/glib/ghash.h:105:13: note: previous declaration of 'g_hash_table_iter_next' was here
In file included from ../../include/crm_internal.h:26:0, from ipc.c:19:
../../include/portability.h: In function 'g_hash_table_iter_next':
../../include/portability.h:135:9: error: 'GHashTableIter' has no member named 'lpc'
../../include/portability.h:136:9: error: 'GHashTableIter' has no member named 'key'
../../include/portability.h:137:9: error: 'GHashTableIter' has no member named 'value'
../../include/portability.h:138:13: error: 'GHashTableIter' has no member named 'nth'
../../include/portability.h:138:43:
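Context for the errors above: portability.h carries fallback implementations of g_hash_table_get_values and the g_hash_table_iter_* functions for very old glib, and those fallbacks collide with the real ones in the poster's glib 2.34. The referenced commit guards the shims behind a version check. The same idea at configure time can be sketched in shell; the 2.16 threshold below is an assumption based on when GHashTableIter appeared in glib, not taken from the commit itself:

```shell
#!/bin/sh
# Decide whether glib compat shims are needed by comparing versions.
# version_ge A B -> true if A >= B (uses `sort -V` for the comparison)
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

glib_version="2.34.3"    # the poster's glib2-2.34.3

if version_ge "$glib_version" "2.16"; then
    echo "glib provides GHashTableIter; skip the compat shims"
else
    echo "old glib; compile the g_hash_table_iter_* fallbacks"
fi
```

With a check like this, the build would never define its own g_hash_table_iter_* against a glib that already ships them, which is exactly the "static declaration follows non-static declaration" conflict in the log.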
Re: [Pacemaker] Using fence_sanlock with pacemaker 1.1.8-7.el6
Forgot to include my cman config...
<?xml version="1.0"?>
<cluster config_version="1" name="ngwcluster">
  <logging debug="off"/>
  <clusternodes>
    <clusternode name="fee-1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="fee-1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fi-1" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="fi-1"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <cman two_node="1" expected_votes="1"/>
</cluster>
On Fri, May 10, 2013 at 2:18 AM, John McCabe j...@johnmccabe.net wrote: On Thu, May 9, 2013 at 3:12 AM, Andrew Beekhof and...@beekhof.net wrote: On 08/05/2013, at 11:52 PM, John McCabe j...@johnmccabe.net wrote: Hi, I've been trying, unsuccessfully, to get fence_sanlock running as a fence device within pacemaker 1.1.8 in Centos64. I've set the pcmk_host_argument=host_id You mean the literal string host_id or the true value? Might be better to send us the actual config you're using along with log files. Also, what does fence_sanlock -o metadata say? [root@fee ~]# fence_sanlock -o metadata ?xml version=1.0 ? resource-agent name=fence_sanlock shortdesc=Fence agent for watchdog and shared storage longdesc fence_sanlock is an i/o fencing agent that uses the watchdog device to reset nodes. Shared storage (block or file) is used by sanlock to ensure that fenced nodes are reset, and to notify partitioned nodes that they need to be reset.
/longdesc vendor-urlhttp://www.redhat.com//vendor-url parameters parameter name=action unique=0 required=1 getopt mixed=-o lt;actiongt; / content type=string default=off / shortdesc lang=enFencing Action/shortdesc /parameter parameter name=path unique=0 required=1 getopt mixed=-p lt;actiongt; / content type=string / shortdesc lang=enPath to sanlock shared storage/shortdesc /parameter parameter name=host_id unique=0 required=1 getopt mixed=-i lt;actiongt; / content type=string / shortdesc lang=enHost id for sanlock (1-128)/shortdesc /parameter /parameters actions action name=on / action name=off / action name=status / action name=metadata / action name=sanlock_init / /actions /resource-agent I'd set the pcmk_host_argument to the literal string since the errors thrown complain that the host_id param is missing and I'd assumed that the with the pcmk_host_map also set we'd end up passing the mapped id rather than the hostname (attached an archive with /var/log/messages covering when the stonith device is added with pcs): May 10 01:33:42 fee stonith-ng[10542]: warning: log_operation: st-sanlock:10725 [ host_id argument required ] pcs -f stonith_cfg_sanlock stonith create st-sanlock fence_sanlock path=/dev/mapper/vg_shared-lv_sanlock pcmk_host_list=fee-1 fi-1 pcmk_host_map=fee-1:1;fi-1:2 pcmk_host_argument=host_id Taking a closer look at the fence_sanlock script itself (from fence-sanlock-2.6-2.el6.x86_64) and it doesn't appear to support a monitor operation, which led me to suspect that it didn't actually support being used with pacemaker, at least without possibly having to update the agent script. Setting pcmk_monitor_action=status didn't help either as it still fails requesting the host_id be set. I ended up getting sbd up and running as an interim solution - but I'd really like to be able to stick with a fencing agent thats got a future in RHEL where possible. Is the expectation/intention that all fencing agents be compatible with pacemaker? 
along with pcmk_host_map, pcmk_host_list and path. But it later complains that it's unable to process the monitor operation since no host_id is provided. I'd have assumed that the pcmk_host_argument would have performed a mapping, but it seems not to in the case of the monitor operation. When pcs stonith list returned fence_sanlock in its list of agents I'd hoped it was going to be straightforward. Is fence_sanlock actually compatible with pacemaker, and has anyone had success using it with pacemaker rather than just directly within CMAN? Yours confused, John
Re: [Pacemaker] Using fence_sanlock with pacemaker 1.1.8-7.el6
On 10/05/2013, at 11:18 AM, John McCabe j...@johnmccabe.net wrote: On Thu, May 9, 2013 at 3:12 AM, Andrew Beekhof and...@beekhof.net wrote: On 08/05/2013, at 11:52 PM, John McCabe j...@johnmccabe.net wrote: Hi, I've been trying, unsuccessfully, to get fence_sanlock running as a fence device within pacemaker 1.1.8 in Centos64. I've set the pcmk_host_argument=host_id You mean the literal string host_id or the true value? Might be better to send us the actual config you're using along with log files. Also, what does fence_sanlock -o metadata say? [root@fee ~]# fence_sanlock -o metadata ?xml version=1.0 ? resource-agent name=fence_sanlock shortdesc=Fence agent for watchdog and shared storage longdesc fence_sanlock is an i/o fencing agent that uses the watchdog device to reset nodes. Shared storage (block or file) is used by sanlock to ensure that fenced nodes are reset, and to notify partitioned nodes that they need to be reset. /longdesc vendor-urlhttp://www.redhat.com//vendor-url parameters parameter name=action unique=0 required=1 getopt mixed=-o lt;actiongt; / content type=string default=off / shortdesc lang=enFencing Action/shortdesc /parameter parameter name=path unique=0 required=1 getopt mixed=-p lt;actiongt; / content type=string / shortdesc lang=enPath to sanlock shared storage/shortdesc /parameter parameter name=host_id unique=0 required=1 getopt mixed=-i lt;actiongt; / content type=string / shortdesc lang=enHost id for sanlock (1-128)/shortdesc /parameter /parameters actions action name=on / action name=off / action name=status / action name=metadata / action name=sanlock_init / /actions /resource-agent I'd set the pcmk_host_argument to the literal string since the errors thrown complain that the host_id param is missing and I'd assumed that the with the pcmk_host_map also set we'd end up passing the mapped id rather than the hostname (attached an archive with /var/log/messages covering when the stonith device is added with pcs): May 10 01:33:42 fee 
stonith-ng[10542]: warning: log_operation: st-sanlock:10725 [ host_id argument required ] pcs -f stonith_cfg_sanlock stonith create st-sanlock fence_sanlock path=/dev/mapper/vg_shared-lv_sanlock pcmk_host_list=fee-1 fi-1 pcmk_host_map=fee-1:1;fi-1:2 pcmk_host_argument=host_id Taking a closer look at the fence_sanlock script itself (from fence-sanlock-2.6-2.el6.x86_64), it doesn't appear to support a monitor operation, which led me to suspect that it didn't actually support being used with pacemaker, at least without possibly having to update the agent script. Correct Setting pcmk_monitor_action=status didn't help either as it still fails requesting the host_id be set. Also correct, since we're testing the agent itself, not a specific host_id. I ended up getting sbd up and running as an interim solution - but I'd really like to be able to stick with a fencing agent that's got a future in RHEL where possible. Is the expectation/intention that all fencing agents be compatible with pacemaker? Yes. There are a couple of ways in which this agent is not conforming to Red Hat's API, could you file a bug against it and CC me please?
Yours confused, John
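Andrew's point above is that the device-level monitor tests the agent itself, not a specific host, so an agent that insists on host_id for its status action is the nonconforming part. Until the agent is fixed, one hypothetical workaround (not endorsed in the thread) is a thin wrapper that supplies a placeholder host_id for status only. The sketch below stubs out fence_sanlock with a function that echoes its arguments, since the real agent is not modeled here:

```shell
#!/bin/sh
# Hypothetical wrapper supplying a dummy host_id for status/monitor only.
# The stub below stands in for the real /usr/sbin/fence_sanlock binary.
fence_sanlock() {    # stub for illustration; echoes the arguments it gets
    echo "fence_sanlock called with: $*"
}

fence_sanlock_wrap() {
    action="$1"; shift
    case "$action" in
        status|monitor)
            # the real agent insists on -i <host_id>; pass a placeholder
            fence_sanlock -o status -i 1 "$@" ;;
        *)
            fence_sanlock -o "$action" "$@" ;;
    esac
}

fence_sanlock_wrap status
```

Note the real fence agents also accept options on stdin, which this sketch ignores; the cleaner path, as Andrew suggests, is to file a bug so the agent's status action stops requiring host_id.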
Re: [Pacemaker] resource starts but then fails right away
On 10/05/2013, at 12:26 AM, Brian J. Murrell br...@interlinx.bc.ca wrote: I do see the: May 7 02:37:32 node1 crmd[16836]:error: print_elem: Aborting transition, action lost: [Action 5]: In-flight (id: testfs-resource1_monitor_0, loc: node1, priority: 0) in the log. Is that the root cause of the problem? Ordinarily I'd have said yes, but I also see: May 7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing resource testfs-resource1 for 18002_crm_resource (internal) on node1 May 7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation monitor[8] on ocf::Target::testfs-resource1 for client 16836 still running, flush delayed May 7 02:36:16 node1 crmd[16836]: info: lrm_remove_deleted_op: Removing op testfs-resource1_monitor_0:8 for deleted resource testfs-resource1 So apparently a badly timed cleanup was run. Did you do that or was it the crm shell? If so, what's that trying to tell me, exactly? If not, what is the cause of the problem? It really can't be the RA timing out since I give the monitor operation a 60 second timeout and the status action of the RA only takes a few seconds at most to run and is not really an operation that can get blocked on anything. It's effectively the grepping of a file. If the machine is heavily loaded, or just very busy with file I/O, that can still take quite a long time. I've seen IPaddr monitor actions take over a minute for example.
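On the "grepping of a file" point: even a trivial status check can stall under heavy I/O load, and a monitor overrunning its configured timeout is what gets a transition aborted. One defensive pattern is to bound the check well below the cluster-side timeout with coreutils' timeout(1) and map the result to OCF return codes. A sketch, assuming the OCF convention OCF_SUCCESS=0 / OCF_NOT_RUNNING=7; the state file path and its STARTED marker are made up for illustration:

```shell
#!/bin/sh
# Sketch of a monitor action that greps a state file under a hard time bound.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

state_file="/tmp/demo_target_state"   # stand-in for the RA's real state file
echo "STARTED" > "$state_file"        # simulate a running target

monitor() {
    # `timeout 10` bounds the grep so slow I/O fails fast instead of
    # silently eating into the cluster-side 60s operation timeout
    if timeout 10 grep -q "STARTED" "$state_file" 2>/dev/null; then
        return $OCF_SUCCESS
    else
        return $OCF_NOT_RUNNING
    fi
}

monitor; echo "monitor rc=$?"
```

Treating a timed-out check as "not running" is itself a policy decision; a real agent might prefer to report a generic error and let the cluster's failure handling decide.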
Re: [Pacemaker] failure handling on a cloned resource
On 07/05/2013, at 5:15 PM, Johan Huysmans johan.huysm...@inuits.be wrote: Hi, I only keep a couple of pe-input files, and that pe-input-1 version was already overwritten. I redid my tests as described in my previous mails. At the end of the test it was again written to pe-input-1, which is included as attachment. Perfect. Basically the PE doesn't know how to correctly recognise that d_tomcat_monitor_15000 needs to be processed after d_tomcat_last_failure_0:
<lrm_rsc_op id="d_tomcat_monitor_15000" operation_key="d_tomcat_monitor_15000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" transition-magic="0:0;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" call-id="44" rc-code="0" op-status="0" interval="15000" last-rc-change="1367910303" exec-time="0" queue-time="0" op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
<lrm_rsc_op id="d_tomcat_last_failure_0" operation_key="d_tomcat_monitor_15000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" transition-magic="0:1;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" call-id="44" rc-code="1" op-status="0" interval="15000" last-rc-change="1367909258" exec-time="0" queue-time="0" op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
which would allow it to recognise that the resource is healthy once again. I'll see what I can do... gr. Johan On 2013-05-07 04:08, Andrew Beekhof wrote: I have a much clearer idea of the problem you're seeing now, thank you. Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ? On 03/05/2013, at 10:40 PM, Johan Huysmans johan.huysm...@inuits.be wrote: Hi, Below you can see my setup and my test, this shows that my cloned resource with on-fail=block does not recover automatically.
My Setup: # rpm -aq | grep -i pacemaker pacemaker-libs-1.1.9-1512.el6.i686 pacemaker-cluster-libs-1.1.9-1512.el6.i686 pacemaker-cli-1.1.9-1512.el6.i686 pacemaker-1.1.9-1512.el6.i686 # crm configure show node CSE-1 node CSE-2 primitive d_tomcat ocf:ntc:tomcat \ op monitor interval=15s timeout=510s on-fail=block \ op start interval=0 timeout=510s \ params instance_name=NMS monitor_use_ssl=no monitor_urls=/cse/health monitor_timeout=120 \ meta migration-threshold=1 primitive ip_11 ocf:heartbeat:IPaddr2 \ op monitor interval=10s \ params broadcast=172.16.11.31 ip=172.16.11.31 nic=bond0.111 iflabel=ha \ meta migration-threshold=1 failure-timeout=10 primitive ip_19 ocf:heartbeat:IPaddr2 \ op monitor interval=10s \ params broadcast=172.18.19.31 ip=172.18.19.31 nic=bond0.119 iflabel=ha \ meta migration-threshold=1 failure-timeout=10 group svc-cse ip_19 ip_11 clone cl_tomcat d_tomcat colocation colo_tomcat inf: svc-cse cl_tomcat order order_tomcat inf: cl_tomcat svc-cse property $id=cib-bootstrap-options \ dc-version=1.1.9-1512.el6-2a917dd \ cluster-infrastructure=cman \ pe-warn-series-max=9 \ no-quorum-policy=ignore \ stonith-enabled=false \ pe-input-series-max=9 \ pe-error-series-max=9 \ last-lrm-refresh=1367582088 Currently only 1 node is available, CSE-1. 
This is how I am currently testing my setup: = Starting point: Everything up and running # crm resource status Resource Group: svc-cse ip_19(ocf::heartbeat:IPaddr2):Started ip_11(ocf::heartbeat:IPaddr2):Started Clone Set: cl_tomcat [d_tomcat] Started: [ CSE-1 ] Stopped: [ d_tomcat:1 ] = Causing failure: Change system so tomcat is running but has a failure (in attachment step_2.log) # crm resource status Resource Group: svc-cse ip_19(ocf::heartbeat:IPaddr2):Stopped ip_11(ocf::heartbeat:IPaddr2):Stopped Clone Set: cl_tomcat [d_tomcat] d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED Stopped: [ d_tomcat:1 ] = Fixing failure: Revert system so tomcat is running without failure (in attachment step_3.log) # crm resource status Resource Group: svc-cse ip_19(ocf::heartbeat:IPaddr2):Stopped ip_11(ocf::heartbeat:IPaddr2):Stopped Clone Set: cl_tomcat [d_tomcat] d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED Stopped: [ d_tomcat:1 ] As you can see in the logs the OCF script doesn't return any failure. This is noticed by pacemaker, however it doesn't reflect in crm_mon and it doesn't start the depending resources. Gr. Johan On 2013-05-03 03:04, Andrew Beekhof wrote: On 02/05/2013, at 5:45 PM, Johan Huysmans johan.huysm...@inuits.be wrote: On 2013-05-01 05:48, Andrew Beekhof wrote: On 17/04/2013, at 9:54 PM, Johan Huysmans johan.huysm...@inuits.be wrote: Hi All, I'm trying to setup a specific configuration in our cluster, however I'm struggling with my configuration. This is what I'm trying to achieve: On both nodes of the
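To make Andrew's diagnosis above concrete: the two lrm_rsc_op history entries he quotes share the same operation key but differ in rc-code and last-rc-change (success at 1367910303, failure at 1367909258), and the bug is that the PE wasn't ordering them, so the stale failure could mask the newer success. Newest-wins selection, using the timestamps from the thread, can be sketched as a simplified stand-in for what the actual PE code does:

```shell
#!/bin/sh
# Pick the most recent monitor result by last-rc-change (newest wins).
# Each record is "<last-rc-change> <rc-code>"; values taken from the
# quoted CIB: success (rc 0) is newer than the old failure (rc 1).
records="1367910303 0
1367909258 1"

latest_rc=$(printf '%s\n' "$records" | sort -rn | head -n1 | awk '{print $2}')
echo "current rc-code: $latest_rc"   # 0 => the resource is healthy again
```

With that ordering in place, the PE sees rc-code 0 as current state and can restart the resources that depend on the clone, which is exactly the recovery Johan was missing.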
Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8
On Fri, May 10, 2013 at 6:21 AM, Andrew Beekhof and...@beekhof.net wrote: On 08/05/2013, at 9:16 PM, pavan tc pavan...@gmail.com wrote: Hi Andrew, Thanks much for looking into this. I have some queries inline. Hi, I have a two-node cluster with STONITH disabled. Thats not a good idea. Ok. I'll try and configure stonith. I am still running with the pcmk plugin as opposed to the recommended CMAN plugin. On rhel6? Yes. With 1.1.8, I see some messages (appended to this mail) once in a while. I do not understand some keywords here - There is a Leave action. I am not sure what that is. It means the cluster is not going to change the state of the resource. Why did the cluster execute the Leave action at this point? Is there some other error that triggers this? Or is it a benign message? And, there is a CIB update failure that leads to a RECOVER action. There is a message that says the RECOVER action is not supported. Finally this leads to a stop and start of my resource. Well, and also Pacemaker's crmd process. My guess... the node is overloaded which is causing the cib queries to time out. Is there a cib query timeout value that I can set? I was earlier getting the TOTEM timeout. So, I set the token to a larger value (5 seconds) in corosync.conf and things were much better. But now, I have started hitting this problem. Thanks, Pavan I can copy the crm configure show output, but nothing special there. Thanks much. Pavan PS: The resource vha-bcd94724-3ec0-4a8d-8951-9d27be3a6acb is stale. The underlying device that represents this resource has been removed. However, the resource is still part of the CIB. All errors related to that resource can be ignored. But can this cause a node to be stopped/fenced? Not if fencing is disabled. 
Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8
Is there a cib query timeout value that I can set? I was earlier getting the TOTEM timeout. So, I set the token to a larger value (5 seconds) in corosync.conf and things were much better. But now, I have started hitting this problem. I'll experiment with the cibadmin -t (--timeout) option to see if it helps. As I can see from the code, the default seems to be 30 ms. Is there a widely used default for systems with a high load or is it found out the hard way for each setup? Pavan
Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8
On 10/05/2013, at 1:44 PM, pavan tc pavan...@gmail.com wrote: On Fri, May 10, 2013 at 6:21 AM, Andrew Beekhof and...@beekhof.net wrote: On 08/05/2013, at 9:16 PM, pavan tc pavan...@gmail.com wrote: Hi Andrew, Thanks much for looking into this. I have some queries inline. Hi, I have a two-node cluster with STONITH disabled. That's not a good idea. Ok. I'll try and configure stonith. I am still running with the pcmk plugin as opposed to the recommended CMAN plugin. On rhel6? Yes. With 1.1.8, I see some messages (appended to this mail) once in a while. I do not understand some keywords here - There is a Leave action. I am not sure what that is. It means the cluster is not going to change the state of the resource. Why did the cluster execute the Leave action at this point? There is no Leave action being executed. We are simply logging that nothing is going to happen to that resource - it is in the state that we expect/want. Is there some other error that triggers this? Or is it a benign message? And, there is a CIB update failure that leads to a RECOVER action. There is a message that says the RECOVER action is not supported. Finally this leads to a stop and start of my resource. Well, and also Pacemaker's crmd process. My guess... the node is overloaded which is causing the cib queries to time out. Is there a cib query timeout value that I can set? No. You can set the batch-limit property though; this reduces the rate at which CIB operations are attempted I was earlier getting the TOTEM timeout. So, I set the token to a larger value (5 seconds) in corosync.conf and things were much better. But now, I have started hitting this problem. Thanks, Pavan I can copy the crm configure show output, but nothing special there. Thanks much. Pavan PS: The resource vha-bcd94724-3ec0-4a8d-8951-9d27be3a6acb is stale. The underlying device that represents this resource has been removed. However, the resource is still part of the CIB.
All errors related to that resource can be ignored. But can this cause a node to be stopped/fenced? Not if fencing is disabled.
Re: [Pacemaker] crmd restart due to internal error - pacemaker 1.1.8
I'll experiment with the cibadmin -t (--timeout) option to see if it helps. As I can see from the code, the default seems to be 30 ms. Is there a widely used default for systems with a high load or is it found out the hard way for each setup? Easier said than done. Can someone help with how to use the --timeout option in cibadmin? Pavan
Re: [Pacemaker] failure handling on a cloned resource
Fixed! https://github.com/beekhof/pacemaker/commit/d87de1b On 10/05/2013, at 11:59 AM, Andrew Beekhof and...@beekhof.net wrote: On 07/05/2013, at 5:15 PM, Johan Huysmans johan.huysm...@inuits.be wrote: Hi, I only keep a couple of pe-input file, and that pe-inpurt-1 version was already overwritten. I redid my tests as describe in my previous mails. At the end of the test it was again written to pe-input1, which is included as attachment. Perfect. Basically the PE doesn't know how to correctly recognise that d_tomcat_monitor_15000 needs to be processed after d_tomcat_last_failure_0: lrm_rsc_op id=d_tomcat_monitor_15000 operation_key=d_tomcat_monitor_15000 operation=monitor crm-debug-origin=do_update_resource crm_feature_set=3.0.7 transition-key=18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89 transition-magic=0:0;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89 call-id=44 rc-code=0 op-status=0 interval=15000 last-rc-change=1367910303 exec-time=0 queue-time=0 op-digest=0c738dfc69f09a62b7ebf32344fddcf6/ lrm_rsc_op id=d_tomcat_last_failure_0 operation_key=d_tomcat_monitor_15000 operation=monitor crm-debug-origin=do_update_resource crm_feature_set=3.0.7 transition-key=18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89 transition-magic=0:1;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89 call-id=44 rc-code=1 op-status=0 interval=15000 last-rc-change=1367909258 exec-time=0 queue-time=0 op-digest=0c738dfc69f09a62b7ebf32344fddcf6/ which would allow it to recognise that the resource is healthy one again. I'll see what I can do... gr. Johan On 2013-05-07 04:08, Andrew Beekhof wrote: I have a much clearer idea of the problem you're seeing now, thankyou. Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ? On 03/05/2013, at 10:40 PM, Johan Huysmans johan.huysm...@inuits.be wrote: Hi, Below you can see my setup and my test, this shows that my cloned resource with on-fail=block does not recover automatically. 
My Setup: # rpm -aq | grep -i pacemaker pacemaker-libs-1.1.9-1512.el6.i686 pacemaker-cluster-libs-1.1.9-1512.el6.i686 pacemaker-cli-1.1.9-1512.el6.i686 pacemaker-1.1.9-1512.el6.i686 # crm configure show node CSE-1 node CSE-2 primitive d_tomcat ocf:ntc:tomcat \ op monitor interval=15s timeout=510s on-fail=block \ op start interval=0 timeout=510s \ params instance_name=NMS monitor_use_ssl=no monitor_urls=/cse/health monitor_timeout=120 \ meta migration-threshold=1 primitive ip_11 ocf:heartbeat:IPaddr2 \ op monitor interval=10s \ params broadcast=172.16.11.31 ip=172.16.11.31 nic=bond0.111 iflabel=ha \ meta migration-threshold=1 failure-timeout=10 primitive ip_19 ocf:heartbeat:IPaddr2 \ op monitor interval=10s \ params broadcast=172.18.19.31 ip=172.18.19.31 nic=bond0.119 iflabel=ha \ meta migration-threshold=1 failure-timeout=10 group svc-cse ip_19 ip_11 clone cl_tomcat d_tomcat colocation colo_tomcat inf: svc-cse cl_tomcat order order_tomcat inf: cl_tomcat svc-cse property $id=cib-bootstrap-options \ dc-version=1.1.9-1512.el6-2a917dd \ cluster-infrastructure=cman \ pe-warn-series-max=9 \ no-quorum-policy=ignore \ stonith-enabled=false \ pe-input-series-max=9 \ pe-error-series-max=9 \ last-lrm-refresh=1367582088 Currently only 1 node is available, CSE-1. 
This is how I am currently testing my setup: = Starting point: Everything up and running # crm resource status Resource Group: svc-cse ip_19(ocf::heartbeat:IPaddr2):Started ip_11(ocf::heartbeat:IPaddr2):Started Clone Set: cl_tomcat [d_tomcat] Started: [ CSE-1 ] Stopped: [ d_tomcat:1 ] = Causing failure: Change system so tomcat is running but has a failure (in attachment step_2.log) # crm resource status Resource Group: svc-cse ip_19(ocf::heartbeat:IPaddr2):Stopped ip_11(ocf::heartbeat:IPaddr2):Stopped Clone Set: cl_tomcat [d_tomcat] d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED Stopped: [ d_tomcat:1 ] = Fixing failure: Revert system so tomcat is running without failure (in attachment step_3.log) # crm resource status Resource Group: svc-cse ip_19(ocf::heartbeat:IPaddr2):Stopped ip_11(ocf::heartbeat:IPaddr2):Stopped Clone Set: cl_tomcat [d_tomcat] d_tomcat:0(ocf::ntc:tomcat):Started (unmanaged) FAILED Stopped: [ d_tomcat:1 ] As you can see in the logs the OCF script doesn't return any failure. This is noticed by pacemaker, however it doesn't reflect in crm_mon and it doesn't start the depending resources. Gr. Johan On 2013-05-03 03:04, Andrew Beekhof wrote: On 02/05/2013, at 5:45 PM, Johan Huysmans johan.huysm...@inuits.be wrote: On 2013-05-01 05:48, Andrew Beekhof wrote: On 17/04/2013, at 9:54 PM, Johan Huysmans johan.huysm...@inuits.be wrote: Hi All, I'm trying to setup a specific
Re: [Pacemaker] Behavior when crm_mon is a daemon
Hi, 2013/5/1 Andrew Beekhof and...@beekhof.net: On 19/04/2013, at 11:05 AM, Yuichi SEINO seino.clust...@gmail.com wrote: HI, 2013/4/16 Andrew Beekhof and...@beekhof.net: On 15/04/2013, at 7:42 PM, Yuichi SEINO seino.clust...@gmail.com wrote: Hi All, I am looking at the daemon tools to make a new daemon, so I have a question. When an old pid file exists, crm_mon still starts as a daemon. However, crm_mon doesn't update this old pid file, and crm_mon doesn't stop. I would like to know if this behavior is correct. Some of it is, but the part about crm_mon not updating the pid file (which is probably also preventing it from stopping) is bad. I understood that it is a negative behavior. If we can figure out the problem, I think we want to fix it. Done: https://github.com/beekhof/pacemaker/commit/e549770 Plus an extra bonus: https://github.com/beekhof/pacemaker/commit/479c5cc Thanks for fixing. I have confirmed it works. Sincerely, Yuichi -- Yuichi SEINO METROSYSTEMS CORPORATION E-mail:seino.clust...@gmail.com
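The behavior being fixed here is classic stale-pidfile handling: a daemon should refuse to start only if the recorded process is actually alive, and otherwise reclaim the pid file. A generic shell sketch of that pattern (the path is illustrative and this is not crm_mon's actual code; the liveness test is kill -0):

```shell
#!/bin/sh
# Stale-pidfile handling sketch: start only if the recorded process is
# dead; then overwrite the stale file with our own pid.
pidfile="/tmp/demo_crm_mon.pid"

true & stale=$!; wait "$stale"    # manufacture a pid that is certainly dead
echo "$stale" > "$pidfile"        # plant it as a stale pid file

# kill -0 sends no signal; it only checks whether the pid exists
if kill -0 "$(cat "$pidfile")" 2>/dev/null; then
    echo "already running as pid $(cat "$pidfile")"
    exit 1
fi

echo $$ > "$pidfile"              # reclaim the pid file for ourselves
echo "reclaimed: pid file now holds $$"
```

Note that kill -0 can false-positive if the old pid was recycled by an unrelated process; robust daemons pair the pid with a lock (flock or lockf) on the pid file itself.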