Re: [Pacemaker] Node remains offline (was Node remains online)
On Thu, Mar 10, 2011 at 9:10 PM, Bart Coninckx bart.conin...@telenet.be wrote: Hi all, I have a three node cluster and while introducing the third node, it remains offline no matter what I do. Nothing you've shown here seems to indicate its offline - what leads you to that conclusion? Another symptom is that stopping openais takes forever on that node, while it is waiting for crmd to unload. The logfile shows this node (xen3) to be online however: Mar 10 20:55:26 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x6987c0 for attrd/10120 Mar 10 20:55:26 corosync [pcmk ] info: pcmk_ipc: Recorded connection 0x69cb20 for cib/10118 Mar 10 20:55:26 corosync [pcmk ] info: pcmk_ipc: Sending membership update 4100 to cib Mar 10 20:55:26 corosync [CLM ] CLM CONFIGURATION CHANGE Mar 10 20:55:26 corosync [CLM ] New Configuration: Mar 10 20:55:26 corosync [CLM ] r(0) ip(10.0.1.13) r(1) ip(10.0.2.13) Mar 10 20:55:26 corosync [CLM ] Members Left: Mar 10 20:55:26 corosync [CLM ] Members Joined: Mar 10 20:55:26 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 4104: memb=1, new=0, lost=0 Mar 10 20:55:26 corosync [pcmk ] info: pcmk_peer_update: memb: xen3 218169354 Mar 10 20:55:26 corosync [CLM ] CLM CONFIGURATION CHANGE Mar 10 20:55:26 corosync [CLM ] New Configuration: Mar 10 20:55:26 corosync [CLM ] r(0) ip(10.0.1.11) r(1) ip(10.0.2.11) Mar 10 20:55:26 corosync [CLM ] r(0) ip(10.0.1.12) r(1) ip(10.0.2.12) Mar 10 20:55:26 corosync [CLM ] r(0) ip(10.0.1.13) r(1) ip(10.0.2.13) Mar 10 20:55:26 corosync [CLM ] Members Left: Mar 10 20:55:26 corosync [CLM ] Members Joined: Mar 10 20:55:26 corosync [CLM ] r(0) ip(10.0.1.11) r(1) ip(10.0.2.11) Mar 10 20:55:26 corosync [CLM ] r(0) ip(10.0.1.12) r(1) ip(10.0.2.12) Mar 10 20:55:26 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 4104: memb=3, new=2, lost=0 Mar 10 20:55:26 corosync [pcmk ] info: update_member: Creating entry for node 184614922 born on 4104 Mar 10 20:55:26 corosync 
[pcmk ] info: update_member: Node 184614922/unknown is now: member Mar 10 20:55:26 corosync [pcmk ] info: pcmk_peer_update: NEW: .pending. 184614922 Mar 10 20:55:26 corosync [pcmk ] info: update_member: Creating entry for node 201392138 born on 4104 Mar 10 20:55:26 corosync [pcmk ] info: update_member: Node 201392138/unknown is now: member Mar 10 20:55:26 corosync [pcmk ] info: pcmk_peer_update: NEW: .pending. 201392138 Mar 10 20:55:26 corosync [pcmk ] info: pcmk_peer_update: MEMB: .pending. 184614922 Mar 10 20:55:26 corosync [pcmk ] info: pcmk_peer_update: MEMB: .pending. 201392138 Mar 10 20:55:26 corosync [pcmk ] info: pcmk_peer_update: MEMB: xen3 218169354 Mar 10 20:55:26 corosync [pcmk ] info: send_member_notification: Sending membership update 4104 to 1 children Mar 10 20:55:26 corosync [pcmk ] info: update_member: 0x7f4268000c80 Node 218169354 ((null)) born on: 4104 Mar 10 20:55:26 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Mar 10 20:55:26 corosync [pcmk ] info: update_member: 0x7f4268001120 Node 201392138 (xen2) born on: 3800 Mar 10 20:55:26 corosync [pcmk ] info: update_member: 0x7f4268001120 Node 201392138 now known as xen2 (was: (null)) Mar 10 20:55:26 corosync [pcmk ] info: update_member: Node xen2 now has process list: 00151312 (1381138) Mar 10 20:55:26 corosync [pcmk ] info: update_member: Node xen2 now has 1 quorum votes (was 0) Mar 10 20:55:26 corosync [pcmk ] info: send_member_notification: Sending membership update 4104 to 1 children Mar 10 20:55:26 corosync [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2) Mar 10 20:55:26 xen3 cib: [10118]: notice: ais_dispatch_message: Membership 4104: quorum acquired Mar 10 20:55:26 corosync [pcmk ] info: update_member: 0x7f4268000aa0 Node 184614922 (xen1) born on: 3792 Mar 10 20:55:26 corosync [pcmk ] info: update_member: 0x7f4268000aa0 Node 184614922 now known as xen1 (was: (null)) Mar 10 20:55:26 corosync [pcmk 
] info: update_member: Node xen1 now has process list: 00151312 (1381138) Mar 10 20:55:26 corosync [pcmk ] info: update_member: Node xen1 now has 1 quorum votes (was 0) Mar 10 20:55:26 corosync [pcmk ] info: update_expected_votes: Expected quorum votes 2 - 3 Mar 10 20:55:26 corosync [pcmk ] info: send_member_notification: Sending membership update 4104 to 1 children Mar 10 20:55:26 corosync [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2) Mar 10 20:55:26 corosync [TOTEM ] Marking ringid 1 interface 10.0.2.13 FAULTY - adminisrtative intervention required. Mar 10 20:55:26 corosync [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2) Mar 10 20:55:26 xen3
[Pacemaker] shutting down pacemaker/corosync without shutting down the services
Hi! For maintenance reasons (e.g. updating Pacemaker) it might be necessary to shut down Pacemaker, but in such cases I want the services to keep running. Is it possible to shut down Pacemaker but keep the current service state, i.e. all services keep running on their current node? thanks Klaus ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] shutting down pacemaker/corosync without shutting down the services
On Friday 11 March 2011 11:29:47 Klaus Darilion wrote: Is it possible to shut down pacemaker but keep the current service state, i.e. all services should keep running on their current node?

crm configure rsc_defaults is-managed=false

Do not forget to set is-managed=true after maintenance.

-- Dr. Michael Schwartzkopff Guardinistr. 63 81375 München Tel: (0163) 172 50 98
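Put together as one hedged sketch of the procedure (the rsc_defaults command is the one given in this reply; the intermediate update steps are illustrative placeholders, not from the thread):

```
# Tell pacemaker to stop managing resources; they keep running and the
# cluster merely observes them:
crm configure rsc_defaults is-managed=false

# ... shut down pacemaker/corosync on each node in turn and update ...

# Afterwards, hand control back to the cluster:
crm configure rsc_defaults is-managed=true
```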
Re: [Pacemaker] Failure after intermittent network outage
On Thu, Mar 10, 2011 at 1:03 PM, Pavel Levshin pa...@levshin.spb.ru wrote: Hi, No, I think you've missed the point. The RA did not answer at all. Monitor actions had been lost due to a cluster transition: You are incorrect. While it is true that some actions were NACK'd (not lost), such NACKs do not make it into the CIB and therefore cannot be the cause of logs such as: Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op p-drbd-mproxy1-2:0_monitor_0 on wapgw1-log: unknown error (1) So the RA did not have a chance to answer anything. Incorrect. Apart from this, should I fake all RAs which are supposed to be unused on particular nodes in the cluster? It seems to me like a partial solution only. Either remove the RA, or make sure it returns something sensible when the tools or configuration it needs are not available. Suppose that I want to use virtual machine X on hardware nodes A and B, and VM Y on nodes B and C. Using DRBD, this is a very common configuration, because X cannot access its disk device on hardware node C. Currently, I must configure X and Y on every hardware node, or the RA will fail with status not configured. It's not a minimal configuration, so it is more error prone than needed. I would be happy to tell the cluster never to touch resource X on node C in this case. What do you think? No. For safety we still need to verify that X is not running on node C before we allow it to be active anywhere else. That you know X is unavailable on C is one thing, but the cluster needs to know too. 10.03.2011 14:09, Andrew Beekhof wrote: Your basic problem is this... Mar 1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op: Processing failed op vm-mproxy1-1_monitor_0 on wapgw1-log: unknown error (1) We asked what state the resource was in and it replied arrrggg instead of not installed. Had it replied with not installed, we'd have no reason to call stop or fence the node to try and clean it up.
-- Pavel Levshin //flicker
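The distinction Andrew draws ("not installed" versus a generic error) maps onto the OCF exit codes. A hedged sketch of the usual probe guard at the top of a resource agent (hypothetical fragment, not the actual linbit RA; OCF_ERR_INSTALLED is the standard code 5):

```shell
#!/bin/sh
# Hedged sketch: during a probe, report OCF_ERR_INSTALLED (5) when the
# needed tool is absent, instead of a generic "unknown error" (1).
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5

probe_tool() {
    # $1: binary the resource needs, e.g. drbdadm
    if command -v "$1" >/dev/null 2>&1; then
        return $OCF_SUCCESS
    fi
    # This tells the cluster the software is missing, not that the
    # resource failed, so it has no reason to stop or fence.
    return $OCF_ERR_INSTALLED
}

probe_tool sh && echo "sh: installed"
rc=0; probe_tool definitely-missing-tool-xyz || rc=$?
echo "missing tool probe rc=$rc"
```

With a guard like this, the probe on a node without DRBD would come back as "not installed" rather than triggering recovery.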
Re: [Pacemaker] shutting down pacemaker/corosync without shutting down the services
is-managed-default=false

On Fri, Mar 11, 2011 at 11:29 AM, Klaus Darilion klaus.mailingli...@pernau.at wrote: Is it possible to shut down pacemaker but keep the current service state, i.e. all services should keep running on their current node?
Re: [Pacemaker] Failback problem with active/active cluster
On Thu, Mar 10, 2011 at 1:50 PM, Charles KOPROWSKI c...@audaxis.com wrote: Hello, I set up a two-node cluster (active/active) to build an http reverse proxy/firewall. There is one VIP shared by both nodes and an apache instance running on each node. Here is the configuration:

node lpa \
    attributes standby=off
node lpb \
    attributes standby=off
primitive ClusterIP ocf:heartbeat:IPaddr2 \
    params ip=10.1.52.3 cidr_netmask=16 clusterip_hash=sourceip \
    op monitor interval=30s
primitive HttpProxy ocf:heartbeat:apache \
    params configfile=/etc/apache2/apache2.conf \
    op monitor interval=1min
clone HttpProxyClone HttpProxy
clone ProxyIP ClusterIP \
    meta globally-unique=true clone-max=2 clone-node-max=2
colocation HttpProxy-with-ClusterIP inf: HttpProxyClone ProxyIP
order HttpProxyClone-after-ProxyIP inf: ProxyIP HttpProxyClone
property $id=cib-bootstrap-options \
    dc-version=1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd \
    cluster-infrastructure=openais \
    expected-quorum-votes=2 \
    stonith-enabled=false \
    no-quorum-policy=ignore

Everything works fine at the beginning:

Online: [ lpa lpb ]
 Clone Set: ProxyIP (unique)
     ClusterIP:0 (ocf::heartbeat:IPaddr2): Started lpa
     ClusterIP:1 (ocf::heartbeat:IPaddr2): Started lpb
 Clone Set: HttpProxyClone
     Started: [ lpa lpb ]

But after simulating an outage of one of the nodes with crm node standby and a recovery with crm node online, all resources stay on the same node:

Online: [ lpa lpb ]
 Clone Set: ProxyIP (unique)
     ClusterIP:0 (ocf::heartbeat:IPaddr2): Started lpa
     ClusterIP:1 (ocf::heartbeat:IPaddr2): Started lpa
 Clone Set: HttpProxyClone
     Started: [ lpa ]
     Stopped: [ HttpProxy:1 ]

Can you tell me if something is wrong in my configuration?

Essentially you have encountered a limitation in the allocation algorithm for clones in 1.0.x. The recently released 1.1.5 has the behavior you're looking for, but the patch is far too invasive to consider back-porting to 1.0.
crm_verify gives me the following output:

crm_verify[22555]: 2011/03/10_13:49:00 ERROR: clone_rsc_order_lh: Cannot interleave clone ProxyIP and HttpProxyClone because they do not support the same number of resources per node
crm_verify[22555]: 2011/03/10_13:49:00 ERROR: clone_rsc_order_lh: Cannot interleave clone HttpProxyClone and ProxyIP because they do not support the same number of resources per node

Many thanks, Regards, -- Charles KOPROWSKI
Re: [Pacemaker] problem with apache coming up
On Wed, Feb 16, 2011 at 4:50 PM, Testuser SST fatcha...@gmx.de wrote: Failed actions: Apache_start_0 (node=astinos, call=19, rc=1, status=complete): unknown error Any suggestions? Apache operates normally with a service httpd stop/start command. Well, that's not the same script that the cluster is using, so that doesn't imply much. If I had to guess, I'd say that you probably forgot to enable the status_url on the second machine. Check the apache config files for differences.
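The ocf:heartbeat:apache agent monitors the server through a status URL, so the missing piece on the second machine is likely the status handler. A hedged sketch of what that stanza might look like (assumption: the default /server-status location and Apache 2.2 access-control syntax; the actual URL is whatever the statusurl parameter points at):

```
<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>
```

If this block exists on one node but not the other, the start on the second node fails exactly as shown above.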
Re: [Pacemaker] Patch for bugzilla 2541: Shell should warn if parameter uniqueness is violated
Hi, On Thu, Mar 10, 2011 at 07:08:20PM +0100, Holger Teutsch wrote: Hi Dejan, On Thu, 2011-03-10 at 10:14 +0100, Dejan Muhamedagic wrote: Hi Holger, On Wed, Mar 09, 2011 at 07:58:02PM +0100, Holger Teutsch wrote: Hi Dejan, On Wed, 2011-03-09 at 14:00 +0100, Dejan Muhamedagic wrote: Hi Holger, In order to show the intention of the arguments more clearly: Instead of

def _verify(self, set_obj_semantic, set_obj_all = None):
    if not set_obj_all:
        set_obj_all = set_obj_semantic
    rc1 = set_obj_all.verify()
    if user_prefs.check_frequency != "never":
        rc2 = set_obj_semantic.semantic_check(set_obj_all)
    else:
        rc2 = 0
    return rc1 and rc2 <= 1

def verify(self, cmd):
    "usage: verify"
    if not cib_factory.is_cib_sane():
        return False
    return self._verify(mkset_obj("xml"))

This way (always passing both args):

def _verify(self, set_obj_semantic, set_obj_all):
    rc1 = set_obj_all.verify()
    if user_prefs.check_frequency != "never":
        rc2 = set_obj_semantic.semantic_check(set_obj_all)
    else:
        rc2 = 0
    return rc1 and rc2 <= 1

def verify(self, cmd):
    "usage: verify"
    if not cib_factory.is_cib_sane():
        return False
    set_obj_all = mkset_obj("xml")
    return self._verify(set_obj_all, set_obj_all)

See patch set_obj_all.diff

My only remaining concern is performance. Though the meta-data is cached, perhaps it will pay off to save the RAInfo instance with the element. But we can worry about that later. I can work on this as a next step. I'll do some testing on really big configurations and try to gauge the impact.
OK

The patch makes some regression tests blow up:

+ File /usr/lib64/python2.6/site-packages/crm/ui.py, line 1441, in verify
+    return self._verify(mkset_obj(xml))
+ File /usr/lib64/python2.6/site-packages/crm/ui.py, line 1433, in _verify
+    rc2 = set_obj_semantic.semantic_check(set_obj_all)
+ File /usr/lib64/python2.6/site-packages/crm/cibconfig.py, line 294, in semantic_check
+    rc = self.__check_unique_clash(set_obj_all)
+ File /usr/lib64/python2.6/site-packages/crm/cibconfig.py, line 274, in __check_unique_clash
+    process_primitive(node, clash_dict)
+ File /usr/lib64/python2.6/site-packages/crm/cibconfig.py, line 259, in process_primitive
+    if ra_params[ name ]['unique'] == '1':
+KeyError: 'OCF_CHECK_LEVEL'

Can't recall why OCF_CHECK_LEVEL appears here. There must be some good explanation :)

The good explanation is: not only params are in instance_attributes ... but OCF_CHECK_LEVEL as well, within operations ...

Yes, it's instance_attributes within operations.

The latest version no longer blows up the test - semantic_check.diff

Applied. Many thanks for the patch. Cheers, Dejan

Regards Holger

# HG changeset patch
# User Holger Teutsch holger.teut...@web.de
# Date 1299775617 -3600
# Branch hot
# Node ID 30730ccc0aa09c3a476a18c6d95c680b3595
# Parent 9fa61ee6e35ef190f4126e163e9bfe6911e35541
Low: Shell: Rename variable set_obj_verify to set_obj_all as it always contains all objects
Simplify usage of this var in [_]verify, pass to CibObjectSet.semantic_check

diff -r 9fa61ee6e35e -r 30730ccc0aa0 shell/modules/cibconfig.py
--- a/shell/modules/cibconfig.py	Wed Mar 09 13:41:27 2011 +0100
+++ b/shell/modules/cibconfig.py	Thu Mar 10 17:46:57 2011 +0100
@@ -230,7 +230,7 @@
 See below for specific implementations.
 '''
 pass
-def semantic_check(self):
+def semantic_check(self, set_obj_all):
 '''
 Test objects for sanity. This is about semantics.
'''
diff -r 9fa61ee6e35e -r 30730ccc0aa0 shell/modules/ui.py.in
--- a/shell/modules/ui.py.in	Wed Mar 09 13:41:27 2011 +0100
+++ b/shell/modules/ui.py.in	Thu Mar 10 17:46:57 2011 +0100
@@ -1425,12 +1425,10 @@
 set_obj = mkset_obj(*args)
 err_buf.release() # show them, but get an ack from the user
 return set_obj.edit()
-def _verify(self, set_obj_semantic, set_obj_verify = None):
-    if not set_obj_verify:
-        set_obj_verify = set_obj_semantic
-    rc1 = set_obj_verify.verify()
+def _verify(self, set_obj_semantic, set_obj_all):
+    rc1 = set_obj_all.verify()
 if user_prefs.check_frequency != "never":
-    rc2 = set_obj_semantic.semantic_check()
+    rc2 = set_obj_semantic.semantic_check(set_obj_all)
 else:
     rc2 = 0
 return rc1 and rc2 <= 1
@@ -1438,7 +1436,8 @@
 "usage: verify"
 if not cib_factory.is_cib_sane():
     return False
-return self._verify(mkset_obj(xml))
+set_obj_all = mkset_obj(xml)
+return
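The uniqueness check under discussion can be sketched in isolation (hypothetical simplified data model — the real shell walks CIB XML and RA meta-data; `find_unique_clashes` and its argument shapes are illustrative names, not crmsh API). Keying only on parameters the RA actually declares unique also sidesteps the OCF_CHECK_LEVEL pseudo-parameter that tripped the first version of the patch:

```python
# Sketch: flag primitives of the same RA type whose "unique" parameters
# carry identical values (e.g. two IPaddr2 resources with the same ip).
from collections import defaultdict

def find_unique_clashes(primitives, unique_params):
    """primitives: list of (id, ra_type, {param: value}).
    unique_params: {ra_type: set of parameter names marked unique='1'}.
    Returns {(ra_type, ((param, value), ...)): [ids]} for clashing sets."""
    clash_dict = defaultdict(list)
    for prim_id, ra_type, params in primitives:
        key_params = tuple(sorted(
            (name, value) for name, value in params.items()
            if name in unique_params.get(ra_type, set())))
        # Parameters not declared unique (e.g. OCF_CHECK_LEVEL, which lives
        # in instance_attributes within operations) never enter the key.
        if key_params:
            clash_dict[(ra_type, key_params)].append(prim_id)
    return {k: v for k, v in clash_dict.items() if len(v) > 1}

clashes = find_unique_clashes(
    [("ip1", "IPaddr2", {"ip": "10.0.0.1"}),
     ("ip2", "IPaddr2", {"ip": "10.0.0.1"}),
     ("ip3", "IPaddr2", {"ip": "10.0.0.2"})],
    {"IPaddr2": {"ip"}})
print(clashes)  # ip1 and ip2 clash on ip=10.0.0.1
```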
Re: [Pacemaker] Node remains offline (was Node remains online)
Hi Andrew, thank you for taking the time to answer. On Friday 11 March 2011 10:57:36 Andrew Beekhof wrote: Nothing you've shown here seems to indicate its offline - what leads you to that conclusion? both crm_mon and hb_gui show this. Thank you, B.
[Pacemaker] Drbd on a asymmetric cluster
My config is:

node sql01 \
    attributes standby=off
node sql02 \
    attributes standby=off
primitive drbd_mysql ocf:linbit:drbd \
    params drbd_resource=r0 \
    op monitor interval=15s
primitive fs_mysql ocf:heartbeat:Filesystem \
    params device=/dev/drbd0 directory=/datastore01 fstype=ext4
primitive ip_mysql ocf:heartbeat:IPaddr2 \
    params ip=192.168.50.20
primitive mysqld lsb:mysql
group mysql fs_mysql ip_mysql mysqld
ms ms_drbd_mysql drbd_mysql \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
location ms_drbd_mysql_on_both ms_drbd_mysql \
    rule $id=ms_drbd_mysql_on_both-rule inf: #uname eq sql01 or #uname eq sql02
location mysql_pri_loc mysql inf: sql01
location mysql_alt_loc mysql 100: sql02
colocation mysql_on_drbd inf: mysql ms_drbd_mysql:Master
order mysql_after_drbd inf: ms_drbd_mysql:promote mysql:start
property $id=cib-bootstrap-options \
    dc-version=1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b \
    cluster-infrastructure=openais \
    expected-quorum-votes=2 \
    stonith-enabled=false \
    symmetric-cluster=false \
    no-quorum-policy=ignore

I hope someone can explain this to me. It starts everything up with sql01 as primary and sql02 as secondary. When I put sql01 in standby, everything moves to sql02. But when I put sql01 back online, the resources stay put on sql02. My stickiness is zero.
I have tried to replace the ms_drbd_mysql_on_both with:

location ms_drbd_mysql_pri_loc ms_drbd_mysql inf: sql01
location ms_drbd_mysql_sec_loc ms_drbd_mysql 100: sql02

Then it freaks out, switching rapidly between

Node sql01: online
    drbd_mysql:0 (ocf::linbit:drbd) Slave
Node sql02: online
    drbd_mysql:1 (ocf::linbit:drbd) Slave

and

Node sql01: online
Node sql02: online
    drbd_mysql:1 (ocf::linbit:drbd) Slave

I can't find any examples on how to do DRBD on an asymmetric cluster. Thanks
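No reply appears in this digest, but a hedged sketch of one usual approach (untested assumption, borrowing the role-qualified rule syntax shown later in this thread): on an opt-in cluster, give the ms resource equal non-infinite scores on both nodes so the slave is never chased off, and express the primary preference only for the Master role:

```
location ms_drbd_mysql_on_both ms_drbd_mysql \
    rule 100: #uname eq sql01 \
    rule 100: #uname eq sql02
location ms_drbd_mysql_master_pref ms_drbd_mysql \
    rule $role=Master 50: #uname eq sql01
```

The colocation and order constraints from the config above would then pull the mysql group to whichever node holds the Master role.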
Re: [Pacemaker] Node remains offline (was Node remains online)
On Fri, Mar 11, 2011 at 12:12 PM, Bart Coninckx bart.conin...@telenet.be wrote: Hi Andrew, thank you for taking the time to answer. On Friday 11 March 2011 10:57:36 Andrew Beekhof wrote: Nothing you've shown here seems to indicate its offline - what leads you to that conclusion? both crm_mon and hb_gui show this. Could you show us too? Also attach the result of cibadmin -Ql
Re: [Pacemaker] Failure after intermittent network outage
Hi Andrew. I'm sorry, but I cannot agree. Look again at the DC log. Here it says Action lost; this is why I use this term. Then it declares every monitor action as if it had failed with rc=1, which is not true. Note that even those actions which were directed to a nonexistent RA are listed as failed with rc=1. (DRBD is not installed on the target server, so there is no ocf:linbit:drbd.)

Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: Timer popped (timeout=2, abort_level=100, complete=false) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting transition, action lost: [Action 30]: In-flight (id: ilo-wapgw1-1:0_monitor_0, loc: wapgw1-log, priority: 0) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 30: ilo-wapgw1-1:0_monitor_0 on wapgw1-log timed out Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: Timer popped (timeout=2, abort_level=100, complete=false) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting transition, action lost: [Action 31]: In-flight (id: ilo-wapgw1-2:0_monitor_0, loc: wapgw1-log, priority: 0) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 31: ilo-wapgw1-2:0_monitor_0 on wapgw1-log timed out Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: Timer popped (timeout=2, abort_level=100, complete=false) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting transition, action lost: [Action 32]: In-flight (id: ilo-wapgw1-log:0_monitor_0, loc: wapgw1-log, priority: 0) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost Mar 1 11:17:20
wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 32: ilo-wapgw1-log:0_monitor_0 on wapgw1-log timed out Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: Timer popped (timeout=2, abort_level=100, complete=false) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting transition, action lost: [Action 33]: In-flight (id: p-drbd-mdirect1-1:0_monitor_0, loc: wapgw1-log, priority: 0) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 33: p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log timed out Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: Timer popped (timeout=2, abort_level=100, complete=false) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting transition, action lost: [Action 34]: In-flight (id: p-drbd-mdirect1-2:0_monitor_0, loc: wapgw1-log, priority: 0) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 34: p-drbd-mdirect1-2:0_monitor_0 on wapgw1-log timed out Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: Timer popped (timeout=2, abort_level=100, complete=false) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting transition, action lost: [Action 35]: In-flight (id: p-drbd-mproxy1-1:0_monitor_0, loc: wapgw1-log, priority: 0) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 35: p-drbd-mproxy1-1:0_monitor_0 on wapgw1-log timed out Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: Timer popped (timeout=2, abort_level=100, complete=false) Mar 1 11:17:20 
wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting transition, action lost: [Action 36]: In-flight (id: p-drbd-mproxy1-2:0_monitor_0, loc: wapgw1-log, priority: 0) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 36: p-drbd-mproxy1-2:0_monitor_0 on wapgw1-log timed out Mar 1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: Timer popped (timeout=2, abort_level=100, complete=false) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting transition, action lost: [Action 37]: In-flight (id: p-drbd-mrouter1-1:0_monitor_0, loc: wapgw1-log, priority: 0) Mar 1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: action_timer_callback:486 - Triggered
[Pacemaker] proper dampen value for ping resource
Hi! I wonder what a proper value for dampen would be. Dampen is documented as:

# attrd_updater --help | grep dampen
-d, --delay=value The time to wait (dampening) in seconds for further changes to occur

So, I would read this as the delay before forwarding changes, e.g. to not trigger fail-over on the first failed ping, but only after multiple failures. Now, the defaults (and the examples) have:

primitive pingtest ocf:pacemaker:ping \
    params host_list=... dampen=5s \
    op monitor interval=10s

This means that the host is pinged with 5 attempts (the default), then a 10-second pause, then another 5 pings, then again a 10-second pause. Thus, I would think that if pinging fails (5 of 5 fail), then this is an error, but failover happens 5 seconds later (dampen). Is this correct? If yes, then it would make more sense to adapt the examples and increase the dampen value to cover at least 2 failed ping attempts (5*2s + 10s + 5*2s = 30s), or shorten the attempts and interval, e.g.: attempts=1, interval=2s, dampen=5. If I am completely wrong, what behavior is really caused by dampen? Thanks Klaus
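No reply appears in this digest; purely as an illustration, the poster's second alternative (shorter attempts and interval) would look like this in crm syntax (values taken from the question itself; host_list left elided as in the original):

```
primitive pingtest ocf:pacemaker:ping \
    params host_list=... attempts=1 dampen=5s \
    op monitor interval=2s
```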
Re: [Pacemaker] Failback problem with active/active cluster
On Fri, Mar 11, 2011 at 2:19 PM, Charles KOPROWSKI c...@audaxis.com wrote: Le 11/03/2011 11:47, Andrew Beekhof wrote: Essentially you have encountered a limitation in the allocation algorithm for clones in 1.0.x The recently released 1.1.5 has the behavior you're looking for, but the patch is far too invasive to consider back-porting to 1.0. Thanks Andrew, Is there any possibility to move back manually a part of the ClusterIP resource (for example ClusterIP:1) to the other node? Or is it just impossible with this version? I _think_ it's impossible - which is certainly not terribly useful behavior.
Re: [Pacemaker] Failing back a multi-state resource eg. DRBD
On Mon, 2011-03-07 at 14:21 +0100, Dejan Muhamedagic wrote: Hi, On Fri, Mar 04, 2011 at 09:12:46AM -0500, David McCurley wrote: Are you wanting to move all the resources back or just that one resource? I'm still learning, but one simple way I move all resources back from nodeb to nodea is like this:

# on nodeb
sudo crm node standby
# now services migrate to nodea
# still on nodeb
sudo crm node online

This may be a naive way to do it but it works for now :) Yes, that would work. Though that would also make all other resources move from the standby node. There is also a crm resource migrate to migrate individual resources. For that, see here: resource migrate has no option to move ms resources, i.e. to make another node the master. What would work right now is to create a temporary location constraint:

location tmp1 ms-drbd0 \
    rule $id=tmp1-rule $role=Master inf: #uname eq nodea

Then, once the drbd got promoted on nodea, just remove the constraint:

crm configure delete tmp1

Obviously, we'd need to make some improvements here. resource migrate uses crm_resource to insert the location constraint, perhaps we should update it to also accept the role parameter. Can you please make an enhancement bugzilla report so that this doesn't get lost. Thanks, Dejan

Hi Dejan, it seems that the original author did not file the bug. I entered it as http://developerbugs.linux-foundation.org/show_bug.cgi?id=2567 Regards Holger
Re: [Pacemaker] Failing back a multi-state resource eg. DRBD
Hi Holger, On Fri, Mar 11, 2011 at 02:45:07PM +0100, Holger Teutsch wrote: it seems that the original author did not file the bug. I entered it as http://developerbugs.linux-foundation.org/show_bug.cgi?id=2567 Thanks for taking care of that.
Dejan

Regards Holger