Re: [Pacemaker] Node remains offline (was Node remains online)

2011-03-11 Thread Andrew Beekhof
On Thu, Mar 10, 2011 at 9:10 PM, Bart Coninckx bart.conin...@telenet.be wrote:
 Hi all,

 I have a three-node cluster, and while introducing the third node, it
 remains offline no matter what I do.

Nothing you've shown here seems to indicate it's offline - what leads
you to that conclusion?

 Another symptom is that stopping
 openais takes forever on that node, while it is waiting for crmd to unload.

 The logfile, however, shows this node (xen3) as online:

 Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_ipc: Recorded connection
 0x6987c0 for attrd/10120
 Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_ipc: Recorded connection
 0x69cb20 for cib/10118
 Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_ipc: Sending membership
 update 4100 to cib
 Mar 10 20:55:26 corosync [CLM   ] CLM CONFIGURATION CHANGE
 Mar 10 20:55:26 corosync [CLM   ] New Configuration:
 Mar 10 20:55:26 corosync [CLM   ]       r(0) ip(10.0.1.13) r(1)
 ip(10.0.2.13)
 Mar 10 20:55:26 corosync [CLM   ] Members Left:
 Mar 10 20:55:26 corosync [CLM   ] Members Joined:
 Mar 10 20:55:26 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
 membership event on ring 4104: memb=1, new=0, lost=0
 Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_peer_update: memb: xen3
 218169354
 Mar 10 20:55:26 corosync [CLM   ] CLM CONFIGURATION CHANGE
 Mar 10 20:55:26 corosync [CLM   ] New Configuration:
 Mar 10 20:55:26 corosync [CLM   ]       r(0) ip(10.0.1.11) r(1)
 ip(10.0.2.11)
 Mar 10 20:55:26 corosync [CLM   ]       r(0) ip(10.0.1.12) r(1)
 ip(10.0.2.12)
 Mar 10 20:55:26 corosync [CLM   ]       r(0) ip(10.0.1.13) r(1)
 ip(10.0.2.13)
 Mar 10 20:55:26 corosync [CLM   ] Members Left:
 Mar 10 20:55:26 corosync [CLM   ] Members Joined:
 Mar 10 20:55:26 corosync [CLM   ]       r(0) ip(10.0.1.11) r(1)
 ip(10.0.2.11)
 Mar 10 20:55:26 corosync [CLM   ]       r(0) ip(10.0.1.12) r(1)
 ip(10.0.2.12)
 Mar 10 20:55:26 corosync [pcmk  ] notice: pcmk_peer_update: Stable
 membership event on ring 4104: memb=3, new=2, lost=0
 Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Creating entry
 for node 184614922 born on 4104
 Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Node
 184614922/unknown is now: member
 Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_peer_update: NEW:
 .pending. 184614922
 Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Creating entry
 for node 201392138 born on 4104
 Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Node
 201392138/unknown is now: member
 Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_peer_update: NEW:
 .pending. 201392138
 Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
 .pending. 184614922
 Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
 .pending. 201392138
 Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_peer_update: MEMB: xen3
 218169354
 Mar 10 20:55:26 corosync [pcmk  ] info: send_member_notification:
 Sending membership update 4104 to 1 children
 Mar 10 20:55:26 corosync [pcmk  ] info: update_member: 0x7f4268000c80
 Node 218169354 ((null)) born on: 4104
 Mar 10 20:55:26 corosync [TOTEM ] A processor joined or left the
 membership and a new membership was formed.
 Mar 10 20:55:26 corosync [pcmk  ] info: update_member: 0x7f4268001120
 Node 201392138 (xen2) born on: 3800
 Mar 10 20:55:26 corosync [pcmk  ] info: update_member: 0x7f4268001120
 Node 201392138 now known as xen2 (was: (null))
 Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Node xen2 now has
 process list: 00151312 (1381138)
 Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Node xen2 now has
 1 quorum votes (was 0)
 Mar 10 20:55:26 corosync [pcmk  ] info: send_member_notification:
 Sending membership update 4104 to 1 children
 Mar 10 20:55:26 corosync [pcmk  ] WARN: route_ais_message: Sending
 message to local.crmd failed: ipc delivery failed (rc=-2)
 Mar 10 20:55:26 xen3 cib: [10118]: notice: ais_dispatch_message:
 Membership 4104: quorum acquired
 Mar 10 20:55:26 corosync [pcmk  ] info: update_member: 0x7f4268000aa0
 Node 184614922 (xen1) born on: 3792
 Mar 10 20:55:26 corosync [pcmk  ] info: update_member: 0x7f4268000aa0
 Node 184614922 now known as xen1 (was: (null))
 Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Node xen1 now has
 process list: 00151312 (1381138)
 Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Node xen1 now has
 1 quorum votes (was 0)
 Mar 10 20:55:26 corosync [pcmk  ] info: update_expected_votes: Expected
 quorum votes 2 - 3
 Mar 10 20:55:26 corosync [pcmk  ] info: send_member_notification:
 Sending membership update 4104 to 1 children
 Mar 10 20:55:26 corosync [pcmk  ] WARN: route_ais_message: Sending
 message to local.crmd failed: ipc delivery failed (rc=-2)
 Mar 10 20:55:26 corosync [TOTEM ] Marking ringid 1 interface 10.0.2.13
 FAULTY - administrative intervention required.
 Mar 10 20:55:26 corosync [pcmk  ] WARN: route_ais_message: Sending
 message to local.crmd failed: ipc delivery failed (rc=-2)
 Mar 10 20:55:26 xen3 

[Pacemaker] shutting down pacemaker/corosync without shutting down the services

2011-03-11 Thread Klaus Darilion
Hi!

For maintenance reasons (e.g. updating pacemaker) it might be necessary
to shut down pacemaker, but in such cases I want the services to keep
running.

Is it possible to shut down pacemaker but keep the current service
state, i.e. all services should keep running on their current node?

thanks
Klaus



Re: [Pacemaker] shutting down pacemaker/corosync without shutting down the services

2011-03-11 Thread Michael Schwartzkopff
On Friday 11 March 2011 11:29:47 Klaus Darilion wrote:
 Hi!
 
 For maintenance reasons (e.g. updating pacemaker) it might be necessary
 to shut down pacemaker, but in such cases I want the services to keep
 running.
 
 Is it possible to shut down pacemaker but keep the current service
 state, i.e. all services should keep running on their current node?
 
 thanks
 Klaus
 

crm configure rsc_defaults is-managed=false

Do not forget to set is-managed=true after maintenance.
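
A minimal sequence along those lines might look like this (a sketch only,
assuming the crm shell and an openais init script; adjust the service
name to your distribution):

crm configure rsc_defaults is-managed=false  # cluster stops managing resources
/etc/init.d/openais stop                     # the services themselves keep running
# ... perform the maintenance, e.g. update the pacemaker packages ...
/etc/init.d/openais start
crm configure rsc_defaults is-managed=true   # hand control back to the cluster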

-- 
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München

Tel: (0163) 172 50 98




Re: [Pacemaker] Failure after intermittent network outage

2011-03-11 Thread Andrew Beekhof
On Thu, Mar 10, 2011 at 1:03 PM, Pavel Levshin pa...@levshin.spb.ru wrote:
 Hi,

 No, I think you've missed the point. The RA did not answer at all. Monitor
 actions were lost due to a cluster transition:

You are incorrect.
While it is true that some actions were NACKed (not lost), such NACKs
do not make it into the CIB and therefore cannot be the cause of logs
such as:

Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
Processing failed op p-drbd-mproxy1-2:0_monitor_0 on wapgw1-log:
unknown error (1)


 So, the RA did not have a chance to answer anything.

Incorrect.

 Apart from this, should I fake all RAs which are supposed to be unused on
 the particular nodes in the cluster? That seems to me like only a partial
 solution.

Either remove the RA, or make sure it returns something sensible when
tools or configuration it needs are not available.
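
For the latter, the OCF convention is for the status/monitor action to
exit with OCF_ERR_INSTALLED (5) when the software the agent manages is
not present. A minimal sketch, assuming the agent sources ocf-shellfuncs
and using drbdadm as a stand-in for whatever binary it needs:

drbd_monitor() {
    # Report "not installed" rather than a generic error when the
    # required tools are absent on this node.
    if ! which drbdadm >/dev/null 2>&1; then
        exit $OCF_ERR_INSTALLED    # rc 5: not installed
    fi
    # ... normal status probing follows ...
}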


 Suppose that I want to use virtual machine X on hardware nodes A and B,
 and VM Y on nodes B and C. With DRBD this is a very common configuration,
 because X cannot access its disk device on hardware node C. Currently,
 I must configure X and Y on every hardware node, or the RA will fail with
 status not configured. That is not a minimal configuration, so it is more
 error-prone than needed.

 I would be happy to tell the cluster never to touch resource X on node C
 in this case. What do you think?

No.  For safety we still need to verify that X is not running on node
C before we allow it to be active anywhere else.
That you know X is unavailable on C is one thing, but the cluster
needs to know too.



 10.03.2011 14:09, Andrew Beekhof wrote:

 Your basic problem is this...

 Mar  1 11:17:21 wapgw1-2 pengine: [5748]: WARN: unpack_rsc_op:
 Processing failed op vm-mproxy1-1_monitor_0 on wapgw1-log: unknown
 error (1)

 We asked what state the resource was in and it replied "arrrggg"
 instead of "not installed".
 Had it replied with "not installed", we'd have no reason to call stop or
 fence the node to try and clean it up.


 --
 Pavel Levshin //flicker






Re: [Pacemaker] shutting down pacemaker/corosync without shutting down the services

2011-03-11 Thread Andrew Beekhof
is-managed-default=false
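
(That is the legacy 1.0-era cluster property; assuming your version still
accepts it, it could be set with something like:

crm configure property is-managed-default=false

and set back to true once the maintenance is done.)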

On Fri, Mar 11, 2011 at 11:29 AM, Klaus Darilion
klaus.mailingli...@pernau.at wrote:
 Hi!

 For maintenance reasons (e.g. updating pacemaker) it might be necessary
 to shut down pacemaker, but in such cases I want the services to keep
 running.

 Is it possible to shut down pacemaker but keep the current service
 state, i.e. all services should keep running on their current node?

 thanks
 Klaus





Re: [Pacemaker] Failback problem with active/active cluster

2011-03-11 Thread Andrew Beekhof
On Thu, Mar 10, 2011 at 1:50 PM, Charles KOPROWSKI c...@audaxis.com wrote:
 Hello,

 I set up a two-node cluster (active/active) to build an HTTP reverse
 proxy/firewall. There is one VIP shared by both nodes and an Apache
 instance running on each node.

 Here is the configuration:

 node lpa \
        attributes standby=off
 node lpb \
        attributes standby=off
 primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip=10.1.52.3 cidr_netmask=16 clusterip_hash=sourceip \
        op monitor interval=30s
 primitive HttpProxy ocf:heartbeat:apache \
        params configfile=/etc/apache2/apache2.conf \
        op monitor interval=1min
 clone HttpProxyClone HttpProxy
 clone ProxyIP ClusterIP \
        meta globally-unique=true clone-max=2 clone-node-max=2
 colocation HttpProxy-with-ClusterIP inf: HttpProxyClone ProxyIP
 order HttpProxyClone-after-ProxyIP inf: ProxyIP HttpProxyClone
 property $id=cib-bootstrap-options \
        dc-version=1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd \
        cluster-infrastructure=openais \
        expected-quorum-votes=2 \
        stonith-enabled=false \
        no-quorum-policy=ignore


 Everything works fine at the beginning:


 Online: [ lpa lpb ]

  Clone Set: ProxyIP (unique)
     ClusterIP:0        (ocf::heartbeat:IPaddr2):       Started lpa
     ClusterIP:1        (ocf::heartbeat:IPaddr2):       Started lpb
  Clone Set: HttpProxyClone
     Started: [ lpa lpb ]


 But after simulating an outage of one of the nodes with crm node standby
 and a recovery with crm node online, all resources stay on the same node:


 Online: [ lpa lpb ]

  Clone Set: ProxyIP (unique)
     ClusterIP:0        (ocf::heartbeat:IPaddr2):       Started lpa
     ClusterIP:1        (ocf::heartbeat:IPaddr2):       Started lpa
  Clone Set: HttpProxyClone
     Started: [ lpa ]
     Stopped: [ HttpProxy:1 ]


 Can you tell me if something is wrong in my configuration?

Essentially you have encountered a limitation in the allocation
algorithm for clones in 1.0.x.
The recently released 1.1.5 has the behavior you're looking for, but
the patch is far too invasive to consider back-porting to 1.0.


 crm_verify gives me the following output:

 crm_verify[22555]: 2011/03/10_13:49:00 ERROR: clone_rsc_order_lh: Cannot
 interleave clone ProxyIP and HttpProxyClone because they do not support the
 same number of resources per node
 crm_verify[22555]: 2011/03/10_13:49:00 ERROR: clone_rsc_order_lh: Cannot
 interleave clone HttpProxyClone and ProxyIP because they do not support the
 same number of resources per node


 Many thanks,

 Regards,

 --
 Charles KOPROWSKI







Re: [Pacemaker] problem with apache coming up

2011-03-11 Thread Andrew Beekhof
On Wed, Feb 16, 2011 at 4:50 PM, Testuser  SST fatcha...@gmx.de wrote:

 Failed actions:
    Apache_start_0 (node=astinos, call=19, rc=1, status=complete): unknown error

 Any suggestions? Apache operates normally with a service httpd
 stop/start command.


Well, that's not the same script that the cluster is using, so that
doesn't imply much.
If I had to guess, I'd say that you probably forgot to enable the
status_url on the second machine.

Check the apache config files for differences.
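
The monitor of ocf:heartbeat:apache relies on a working status URL, so
something like the following needs to be enabled, and reachable from
localhost, on every node. A typical snippet in 2.2-era syntax; your path
and access rules may differ:

<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>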



Re: [Pacemaker] Patch for bugzilla 2541: Shell should warn if parameter uniqueness is violated

2011-03-11 Thread Dejan Muhamedagic
Hi,

On Thu, Mar 10, 2011 at 07:08:20PM +0100, Holger Teutsch wrote:
 Hi Dejan,
 On Thu, 2011-03-10 at 10:14 +0100, Dejan Muhamedagic wrote:
  Hi Holger,
  
  On Wed, Mar 09, 2011 at 07:58:02PM +0100, Holger Teutsch wrote:
   Hi Dejan,
   
   On Wed, 2011-03-09 at 14:00 +0100, Dejan Muhamedagic wrote:
Hi Holger,
   
 
   
   In order to show the intention of the arguments more clearly:
   
   Instead of
   
    def _verify(self, set_obj_semantic, set_obj_all = None):
        if not set_obj_all:
            set_obj_all = set_obj_semantic
        rc1 = set_obj_all.verify()
        if user_prefs.check_frequency != "never":
            rc2 = set_obj_semantic.semantic_check(set_obj_all)
        else:
            rc2 = 0
        return rc1 and rc2 <= 1

    def verify(self, cmd):
        "usage: verify"
        if not cib_factory.is_cib_sane():
            return False
        return self._verify(mkset_obj("xml"))
   
   This way (always passing both args):
   
    def _verify(self, set_obj_semantic, set_obj_all):
        rc1 = set_obj_all.verify()
        if user_prefs.check_frequency != "never":
            rc2 = set_obj_semantic.semantic_check(set_obj_all)
        else:
            rc2 = 0
        return rc1 and rc2 <= 1

    def verify(self, cmd):
        "usage: verify"
        if not cib_factory.is_cib_sane():
            return False
        set_obj_all = mkset_obj("xml")
        return self._verify(set_obj_all, set_obj_all)
 
 See patch set_obj_all.diff
 
  
My only remaining concern is performance. Though the meta-data is
cached, perhaps it will pay off to save the RAInfo instance with
the element. But we can worry about that later.

   
    I can work on this as a next step.
  
  I'll do some testing on really big configurations and try to
  gauge the impact.
 
 OK
 
  
  The patch makes some regression tests blow up:
  
  +  File "/usr/lib64/python2.6/site-packages/crm/ui.py", line 1441, in verify
  +    return self._verify(mkset_obj("xml"))
  +  File "/usr/lib64/python2.6/site-packages/crm/ui.py", line 1433, in _verify
  +    rc2 = set_obj_semantic.semantic_check(set_obj_all)
  +  File "/usr/lib64/python2.6/site-packages/crm/cibconfig.py", line 294, in semantic_check
  +    rc = self.__check_unique_clash(set_obj_all)
  +  File "/usr/lib64/python2.6/site-packages/crm/cibconfig.py", line 274, in __check_unique_clash
  +    process_primitive(node, clash_dict)
  +  File "/usr/lib64/python2.6/site-packages/crm/cibconfig.py", line 259, in process_primitive
  +    if ra_params[ name ]['unique'] == '1':
  +KeyError: 'OCF_CHECK_LEVEL'
  
  Can't recall why OCF_CHECK_LEVEL appears here. There must be some
  good explanation :)
 
 The good explanation is: not only params live in instance_attributes ...
 OCF_CHECK_LEVEL does as well, within operations ...

Yes, it's instance_attributes within operations.
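
(For illustration, in crm shell syntax with a hypothetical Dummy
resource: an operation such as

primitive dummy ocf:heartbeat:Dummy \
        op monitor interval=20s OCF_CHECK_LEVEL=10

puts OCF_CHECK_LEVEL into the op's instance_attributes next to the
resource parameters, which is what __check_unique_clash stumbled over.)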

 The latest version no longer blows up the test - semantic_check.diff

Applied. Many thanks for the patch.

Cheers,

Dejan

 Regards
 Holger

 # HG changeset patch
 # User Holger Teutsch holger.teut...@web.de
 # Date 1299775617 -3600
 # Branch hot
 # Node ID 30730ccc0aa09c3a476a18c6d95c680b3595
 # Parent  9fa61ee6e35ef190f4126e163e9bfe6911e35541
 Low: Shell: Rename variable set_obj_verify to set_obj_all as it always 
 contains all objects
 Simplify usage of this var in [_]verify, pass to CibObjectSet.semantic_check
 
 diff -r 9fa61ee6e35e -r 30730ccc0aa0 shell/modules/cibconfig.py
 --- a/shell/modules/cibconfig.py  Wed Mar 09 13:41:27 2011 +0100
 +++ b/shell/modules/cibconfig.py  Thu Mar 10 17:46:57 2011 +0100
 @@ -230,7 +230,7 @@
          See below for specific implementations.
          '''
          pass
 -    def semantic_check(self):
 +    def semantic_check(self, set_obj_all):
          '''
          Test objects for sanity. This is about semantics.
          '''
 diff -r 9fa61ee6e35e -r 30730ccc0aa0 shell/modules/ui.py.in
 --- a/shell/modules/ui.py.in  Wed Mar 09 13:41:27 2011 +0100
 +++ b/shell/modules/ui.py.in  Thu Mar 10 17:46:57 2011 +0100
 @@ -1425,12 +1425,10 @@
          set_obj = mkset_obj(*args)
          err_buf.release() # show them, but get an ack from the user
          return set_obj.edit()
 -    def _verify(self, set_obj_semantic, set_obj_verify = None):
 -        if not set_obj_verify:
 -            set_obj_verify = set_obj_semantic
 -        rc1 = set_obj_verify.verify()
 +    def _verify(self, set_obj_semantic, set_obj_all):
 +        rc1 = set_obj_all.verify()
          if user_prefs.check_frequency != "never":
 -            rc2 = set_obj_semantic.semantic_check()
 +            rc2 = set_obj_semantic.semantic_check(set_obj_all)
          else:
              rc2 = 0
          return rc1 and rc2 <= 1
 @@ -1438,7 +1436,8 @@
          "usage: verify"
          if not cib_factory.is_cib_sane():
              return False
 -        return self._verify(mkset_obj("xml"))
 +        set_obj_all = mkset_obj("xml")
 +        return self._verify(set_obj_all, set_obj_all)

Re: [Pacemaker] Node remains offline (was Node remains online)

2011-03-11 Thread Bart Coninckx
Hi Andrew,

thank you for taking the time to answer.

On Friday 11 March 2011 10:57:36 Andrew Beekhof wrote:

 Nothing you've shown here seems to indicate it's offline - what leads
 you to that conclusion?

both crm_mon and hb_gui show this.


Thank you,

B.



[Pacemaker] Drbd on a asymmetric cluster

2011-03-11 Thread Arthur B. Olsen
My config is:

node sql01 attributes standby=off
node sql02 attributes standby=off
primitive drbd_mysql ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=15s
primitive fs_mysql ocf:heartbeat:Filesystem \
        params device=/dev/drbd0 directory=/datastore01 fstype=ext4
primitive ip_mysql ocf:heartbeat:IPaddr2 params ip=192.168.50.20
primitive mysqld lsb:mysql
group mysql fs_mysql ip_mysql mysqld
ms ms_drbd_mysql drbd_mysql \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
location ms_drbd_mysql_on_both ms_drbd_mysql \
        rule $id=ms_drbd_mysql_on_both-rule inf: #uname eq sql01 or #uname eq sql02
location mysql_pri_loc mysql inf: sql01
location mysql_alt_loc mysql 100: sql02
colocation mysql_on_drbd inf: mysql ms_drbd_mysql:Master
order mysql_after_drbd inf: ms_drbd_mysql:promote mysql:start
property $id=cib-bootstrap-options \
        dc-version=1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b \
        cluster-infrastructure=openais \
        expected-quorum-votes=2 \
        stonith-enabled=false \
        symmetric-cluster=false \
        no-quorum-policy=ignore

I hope someone can explain this to me. It starts everything up with sql01 as
primary and sql02 as secondary. When I put sql01 in standby, everything moves
to sql02. But when I bring sql01 back online, the resources stay put on sql02,
even though my stickiness is zero.

I have tried to replace the ms_drbd_mysql_on_both with:
location ms_drbd_mysql_pri_loc ms_drbd_mysql inf: sql01
location ms_drbd_mysql_sec_loc ms_drbd_mysql 100: sql02

Then it freaks out, switching rapidly between

Node sql01: online
drbd_mysql:0 (ocf::linbit:drbd) Slave
Node sql02: online
drbd_mysql:1 (ocf::linbit:drbd) Slave

And

Node sql01: online
Node sql02: online
drbd_mysql:1 (ocf::linbit:drbd) Slave


I can't find any examples of how to do DRBD on an asymmetric cluster.
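
What I would have guessed from the documentation (untested, and possibly
wrong) is to keep equal opt-in scores for placing the clone instances on
both nodes and to express the primary preference with a role-specific
rule instead:

location ms_drbd_mysql_on_sql01 ms_drbd_mysql 100: sql01
location ms_drbd_mysql_on_sql02 ms_drbd_mysql 100: sql02
location ms_drbd_mysql_master_pref ms_drbd_mysql \
        rule $role=Master 50: #uname eq sql01

but I have no idea whether that is the intended way.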

Thanks


Re: [Pacemaker] Node remains offline (was Node remains online)

2011-03-11 Thread Andrew Beekhof
On Fri, Mar 11, 2011 at 12:12 PM, Bart Coninckx
bart.conin...@telenet.be wrote:
 Hi Andrew,

 thank you for taking the time to answer.

 On Friday 11 March 2011 10:57:36 Andrew Beekhof wrote:

 Nothing you've shown here seems to indicate it's offline - what leads
 you to that conclusion?

 both crm_mon and hb_gui show this.


Could you show us too?
Also attach the result of cibadmin -Ql



Re: [Pacemaker] Failure after intermittent network outage

2011-03-11 Thread Pavel Levshin

Hi Andrew.


I'm sorry, but I cannot agree.

Look again at the DC log. It says "Action lost", and this is why I use
this term.

Then it declares every monitor action as failed with rc=1, which is not
true. Note that even the actions which were directed at a nonexistent
RA are listed as failed with rc=1. (DRBD is not installed on the target
server, so there is no ocf:linbit:drbd.)



Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: 
Timer popped (timeout=2, abort_level=100, complete=false)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting 
transition, action lost: [Action 30]: In-flight (id: 
ilo-wapgw1-1:0_monitor_0, loc: wapgw1-log, priority: 0)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: 
action_timer_callback:486 - Triggered transition abort (complete=0) : 
Action lost
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 
30: ilo-wapgw1-1:0_monitor_0 on wapgw1-log timed out
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: 
Timer popped (timeout=2, abort_level=100, complete=false)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting 
transition, action lost: [Action 31]: In-flight (id: 
ilo-wapgw1-2:0_monitor_0, loc: wapgw1-log, priority: 0)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: 
action_timer_callback:486 - Triggered transition abort (complete=0) : 
Action lost
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 
31: ilo-wapgw1-2:0_monitor_0 on wapgw1-log timed out
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: 
Timer popped (timeout=2, abort_level=100, complete=false)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting 
transition, action lost: [Action 32]: In-flight (id: 
ilo-wapgw1-log:0_monitor_0, loc: wapgw1-log, priority: 0)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: 
action_timer_callback:486 - Triggered transition abort (complete=0) : 
Action lost
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 
32: ilo-wapgw1-log:0_monitor_0 on wapgw1-log timed out
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: 
Timer popped (timeout=2, abort_level=100, complete=false)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting 
transition, action lost: [Action 33]: In-flight (id: 
p-drbd-mdirect1-1:0_monitor_0, loc: wapgw1-log, priority: 0)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: 
action_timer_callback:486 - Triggered transition abort (complete=0) : 
Action lost
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 
33: p-drbd-mdirect1-1:0_monitor_0 on wapgw1-log timed out
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: 
Timer popped (timeout=2, abort_level=100, complete=false)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting 
transition, action lost: [Action 34]: In-flight (id: 
p-drbd-mdirect1-2:0_monitor_0, loc: wapgw1-log, priority: 0)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: 
action_timer_callback:486 - Triggered transition abort (complete=0) : 
Action lost
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 
34: p-drbd-mdirect1-2:0_monitor_0 on wapgw1-log timed out
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: 
Timer popped (timeout=2, abort_level=100, complete=false)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting 
transition, action lost: [Action 35]: In-flight (id: 
p-drbd-mproxy1-1:0_monitor_0, loc: wapgw1-log, priority: 0)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: 
action_timer_callback:486 - Triggered transition abort (complete=0) : 
Action lost
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 
35: p-drbd-mproxy1-1:0_monitor_0 on wapgw1-log timed out
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: 
Timer popped (timeout=2, abort_level=100, complete=false)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting 
transition, action lost: [Action 36]: In-flight (id: 
p-drbd-mproxy1-2:0_monitor_0, loc: wapgw1-log, priority: 0)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: 
action_timer_callback:486 - Triggered transition abort (complete=0) : 
Action lost
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: cib_action_update: rsc_op 
36: p-drbd-mproxy1-2:0_monitor_0 on wapgw1-log timed out
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: WARN: action_timer_callback: 
Timer popped (timeout=2, abort_level=100, complete=false)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: ERROR: print_elem: Aborting 
transition, action lost: [Action 37]: In-flight (id: 
p-drbd-mrouter1-1:0_monitor_0, loc: wapgw1-log, priority: 0)
Mar  1 11:17:20 wapgw1-2 crmd: [5749]: info: abort_transition_graph: 
action_timer_callback:486 - Triggered 

[Pacemaker] proper dampen value for ping resource

2011-03-11 Thread Klaus Darilion
Hi!

I wonder what a proper value for dampen would be. Dampen is documented as:

# attrd_updater --help|grep dampen
 -d, --delay=value  The time to wait (dampening) in seconds for further
changes to occur


So, I would read this as the delay to forward changes, e.g. to not
trigger fail-over on the first failed ping, but only after multiple
failures.

Now, the defaults (and the examples) have:

primitive pingtest ocf:pacemaker:ping \
params host_list=... dampen=5s \
op monitor interval=10s

This means that the host is pinged with 5 attempts (the default), then a
10-second pause, then another 5 pings, then again a 10-second pause.

Thus, I would think that if pinging fails (5 of 5 attempts fail), then this
is an error, but failover happens 5 seconds later (dampen). Is this
correct? If yes, then it would make more sense to adapt the examples and
increase the dampen value to cover at least 2 failed ping attempts (5*2s
+ 10s + 5*2s = 30s), or to shorten the attempts and interval, e.g.:
attempts=1, interval=2s, dampen=5.
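
A sketch of that second variant (values illustrative only; attempts and
timeout are parameters of the ocf:pacemaker:ping agent):

primitive pingtest ocf:pacemaker:ping \
        params host_list=... attempts=1 timeout=2 dampen=5s \
        op monitor interval=2s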

If I am completely wrong, what behavior is really caused by dampen?

Thanks
Klaus



Re: [Pacemaker] Failback problem with active/active cluster

2011-03-11 Thread Andrew Beekhof
On Fri, Mar 11, 2011 at 2:19 PM, Charles KOPROWSKI c...@audaxis.com wrote:
 On 11/03/2011 11:47, Andrew Beekhof wrote:

 Essentially you have encountered a limitation in the allocation
 algorithm for clones in 1.0.x
 The recently released 1.1.5 has the behavior you're looking for, but
 the patch is far too invasive to consider back-porting to 1.0.

 Thanks Andrew,

 Is there any possibility to manually move part of the ClusterIP
 resource (for example ClusterIP:1) back to the other node? Or is it just
 impossible with this version?

I _think_ it's impossible - which is certainly not terribly useful behavior.



Re: [Pacemaker] Failing back a multi-state resource eg. DRBD

2011-03-11 Thread Holger Teutsch
On Mon, 2011-03-07 at 14:21 +0100, Dejan Muhamedagic wrote:
 Hi,
 
 On Fri, Mar 04, 2011 at 09:12:46AM -0500, David McCurley wrote:
  Are you wanting to move all the resources back or just that one resource?
  
  I'm still learning, but one simple way I move all resources back from nodeb 
  to nodea is like this:
  
  # on nodeb
  sudo crm node standby
  # now services migrate to nodea
  # still on nodeb
  sudo crm node online
  
  This may be a naive way to do it but it works for now :)
 
 Yes, that would work. Though that would also make all other
 resources move from the standby node.
 
  There is also a crm resource migrate to migrate individual resources.  
  For that, see here:
 
 resource migrate has no option to move ms resources, i.e. to make
 another node the master.
 
 What would work right now is to create a temporary location
 constraint:
 
 location tmp1 ms-drbd0 \
 rule $id=tmp1-rule $role=Master inf: #uname eq nodea
 
 Then, once the drbd got promoted on nodea, just remove the
 constraint:
 
 crm configure delete tmp1
 
 Obviously, we'd need to make some improvements here. resource
 migrate uses crm_resource to insert the location constraint,
 perhaps we should update it to also accept the role parameter.
 
 Can you please make an enhancement bugzilla report so that this
 doesn't get lost.
 
 Thanks,
 
 Dejan

Hi Dejan,
it seems that the original author did not file the bug.
I entered it as

http://developerbugs.linux-foundation.org/show_bug.cgi?id=2567

Regards
Holger





Re: [Pacemaker] Failing back a multi-state resource eg. DRBD

2011-03-11 Thread Dejan Muhamedagic
Hi Holger,

On Fri, Mar 11, 2011 at 02:45:07PM +0100, Holger Teutsch wrote:
 On Mon, 2011-03-07 at 14:21 +0100, Dejan Muhamedagic wrote:
  Hi,
  
  On Fri, Mar 04, 2011 at 09:12:46AM -0500, David McCurley wrote:
   Are you wanting to move all the resources back or just that one resource?
   
   I'm still learning, but one simple way I move all resources back from 
   nodeb to nodea is like this:
   
   # on nodeb
   sudo crm node standby
   # now services migrate to nodea
   # still on nodeb
   sudo crm node online
   
   This may be a naive way to do it but it works for now :)
  
  Yes, that would work. Though that would also make all other
  resources move from the standby node.
  
   There is also a crm resource migrate to migrate individual resources.  
   For that, see here:
  
  resource migrate has no option to move ms resources, i.e. to make
  another node the master.
  
  What would work right now is to create a temporary location
  constraint:
  
  location tmp1 ms-drbd0 \
  rule $id=tmp1-rule $role=Master inf: #uname eq nodea
  
  Then, once the drbd got promoted on nodea, just remove the
  constraint:
  
  crm configure delete tmp1
  
  Obviously, we'd need to make some improvements here. resource
  migrate uses crm_resource to insert the location constraint,
  perhaps we should update it to also accept the role parameter.
  
  Can you please make an enhancement bugzilla report so that this
  doesn't get lost.
  
  Thanks,
  
  Dejan
 
 Hi Dejan,
 it seems that the original author did not file the bug.
 I entered it as
 
 http://developerbugs.linux-foundation.org/show_bug.cgi?id=2567

Thanks for taking care of that.

Dejan

 Regards
 Holger
 
 
 
