Re: [Linux-ha-dev] Announcing - the Assimilation monitoring system - a sub-project of Linux-HA

2011-08-17 Thread Alan Robertson
On 08/16/2011 05:08 PM, Angus Salkeld wrote:
 On Fri, Aug 12, 2011 at 03:11:36PM -0600, Alan Robertson wrote:
 Hi,

 Back last November or so, I started work on a new monitoring project -
 for monitoring servers and services.

 Its aims are:
 - Scalable virtually without limit - tens of thousands of servers is
 not a problem
 - Easy installation and upkeep - includes integrated discovery of
 servers, services
   and switches - without setting off security alarms ;-)

 This project isn't ready for a public release yet (it's in a fairly
 early stage), but it seemed worthwhile to let others know that the
 project exists, and start getting folks to read over the code, and
 perhaps begin to play with it a bit as well.

 The project has two arenas of operation:
   nanoprobes - which run in (nearly) every monitored machine
 Why not matahari (http://matahariproject.org/)?

   Collective management - running in a central server (or HA cluster).
 Quite similar to http://pacemaker-cloud.org/. Seems a
 shame not to be working together.

 -Angus
This is a set of ideas I've been working on for the last four years or 
so.  My most grandiose vision of it I called a Data Center Operating 
System.  That was about the same time that Amazon announced their first 
cloud offering (unknown to me at the time).  There are a few hints about 
it in my blog from a couple of years ago.

I heard a little about Andrew's project when I announced this back in 
November.  Andrew has made it perfectly clear that he doesn't want to 
work with me (really, absolutely, abundantly, perfectly, crystal clear), 
and there is evidence that he doesn't work well with others besides me, 
so that's not a possibility.

In the short term I'm not especially concerned with clouds - just with 
any collection of computers, ranging from four machines up to (and beyond) 
cloud scale.  That includes clouds of course - but we'll get a lot more users 
at the small scale than we will at cloud scale.

There are several reasons for this approach:
   - Existing monitoring software sucks.
   - Many more collections of computers besides clouds exist and need 
help - although this would work very well with clouds

This problem has dimensions that a cloud environment doesn't have.  In a 
cloud, all deployment is automated, so you can _know_ what is running 
where.  In a more conventional data center, having a way to discover 
what's in your data center, and what's running on those servers is 
important.

-- 
 Alan Robertson    al...@unix.sh

Openness is the foundation and preservative of friendship...  Let me claim 
from you at all times your undisguised opinions. - William Wilberforce
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] OCF RA for named

2011-08-17 Thread Lars Ellenberg
On Tue, Aug 16, 2011 at 08:51:04AM -0600, Serge Dubrouski wrote:
 On Tue, Aug 16, 2011 at 8:44 AM, Dejan Muhamedagic de...@suse.de wrote:
 
  Hi Serge,
 
  On Fri, Aug 05, 2011 at 08:19:52AM -0600, Serge Dubrouski wrote:
   No interest?
 
  Probably not true :) It's just that recently I've been away for
  a while and in between really swamped with my daily work. I'm
  trying to catch up now, but it may take a while.
 
  In the meantime, I'd like to ask you about the motivation. DNS
  already has a sort of redundancy built in through its
  primary/secondary servers.
 
 
 That redundancy doesn't work all that well. Yes, you can have primary and
 secondary servers configured in resolv.conf, but if the primary is down, the
 resolver waits until the request to the primary server times out before it sends
 a request to the secondary one. The delay can be up to 30 seconds and impacts some
 applications pretty badly. This is standard behaviour for Linux; Solaris, for
 example, works differently and isn't impacted by this issue. Workarounds are
 having a caching DNS server running locally, or making the primary DNS server
 highly available using Pacemaker :-)
 
 Here is what the man page for resolv.conf says:
 
  nameserver Name server IP address
  Internet address (in dot notation) of a name server that the
  resolver should query.  Up to MAXNS (currently 3, see resolv.h)
  name servers may be listed, one per keyword.  If there are
  multiple servers, the resolver library queries them in the order
  listed.  If no nameserver entries are present, the default is to
  use the name server on the local machine.  *(The algorithm used
  is to try a name server, and if the query times out, try the
  next, until out of name servers, then repeat trying all the
  name servers until a maximum number of retries are made.)*

options timeout:2 attempts:5 rotate

but yes, it is still a valid use case to have a clustered primary name server,
and possibly multiple backups.
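
For illustration, a client-side resolv.conf along those lines could look like
this (the addresses are just placeholders); with timeout:2 the resolver waits
only about two seconds per try before moving on to the next server, instead of
the default five:

nameserver 192.0.2.1
nameserver 192.0.2.2
options timeout:2 attempts:5 rotate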

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Announcing - the Assimilation monitoring system - a sub-project of Linux-HA

2011-08-17 Thread Angus Salkeld
On Wed, Aug 17, 2011 at 09:51:14AM -0600, Alan Robertson wrote:
 On 08/16/2011 05:08 PM, Angus Salkeld wrote:
  On Fri, Aug 12, 2011 at 03:11:36PM -0600, Alan Robertson wrote:
  Hi,
 
  Back last November or so, I started work on a new monitoring project -
  for monitoring servers and services.
 
  Its aims are:
  - Scalable virtually without limit - tens of thousands of servers is
  not a problem
  - Easy installation and upkeep - includes integrated discovery of
  servers, services
and switches - without setting off security alarms ;-)
 
  This project isn't ready for a public release yet (it's in a fairly
  early stage), but it seemed worthwhile to let others know that the
  project exists, and start getting folks to read over the code, and
  perhaps begin to play with it a bit as well.
 
  The project has two arenas of operation:
nanoprobes - which run in (nearly) every monitored machine
  Why not matahari (http://matahariproject.org/)?
 
Collective management - running in a central server (or HA cluster).
  Quite similar to http://pacemaker-cloud.org/. Seems a
  shame not to be working together.
 
  -Angus
 This is a set of ideas I've been working on for the last four years or 
 so.  My most grandiose vision of it I called a Data Center Operating 
 System.  That was about the same time that Amazon announced their first 
 cloud offering (unknown to me at the time).  There are a few hints about 
 it in my blog from a couple of years ago.
 
 I heard a little about Andrew's project when I announced this back in 
 November.  Andrew has made it perfectly clear that he doesn't want to 
 work with me (really, absolutely, abundantly, perfectly, crystal clear), 
 and there is evidence that he doesn't work well with others besides me, 
 so that's not a possibility.

Oops, seems like I have really stepped in it. Sorry for bringing this
up again.

 
 In the short term I'm not especially concerned with clouds - just with 
 any collection of computers, ranging from four machines up to (and beyond) 
 cloud scale.  That includes clouds of course - but we'll get a lot more users 
 at the small scale than we will at cloud scale.
 
 There are several reasons for this approach:
- Existing monitoring software sucks.
- Many more collections of computers besides clouds exist and need 
 help - although this would work very well with clouds
 
 This problem has dimensions that a cloud environment doesn't have.  In a 
 cloud, all deployment is automated, so you can _know_ what is running 
 where.  In a more conventional data center, having a way to discover 
 what's in your data center, and what's running on those servers is 
 important.

Well, technically this could easily be done in one project (with everyone
working on it), but it seems that's not going to happen.

-Angus

 
 -- 
  Alan Robertson    al...@unix.sh
 
 Openness is the foundation and preservative of friendship...  Let me claim 
 from you at all times your undisguised opinions. - William Wilberforce
 ___
 Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
 Home Page: http://linux-ha.org/
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Announcing - the Assimilation monitoring system - a sub-project of Linux-HA

2011-08-17 Thread Digimer
On 08/17/2011 07:16 PM, Angus Salkeld wrote:
 On Wed, Aug 17, 2011 at 09:51:14AM -0600, Alan Robertson wrote:
 On 08/16/2011 05:08 PM, Angus Salkeld wrote:
 On Fri, Aug 12, 2011 at 03:11:36PM -0600, Alan Robertson wrote:
 Hi,

 Back last November or so, I started work on a new monitoring project -
 for monitoring servers and services.

 Its aims are:
 - Scalable virtually without limit - tens of thousands of servers is
 not a problem
 - Easy installation and upkeep - includes integrated discovery of
 servers, services
   and switches - without setting off security alarms ;-)

 This project isn't ready for a public release yet (it's in a fairly
 early stage), but it seemed worthwhile to let others know that the
 project exists, and start getting folks to read over the code, and
 perhaps begin to play with it a bit as well.

 The project has two arenas of operation:
   nanoprobes - which run in (nearly) every monitored machine
 Why not matahari (http://matahariproject.org/)?

   Collective management - running in a central server (or HA cluster).
 Quite similar to http://pacemaker-cloud.org/. Seems a
 shame not to be working together.

 -Angus
 This is a set of ideas I've been working on for the last four years or 
 so.  My most grandiose vision of it I called a Data Center Operating 
 System.  That was about the same time that Amazon announced their first 
 cloud offering (unknown to me at the time).  There are a few hints about 
 it in my blog from a couple of years ago.

 I heard a little about Andrew's project when I announced this back in 
 November.  Andrew has made it perfectly clear that he doesn't want to 
 work with me (really, absolutely, abundantly, perfectly, crystal clear), 
 and there is evidence that he doesn't work well with others besides me, 
 so that's not a possibility.
 
 Oops, seems like I have really stepped in it. Sorry for bringing this
 up again.
 

 In the short term I'm not especially concerned with clouds - just with 
 any collection of computers, ranging from four machines up to (and beyond) 
 cloud scale.  That includes clouds of course - but we'll get a lot more users 
 at the small scale than we will at cloud scale.

 There are several reasons for this approach:
- Existing monitoring software sucks.
- Many more collections of computers besides clouds exist and need 
 help - although this would work very well with clouds

 This problem has dimensions that a cloud environment doesn't have.  In a 
 cloud, all deployment is automated, so you can _know_ what is running 
 where.  In a more conventional data center, having a way to discover 
 what's in your data center, and what's running on those servers is 
 important.
 
 Well, technically this could easily be done in one project (with everyone
 working on it), but it seems that's not going to happen.
 
 -Angus

Linux is fairly described as an ecosystem. Differing branches and
methods of solving a given problem are tried, and the one with the most
backing and merit wins. It's part of what makes open-source what it is.
So, from my point of view, best of luck to both. :)

-- 
Digimer
E-Mail:  digi...@alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin:   http://nodeassassin.org
At what point did we forget that the Space Shuttle was, essentially,
a program that strapped human beings to an explosion and tried to stab
through the sky with fire and math?
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


[Linux-HA] mysql problem when enabling semi-synchronous replication with corosync, mysql_home variable is mandatory

2011-08-17 Thread Gregory Steulet
Hi folks, 

Today I encountered some difficulties using the mysql resource in a 
semi-synchronous replication setup with MySQL and corosync. As you probably know, 
to use semi-synchronous replication the MySQL semisync plugins are mandatory. 

If you try to start mysql without specifying the MySQL base directory (basedir), 
you will get some error messages and mysql will shut down immediately, as 
demonstrated below: 


[root@mysql-qual1 heartbeat]# /u00/app/mysql/product/mysql-5.5.15/bin/mysqld 
--defaults-file=/etc/my.cnf 
--pid_file=/u00/app/mysql/admin/mysqld1/socket/mysqld1.pid 
--socket=/u00/app/mysql/admin/mysqld1/socket/mld1.sock 
--datadir=/u01/mysqldata/mysqld1 --user=mysql --port=33001 
110816 18:16:32 [ERROR] Can't find messagefile 
'/usr/local/mysql/share/errmsg.sys' 
110816 18:16:32 [Note] Plugin 'FEDERATED' is disabled. 
110816 18:16:32 [ERROR] 
110816 18:16:32 [Warning] Couldn't load plugin named 'rpl_semi_sync_master' 
with soname 'semisync_master.so'. 
110816 18:16:32 [ERROR] 
110816 18:16:32 [Warning] Couldn't load plugin named 'rpl_semi_sync_slave' with 
soname 'semisync_slave.so'. 
110816 18:16:32 InnoDB: The InnoDB memory heap is disabled 
110816 18:16:32 InnoDB: Mutexes and rw_locks use GCC atomic builtins 
110816 18:16:32 InnoDB: Compressed tables use zlib 1.2.3 
110816 18:16:32 InnoDB: Using Linux native AIO 
110816 18:16:32 InnoDB: Initializing buffer pool, size = 128.0M 
110816 18:16:32 InnoDB: Completed initialization of buffer pool 
110816 18:16:32 InnoDB: highest supported file format is Barracuda. 
110816 18:16:32 InnoDB: Waiting for the background threads to start 
110816 18:16:33 InnoDB: 1.1.8 started; log sequence number 1682335 
110816 18:16:33 [ERROR] Aborting 

110816 18:16:33 InnoDB: Starting shutdown... 
110816 18:16:33 InnoDB: Shutdown completed; log sequence number 1682335 
110816 18:16:33 [Note] 

In order to solve this problem it is mandatory to set basedir, which is not 
done in the file /usr/lib/ocf/resource.d/heartbeat/mysql; otherwise your mysql 
resource won't be able to start. Additionally, it would be really useful to have 
the possibility to set the port without using additional_parameters. I 
made some tests by setting the two additional variables (basedir and port) among 
those below, and it works perfectly. 

export OCF_RESKEY_binary=/u00/app/mysql/product/mysql-5.5.15/bin/mysqld 
export OCF_RESKEY_client_binary=/u00/app/mysql/product/mysql-5.5.15/bin/mysql 
export OCF_RESKEY_config=/u00/app/mysql/etc/my.cnf 
export OCF_RESKEY_datadir=/u01/mysqldata/mysqld1 
export OCF_RESKEY_user=mysql 
export OCF_RESKEY_group=mysql 
export OCF_RESKEY_test_table=mysql.user 
export OCF_RESKEY_test_user=root 
export OCF_RESKEY_test_passwd=manager 
export OCF_RESKEY_log=/u00/app/mysql/admin/mysqld1/log/mysqld1.log 
export OCF_RESKEY_pid=/u00/app/mysql/admin/mysqld1/socket/mysqld1.pid 
export OCF_RESKEY_socket=/u00/app/mysql/admin/mysqld1/socket/mysqld1.sock 
export OCF_RESKEY_port=33001 
export OCF_RESKEY_basedir=/u00/app/mysql/product/mysql-5.5.15 
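
With those variables exported, the agent can also be exercised by hand for
testing. A rough sketch, assuming the standard OCF_ROOT location (adjust the
paths to your setup):

export OCF_ROOT=/usr/lib/ocf
/usr/lib/ocf/resource.d/heartbeat/mysql start
/usr/lib/ocf/resource.d/heartbeat/mysql monitor; echo rc=$?
/usr/lib/ocf/resource.d/heartbeat/mysql stop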


In order to be able to use this in the corosync environment, the following 
changes are mandatory in the file /usr/lib/ocf/resource.d/heartbeat/mysql: 

mysql_start() { 
... 
${OCF_RESKEY_binary} --defaults-file=$OCF_RESKEY_config \ 
--pid-file=$OCF_RESKEY_pid \ 
--socket=$OCF_RESKEY_socket \ 
--datadir=$OCF_RESKEY_datadir \ 
--basedir=$OCF_RESKEY_basedir \ 
--port=$OCF_RESKEY_port \ 
--user=$OCF_RESKEY_user $OCF_RESKEY_additional_parameters >/dev/null 2>&1 & 
rc=$? 
... 
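
For reference, assuming the RA were extended with the proposed basedir and port
parameters, the corresponding Pacemaker primitive might look roughly like this
(the resource name is made up; the paths follow the values above - a sketch
only, not a tested configuration):

primitive p_mysqld1 ocf:heartbeat:mysql \
        params binary="/u00/app/mysql/product/mysql-5.5.15/bin/mysqld" \
               config="/u00/app/mysql/etc/my.cnf" \
               datadir="/u01/mysqldata/mysqld1" \
               basedir="/u00/app/mysql/product/mysql-5.5.15" \
               port="33001" \
               pid="/u00/app/mysql/admin/mysqld1/socket/mysqld1.pid" \
               socket="/u00/app/mysql/admin/mysqld1/socket/mysqld1.sock" \
               user="mysql" group="mysql" \
        op monitor interval="30"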

Best regards 

Greg 




Gregory Steulet 
Senior Consultant 
Delivery Manager 

gregory.steu...@dbi-services.com 
+41 79 963 43 69 
www.dbi-services.com 

dbi services Lausanne 
chemin Maillefer 36 | CH-1052 Le Mont-sur-Lausanne 

Blog www.dbi-services.com/blog 


Follow dbi services! 


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Bug or feature: crm_mon -1n

2011-08-17 Thread Ulrich Windl
Hi,

I wonder why no groups are displayed for crm_mon -1n: the resources are 
sorted by the node they are running on, but some of those resources still belong to 
groups. However, the groups are no longer displayed with -n (node view). That 
display doesn't make much sense with groups, IMHO.

Regards,
Ulrich


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] IPaddr not work correctly on large routing tables on Linux

2011-08-17 Thread Dejan Muhamedagic
Hi,

On Fri, Jun 24, 2011 at 08:10:58PM +0200, Lars Ellenberg wrote:
 On Fri, Jun 24, 2011 at 10:40:36AM +0200, Dejan Muhamedagic wrote:
  Hi,
  
  On Tue, May 17, 2011 at 04:10:14PM +0400, Alexander Polyakov wrote:
    IPaddr does not work correctly with large routing tables on Linux.
    
    If the routing table is large (e.g. a full view, obtained
    from quagga), then the node cannot remove the public IP address.
    In top, two processes (route and grep) hang while removing the IP. Because of
    this, there is no clean switchover of the node.
  
  Just how big is your routing table? What does time route -n
  show?
 
 Oh, a full routing table can get quite large.
 The BGP table currently has  350k entries ;-)
 
 Besides, there is no point in doing
   route -n | grep $IP first,
   and then doing route -n del -host $IP
 
 The grep is wrong, anyways, as it may match more than just the
 intended IP.  Testing on non-empty output ([ `x | y` ]) instead
 of the exit code (x | y >/dev/null) looks a bit strange as well.
 And, if it's not there, route del may complain,
 but all is good and shiny, still.
 
 So I'm fine with just dropping that if route -n | grep ; then.

And it has been removed.
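
The idea boils down to something like this rough sketch (not the actual IPaddr
code): drop the expensive route -n | grep pre-check and simply attempt the
delete, tolerating the case where the route is already gone:

route del -host "$IP" >/dev/null 2>&1 || true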

Cheers,

Dejan

 Possibly IPaddr2 works better for Alexander?
 
 -- 
 : Lars Ellenberg
 : LINBIT | Your Way to High Availability
 : DRBD/HA support and consulting http://www.linbit.com
 
 DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Stopping heartbeat on secondary node causes primary to fail

2011-08-17 Thread Lars Ellenberg
On Tue, Aug 09, 2011 at 04:03:40PM -0700, Chris Huber-Lantz wrote:
 Thanks for taking the time to post, I'm hoping you could clarify a bit. 
  From the linux-ha documentation on the ucast directive:
 
 Note that ucast directives which go to the local machine are effectively 
 ignored. This allows the ha.cf directives on all machines to be identical.
 
 This would seem to contradict what you are saying about the active 
 primary needing to see its own heartbeat in order to hold on to its 
 resources. Adding to this is the fact that if we restart heartbeat on 
 the primary, it comes up as normal and, not seeing a heartbeat from the 
 secondary, assumes control of the resources. If the problem was lack of 
 a local heartbeat, I don't understand why the server could come back up 
 like this.

You would need to send some logs, from everything ok, just before you
shut down one node, to when the remaining node shuts down its resources.

This seems to be an haresources style cluster?
Is haresources identical on both nodes?  It has to be.

I assume ha.cf is reciprocal with regard to the peer address used on the
udp line?

I would recommend to just put both IPs as udp lines in the ha.cf,
so you can have ha.cf identical on both nodes as well.
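
For illustration, an ha.cf fragment along these lines (interface name and
addresses are just placeholders) can be identical on both nodes, because the
ucast line that points at the local machine is effectively ignored:

# peer address for node1
ucast eth0 192.168.1.10
# peer address for node2
ucast eth0 192.168.1.11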


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Question about max_child_count

2011-08-17 Thread Lars Ellenberg
On Mon, Aug 08, 2011 at 12:04:49PM +0200, alain.mou...@bull.net wrote:
 Hi
 
 I wonder if the default value of max_child_count (4) has been increased in 
 recent Pacemaker releases,
 or if it is now possible to tune it?

The heartbeat init script does something like this:
$LRMADMIN -p max-children $LRMD_MAX_CHILDREN

where LRMADMIN is simply lrmadmin, and LRMD_MAX_CHILDREN is a variable
supposed to be set in one of the sourced sysconfig or defaults files.

You apparently can only tune it once lrmd is up and running,
lrmd does not read any configuration on its own
(correct me if I'm wrong).
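
For example (the exact file name depends on the distribution - typically
/etc/sysconfig/heartbeat or /etc/default/heartbeat):

# in the sysconfig/defaults file that the heartbeat init script sources:
LRMD_MAX_CHILDREN=8

# or, with lrmd already up and running, directly:
lrmadmin -p max-children 8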

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ocf:heartbeat:Xen: shutdown timeout

2011-08-17 Thread Lars Ellenberg
On Thu, Aug 11, 2011 at 10:56:15AM +0200, Ulrich Windl wrote:
 Hi!
 
 Sorry if this has been discussed before, but I think ocf:heartbeat:Xen does 
 not do what the documentation says about the shutdown timeout:
 
 <parameter name="shutdown_timeout">
 <longdesc lang="en">
 The Xen agent will first try an orderly shutdown using xm shutdown.
 Should this not succeed within this timeout, the agent will escalate to
 xm destroy, forcibly killing the node.
 
 If this is not set, it will default to two-third of the stop action
 timeout.
 
 Setting this value to 0 forces an immediate destroy.
 </longdesc>
 
 The code to set the timeout is this:
 if [ -n "$OCF_RESKEY_shutdown_timeout" ]; then
   timeout=$OCF_RESKEY_shutdown_timeout
 elif [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then
   # Allow 2/3 of the action timeout for the orderly shutdown
   # (The origin unit is ms, hence the conversion)
   timeout=$((OCF_RESKEY_CRM_meta_timeout/1500))
 else
   timeout=60
 fi
 
 The primitive was configured like this:
 primitive prm_v02_xen ocf:heartbeat:Xen params xmfile=/etc/xen/vm/v02 op 
 start timeout=300 op stop timeout=300 op monitor interval=1200 
 timeout=90
 
 So I'd expect 2/3rds of 300s to be 200s. However the syslog says:
 Aug 11 10:14:37 h01 Xen[25140]: INFO: Xen domain v02 will be stopped 
 (timeout: 13s)
 Aug 11 10:14:50 h01 Xen[25140]: WARNING: Xen domain v02 will be destroyed!
 
 According to the code, that's printed here:
 if [ $timeout -gt 0 ]; then
   ocf_log info "Xen domain $dom will be stopped (timeout: ${timeout}s)"
 
 So I guess something is wrong.

There has been a pacemaker bug (or was it lrmd bug?) that caused the
stop action to be sometimes passed an incorrect *_CRM_meta_*
environment.

20000 / 1500 happens to end up being 13 (whereas the configured 300 s stop
timeout would give 300000 / 1500 = 200), so maybe the timeout somehow
used some default value of 20 seconds?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Problem with kvm virtual machine and cluster

2011-08-17 Thread Lars Ellenberg
On Thu, Aug 11, 2011 at 04:04:36PM +1000, Andrew Beekhof wrote:
 On Wed, Aug 10, 2011 at 11:15 PM, Maloja01 maloj...@arcor.de wrote:
  The order constraints do work as I assume, but I guess that
  you are running into a pitfall:
 
  A clone is marked as up if one instance in the cluster is started
  successfully. The order constraint does not say that the clone on the
  same node must be up.
 
 Use a colocation constraint to have that
 
  Kind regards
  Fabian
 
  On 08/10/2011 01:43 PM, i...@umbertocarrara.it wrote:
  hi,
  excuse me for my poor English; I am using Google to help me with the
  translation, and I am a newbie in clustering :-).
 
  I'm trying to set up a cluster with three nodes for virtualization. I have 
  used
  a how-to that I found at http://www.linbit.com/support/ha-kvm.pdf to
  configure the cluster. The VM volumes are shared from an openfiler cluster 
  over iSCSI,
  which works well.
 
  The VMs start fine on the hosts if I start them outside the cluster.
 
  The problem is that the VM starts before libvirt and the open-iscsi initiator.
  I have set an order rule but it seems it won't work.
  Afterwards, when the services are started, the cluster cannot restart the machine.
 
 
  so the output of crm_mon -1 is
  
  Last updated: Wed Aug 10 12:40:20 2011
  Stack: openais
  Current DC: host1 - partition with quorum
  Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
  3 Nodes configured, 3 expected votes
  2 Resources configured.
  
 
  Online: [ host1 host2 host3 ]
 
   Clone Set: BackEndClone
       Started: [ host1 host2 host3 ]
  Samba   (ocf::heartbeat:VirtualDomain) Started [        host1   host2
  host3 ]
 
  Failed actions:
      Samba_monitor_0 (node=host1, call=15, rc=1, status=complete): unknown
  error
      Samba_stop_0 (node=host1, call=16, rc=1, status=complete): unknown 
  error
      Samba_monitor_0 (node=host2, call=12, rc=1, status=complete): unknown
  error
      Samba_stop_0 (node=host2, call=13, rc=1, status=complete): unknown 
  error
      Samba_monitor_0 (node=host3, call=12, rc=1, status=complete): unknown
  error
      Samba_stop_0 (node=host3, call=13, rc=1, status=complete): unknown 
  error
 
 
 
 
  this is my cluster config:
 
  root@host1:~# crm configure show
  node host1 \
          attributes standby=on
  node host2 \
          attributes standby=on
  node host3 \
          attributes standby=on
  primitive Iscsi lsb:open-iscsi \
          op monitor interval=30
  primitive Samba ocf:heartbeat:VirtualDomain \
          params config=/etc/libvirt/qemu/samba.iso.xml \
          meta allow-migrate=true \
          op monitor interval=30
  primitive Virsh lsb:libvirt-bin \
          op monitor interval=30
  group BackEnd Iscsi Virsh
  clone BackEndClone BackEnd \
          meta target-role=Started
  colocation SambaOnBackEndClone inf: Samba BackEndClone
  order SambaBeforeBackEndClone inf: BackEndClone Samba

I think you want to reverse those to do what their id implies:

colocation SambaOnBackEndClone inf: BackEndClone Samba
order SambaBeforeBackEndClone inf: Samba BackEndClone

  property $id=cib-bootstrap-options \
          dc-version=1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b \
          cluster-infrastructure=openais \
          expected-quorum-votes=3 \
          stonith-enabled=false \
          no-quorum-policy=ignore \
          default-action-timeout=100 \
          last-lrm-refresh=1312970592
  rsc_defaults $id=rsc-options \
          resource-stickiness=200
 
  my log is:
 
  Aug 10 13:36:34 host1 pengine: [1923]: info: get_failcount: Samba has 
  failed
  INFINITY times on host1
  Aug 10 13:36:34 host1 pengine: [1923]: WARN: common_apply_stickiness: 
  Forcing
  Samba away from host1 after 100 failures (max=100)
  Aug 10 13:36:34 host1 pengine: [1923]: info: get_failcount: Samba has 
  failed
  INFINITY times on host2
  Aug 10 13:36:34 host1 pengine: [1923]: WARN: common_apply_stickiness: 
  Forcing
  Samba away from host2 after 100 failures (max=100)
  Aug 10 13:36:34 host1 pengine: [1923]: info: get_failcount: Samba has 
  failed
  INFINITY times on host3
  Aug 10 13:36:34 host1 pengine: [1923]: WARN: common_apply_stickiness: 
  Forcing
  Samba away from host3 after 100 failures (max=100)
  Aug 10 13:36:34 host1 pengine: [1923]: info: native_merge_weights:
  BackEndClone: Rolling back scores from Samba
  Aug 10 13:36:34 host1 pengine: [1923]: info: native_color: Unmanaged 
  resource
  Samba allocated to 'nowhere': failed
  Aug 10 13:36:34 host1 pengine: [1923]: WARN: native_create_actions: 
  Attempting
  recovery of resource Samba
  Aug 10 13:36:34 host1 pengine: [1923]: notice: LogActions: Leave resource
  Iscsi:0       (Started host1)
  Aug 10 13:36:34 host1 pengine: [1923]: notice: LogActions: Leave resource
  Virsh:0       (Started host1)
  Aug 10 13:36:34 host1 pengine: [1923]: notice: LogActions: Leave resource
  Iscsi:1       (Started host2)
  Aug 10 13:36:34 host1 pengine: [1923]: notice: LogActions: Leave resource
  Virsh:1       (Started host2)
  

Re: [Linux-HA] about STONITH in HA

2011-08-17 Thread Lars Ellenberg
On Fri, Aug 12, 2011 at 08:58:05AM +1000, Andrew Beekhof wrote:
 On Thu, Aug 11, 2011 at 9:29 PM, Sam Sun sam@ericsson.com wrote:
  Hi All,
  This is Sam from the Ericsson IPWorks product maintenance team. We have an 
  urgent problem with our Linux HA solution.
  I am not sure if this is the right mailbox; however, it would be very much 
  appreciated if anyone can help us.
  Our product uses SLES 10 SP4 x86_64 with HA version 2.1.4-0.24.9.
 
 I'd contact SUSE - you pay them to give you their full attention  :-)
 
  We have a problem with the STONITH implementation. There are only two nodes 
  in the HA cluster.
     If there is a split-brain situation, will the two HA nodes both shut down 
  their peer at the same time?
 
 Yes
 
     If we then only let STONITH run on one of the HA nodes, is that a correct 
  configuration?
 
 No.
 
  Is there any best practice for a STONITH implementation in an HA cluster 
  which only has two nodes?

I assume you are already aware of http://ourobengr.com/ha

Besides that, you may want to add a random (or node dependent) timeout
to the stonith agent action, to increase the chance during a split brain
that one shoots the other before being shot itself.

So e.g. you have nodes A and B, and you modify the stonith agent
to always sleep(x) on node A when shooting node B, but to not do any
sleep on node B when shooting node A.

If it is an actual node crash, worst case you need x more seconds for
the stonith action. If it was a split brain, both nodes still alive,
chances are that only A will be shot.

Typically the DC before the split brain will have a slight advantage
anyways, so simultaneously successfully shooting each other should not
be that common.
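
A very rough sketch of the sleep idea above, as a snippet one might add near the
top of the agent's reset/off handling (node names and the delay are placeholders,
and the details depend on the particular stonith plugin):

# give node B a head start: only node A pauses before shooting its peer
case "$(uname -n)" in
    nodeA) sleep 10 ;;   # node A delays, so in a split brain B usually wins
    nodeB) : ;;          # node B shoots immediately
esac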

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Renaming a running resource: to do, or not to do?

2011-08-17 Thread Lars Ellenberg
On Fri, Aug 12, 2011 at 08:21:49AM +0200, Ulrich Windl wrote:
  Andrew Beekhof and...@beekhof.net schrieb am 12.08.2011 um 02:53 in 
  Nachricht
 CAEDLWG2pQijx=vwsqyrkfsl74i1xrjidcqcvjs+cvqwmxnm...@mail.gmail.com:
  On Thu, Aug 11, 2011 at 5:37 PM, Ulrich Windl
  ulrich.wi...@rz.uni-regensburg.de wrote:
   Hi!
  
   Using crm shell, you cannot rename a running resource. However I managed 
   to 
  do it via a shadow cib: I renamed the resource in the shadow cib, then 
  committed the shadow cib.
   From the XML changes, I got the impression that the old primitive is 
  removed, and then the new primitive is added. This caused the old resource 
  to 
  be stopped, the new one to be started, and one resource that was a 
  successor 
  in the group to be restarted.
  
   There was a temporary active orphan (the old name) and Configuration 
  WARNINGs found during PE processing, but that vanished when the states 
  changed (transitions completed).
  
   So obviously there is no rename operation for resources. However when you 
  add more and more resources to your cluster, one might find the point where 
  some renaming for consistency might be a good idea. In principle that could be 
  done online without taking any resource down, but the LRM seems not to be 
  prepared for that. Are there any technical reasons for that?
  prepared for that. Are there any technical reasons for that?
  
  The resource name is the equivalent of a primary key in a database table.
  It's the sole point of comparison when deciding if two resources are
  the same; therefore rename is not a valid operation to consider.
  Any implementation would have to use delete + create underneath.
 
 Hi!
 
 In a database you can change the primary key as long as you do it
 consistently (in a transaction). I think no CRM would work without
 transactions anyway, so this could be done, IMHO. If every name is unique
 between transactions, I see no problem changing the names of running
 resources. The only thing is that one might allow an active monitor to
 finish before renaming the resource. Also: I don't want to rename the
 RAs, I just want to rename the resources.

patches accepted ;-)

What I do:

- maintenance-mode on,
- reconfigure all I want,
- start/stop/move/migrate/whatever by hand, if I want/need that,
- reprobe and cleanup until lrmd and pacemaker all know the new state of
  the universe,
- maintenance-mode off

For simple rename-only operations, as you seem to need,
that can easily be scripted.
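
A rough outline of such a script using the crm shell (resource names are
placeholders; review the edited configuration before loading it - this is a
sketch, not a tested tool):

crm configure property maintenance-mode=true
crm configure show > /tmp/cib.crm
sed -i 's/\bold_rsc_name\b/new_rsc_name/g' /tmp/cib.crm
crm configure load replace /tmp/cib.crm
crm resource cleanup new_rsc_name
crm resource reprobe
crm configure property maintenance-mode=false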

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems