Re: [Linux-ha-dev] Announcing - the Assimilation monitoring system - a sub-project of Linux-HA
On 08/16/2011 05:08 PM, Angus Salkeld wrote:

On Fri, Aug 12, 2011 at 03:11:36PM -0600, Alan Robertson wrote:

Hi,

Back last November or so, I started work on a new monitoring project - for monitoring servers and services. Its aims are:
- Scalable virtually without limit - tens of thousands of servers is not a problem
- Easy installation and upkeep - includes integrated discovery of servers, services and switches - without setting off security alarms ;-)

This project isn't ready for a public release yet (it's in a fairly early stage), but it seemed worthwhile to let others know that the project exists, and start getting folks to read over the code, and perhaps begin to play with it a bit as well. The project has two arenas of operation:

nanoprobes - which run in (nearly) every monitored machine

Why not matahari (http://matahariproject.org/)?

Collective management - running in a central server (or HA cluster).

Quite similar to http://pacemaker-cloud.org/. Seems a shame not to be working together.

-Angus

This is a set of ideas I've been working on for the last four years or so. My most grandiose vision of it I called a Data Center Operating System. This is about the same time that Amazon announced their first cloud offering (unknown to me). There are a few hints about it from a couple of years ago in my blog.

I heard a little about Andrew's project when I announced this back in November. Andrew has made it perfectly clear that he doesn't want to work with me (really, absolutely, abundantly, perfectly, crystal clear), and there is evidence that he doesn't work well with others besides me - so that's not a possibility.

In the short term I'm not especially concerned with clouds - just with any collection of computers ranging from 4 up to and above cloud scale. That includes clouds of course - but we'll get a lot more users at the small scale than we will at cloud scale. There are several reasons for this approach:
- Existing monitoring software sucks.
- Many more collections of computers besides clouds exist and need help - although this would work very well with clouds.

This problem has dimensions that a cloud environment doesn't have. In a cloud, all deployment is automated, so you can _know_ what is running where. In a more conventional data center, having a way to discover what's in your data center, and what's running on those servers, is important.

-- 
Alan Robertson al...@unix.sh

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
Re: [Linux-ha-dev] OCF RA for named
On Tue, Aug 16, 2011 at 08:51:04AM -0600, Serge Dubrouski wrote:

On Tue, Aug 16, 2011 at 8:44 AM, Dejan Muhamedagic de...@suse.de wrote:

Hi Serge,

On Fri, Aug 05, 2011 at 08:19:52AM -0600, Serge Dubrouski wrote:

No interest?

Probably not true :) It's just that recently I've been away for a while and in between really swamped with my daily work. I'm trying to catch up now, but it may take a while. In the meantime, I'd like to ask you about the motivation. DNS already has a sort of redundancy built in through its primary/secondary servers.

That redundancy doesn't work all that well. Yes, you can have primary and secondary servers configured in resolv.conf, but if the primary is down the resolver waits until the request to the primary server times out before it sends the request to the secondary one. The delay can be up to 30 seconds and impacts some applications pretty badly. This is standard behaviour for Linux; Solaris, for example, works differently and isn't impacted by this issue. Workarounds are having a caching DNS server running locally, or making the primary DNS server highly available using Pacemaker :-)

Here is what the man page for resolv.conf says:

nameserver Name server IP address
    Internet address (in dot notation) of a name server that the resolver should query. Up to MAXNS (currently 3, see resolv.h) name servers may be listed, one per keyword. If there are multiple servers, the resolver library queries them in the order listed. If no nameserver entries are present, the default is to use the name server on the local machine. (The algorithm used is to try a name server, and if the query times out, try the next, until out of name servers, then repeat trying all the name servers until a maximum number of retries are made.)

options timeout:2 attempts:5 rotate

but yes, it is still a valid use case to have a clustered primary name server, and possibly multiple backups.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
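[Editor's note] For illustration, here is roughly what the resolver-side mitigation mentioned above would look like in /etc/resolv.conf. The addresses are placeholders; the timeout/attempts/rotate options are as documented in resolv.conf(5):

    # /etc/resolv.conf - example values only
    nameserver 192.0.2.10      # primary (or a Pacemaker-managed virtual IP)
    nameserver 192.0.2.11      # secondary
    # give up on a server after 2s instead of the default,
    # retry up to 5 times, and rotate queries across the listed servers
    options timeout:2 attempts:5 rotate

Note that this only shortens the client-side delay rather than removing it, which is why a highly available primary name server - the point of the RA under discussion - is still attractive.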
Re: [Linux-ha-dev] Announcing - the Assimilation monitoring system - a sub-project of Linux-HA
On Wed, Aug 17, 2011 at 09:51:14AM -0600, Alan Robertson wrote:

On 08/16/2011 05:08 PM, Angus Salkeld wrote:

On Fri, Aug 12, 2011 at 03:11:36PM -0600, Alan Robertson wrote:

Hi,

Back last November or so, I started work on a new monitoring project - for monitoring servers and services. Its aims are:
- Scalable virtually without limit - tens of thousands of servers is not a problem
- Easy installation and upkeep - includes integrated discovery of servers, services and switches - without setting off security alarms ;-)

This project isn't ready for a public release yet (it's in a fairly early stage), but it seemed worthwhile to let others know that the project exists, and start getting folks to read over the code, and perhaps begin to play with it a bit as well. The project has two arenas of operation:

nanoprobes - which run in (nearly) every monitored machine

Why not matahari (http://matahariproject.org/)?

Collective management - running in a central server (or HA cluster).

Quite similar to http://pacemaker-cloud.org/. Seems a shame not to be working together.

-Angus

This is a set of ideas I've been working on for the last four years or so. My most grandiose vision of it I called a Data Center Operating System. This is about the same time that Amazon announced their first cloud offering (unknown to me). There are a few hints about it from a couple of years ago in my blog.

I heard a little about Andrew's project when I announced this back in November. Andrew has made it perfectly clear that he doesn't want to work with me (really, absolutely, abundantly, perfectly, crystal clear), and there is evidence that he doesn't work well with others besides me - so that's not a possibility.

Oops, seems like I have really stepped in it. Sorry for bringing this up again.

In the short term I'm not especially concerned with clouds - just with any collection of computers ranging from 4 up to and above cloud scale. That includes clouds of course - but we'll get a lot more users at the small scale than we will at cloud scale. There are several reasons for this approach:
- Existing monitoring software sucks.
- Many more collections of computers besides clouds exist and need help - although this would work very well with clouds.

This problem has dimensions that a cloud environment doesn't have. In a cloud, all deployment is automated, so you can _know_ what is running where. In a more conventional data center, having a way to discover what's in your data center, and what's running on those servers, is important.

Well, technically this could easily be done in one project, but it seems that's not going to happen (with everyone working on it).

-Angus

-- 
Alan Robertson al...@unix.sh

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
Re: [Linux-ha-dev] Announcing - the Assimilation monitoring system - a sub-project of Linux-HA
On 08/17/2011 07:16 PM, Angus Salkeld wrote:

On Wed, Aug 17, 2011 at 09:51:14AM -0600, Alan Robertson wrote:

On 08/16/2011 05:08 PM, Angus Salkeld wrote:

On Fri, Aug 12, 2011 at 03:11:36PM -0600, Alan Robertson wrote:

Hi,

Back last November or so, I started work on a new monitoring project - for monitoring servers and services. Its aims are:
- Scalable virtually without limit - tens of thousands of servers is not a problem
- Easy installation and upkeep - includes integrated discovery of servers, services and switches - without setting off security alarms ;-)

This project isn't ready for a public release yet (it's in a fairly early stage), but it seemed worthwhile to let others know that the project exists, and start getting folks to read over the code, and perhaps begin to play with it a bit as well. The project has two arenas of operation:

nanoprobes - which run in (nearly) every monitored machine

Why not matahari (http://matahariproject.org/)?

Collective management - running in a central server (or HA cluster).

Quite similar to http://pacemaker-cloud.org/. Seems a shame not to be working together.

-Angus

This is a set of ideas I've been working on for the last four years or so. My most grandiose vision of it I called a Data Center Operating System. This is about the same time that Amazon announced their first cloud offering (unknown to me). There are a few hints about it from a couple of years ago in my blog.

I heard a little about Andrew's project when I announced this back in November. Andrew has made it perfectly clear that he doesn't want to work with me (really, absolutely, abundantly, perfectly, crystal clear), and there is evidence that he doesn't work well with others besides me - so that's not a possibility.

Oops, seems like I have really stepped in it. Sorry for bringing this up again.

In the short term I'm not especially concerned with clouds - just with any collection of computers ranging from 4 up to and above cloud scale. That includes clouds of course - but we'll get a lot more users at the small scale than we will at cloud scale. There are several reasons for this approach:
- Existing monitoring software sucks.
- Many more collections of computers besides clouds exist and need help - although this would work very well with clouds.

This problem has dimensions that a cloud environment doesn't have. In a cloud, all deployment is automated, so you can _know_ what is running where. In a more conventional data center, having a way to discover what's in your data center, and what's running on those servers, is important.

Well, technically this could easily be done in one project, but it seems that's not going to happen (with everyone working on it).

-Angus

Linux is fairly described as an ecosystem. Differing branches and methods of solving a given problem are tried, and the one with the most backing and merit wins. It's part of what makes open source what it is. So, from my point of view, best of luck to both. :)

-- 
Digimer
E-Mail: digi...@alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"At what point did we forget that the Space Shuttle was, essentially, a program that strapped human beings to an explosion and tried to stab through the sky with fire and math?"
[Linux-HA] mysql problem when enabling semi-synchronous replication with corosync, mysql_home variable is mandatory
Hi folks,

Today I encountered some difficulties using the mysql resource in a semi-synchronous replication setup with MySQL and corosync. As you probably know, to use semi-synchronous replication the MySQL plugins are mandatory. If you try to start MySQL without specifying the MySQL base directory (basedir), you will get some error messages and mysqld will shut down immediately, as demonstrated below:

[root@mysql-qual1 heartbeat]# /u00/app/mysql/product/mysql-5.5.15/bin/mysqld --defaults-file=/etc/my.cnf --pid_file=/u00/app/mysql/admin/mysqld1/socket/mysqld1.pid --socket=/u00/app/mysql/admin/mysqld1/socket/mld1.sock --datadir=/u01/mysqldata/mysqld1 --user=mysql --port=33001
110816 18:16:32 [ERROR] Can't find messagefile '/usr/local/mysql/share/errmsg.sys'
110816 18:16:32 [Note] Plugin 'FEDERATED' is disabled.
110816 18:16:32 [ERROR]
110816 18:16:32 [Warning] Couldn't load plugin named 'rpl_semi_sync_master' with soname 'semisync_master.so'.
110816 18:16:32 [ERROR]
110816 18:16:32 [Warning] Couldn't load plugin named 'rpl_semi_sync_slave' with soname 'semisync_slave.so'.
110816 18:16:32 InnoDB: The InnoDB memory heap is disabled
110816 18:16:32 InnoDB: Mutexes and rw_locks use GCC atomic builtins
110816 18:16:32 InnoDB: Compressed tables use zlib 1.2.3
110816 18:16:32 InnoDB: Using Linux native AIO
110816 18:16:32 InnoDB: Initializing buffer pool, size = 128.0M
110816 18:16:32 InnoDB: Completed initialization of buffer pool
110816 18:16:32 InnoDB: highest supported file format is Barracuda.
110816 18:16:32 InnoDB: Waiting for the background threads to start
110816 18:16:33 InnoDB: 1.1.8 started; log sequence number 1682335
110816 18:16:33 [ERROR] Aborting
110816 18:16:33 InnoDB: Starting shutdown...
110816 18:16:33 InnoDB: Shutdown completed; log sequence number 1682335
110816 18:16:33 [Note]

To solve this problem it is mandatory to set basedir, which is not done in the file /usr/lib/ocf/resource.d/heartbeat/mysql; otherwise your mysql resource won't be able to start. Additionally, it would be really clever to have the possibility to set the port without using additional_parameters. I made some tests by setting the two additional variables (OCF_RESKEY_port and OCF_RESKEY_basedir) among the settings below, and it works perfectly:

export OCF_RESKEY_binary=/u00/app/mysql/product/mysql-5.5.15/bin/mysqld
export OCF_RESKEY_client_binary=/u00/app/mysql/product/mysql-5.5.15/bin/mysql
export OCF_RESKEY_config=/u00/app/mysql/etc/my.cnf
export OCF_RESKEY_datadir=/u01/mysqldata/mysqld1
export OCF_RESKEY_user=mysql
export OCF_RESKEY_group=mysql
export OCF_RESKEY_test_table=mysql.user
export OCF_RESKEY_test_user=root
export OCF_RESKEY_test_passwd=manager
export OCF_RESKEY_log=/u00/app/mysql/admin/mysqld1/log/mysqld1.log
export OCF_RESKEY_pid=/u00/app/mysql/admin/mysqld1/socket/mysqld1.pid
export OCF_RESKEY_socket=/u00/app/mysql/admin/mysqld1/socket/mysqld1.sock
export OCF_RESKEY_port=33001
export OCF_RESKEY_basedir=/u00/app/mysql/product/mysql-5.5.15

In order to be able to use this in the corosync environment, the following changes are needed in the file /usr/lib/ocf/resource.d/heartbeat/mysql:

mysql_start() {
    ...
    ${OCF_RESKEY_binary} --defaults-file=$OCF_RESKEY_config \
        --pid-file=$OCF_RESKEY_pid \
        --socket=$OCF_RESKEY_socket \
        --datadir=$OCF_RESKEY_datadir \
        --basedir=$OCF_RESKEY_basedir \
        --port=$OCF_RESKEY_port \
        --user=$OCF_RESKEY_user $OCF_RESKEY_additional_parameters >/dev/null 2>&1 &
    rc=$?
    ...
Best regards,
Greg

Gregory Steulet
Senior Consultant / Delivery Manager
gregory.steu...@dbi-services.com | +41 79 963 43 69 | www.dbi-services.com
dbi services Lausanne, chemin Maillefer 36 | CH-1052 Le Mont-sur-Lausanne
Blog: www.dbi-services.com/blog
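[Editor's note] For context, here is roughly how such a resource could then be declared with the crm shell, assuming the RA were extended to accept port and basedir as proper parameters. The parameter names simply mirror the OCF_RESKEY_* variables above; this is a sketch, not the stock agent's current interface:

    primitive p_mysqld1 ocf:heartbeat:mysql \
        params binary="/u00/app/mysql/product/mysql-5.5.15/bin/mysqld" \
            config="/u00/app/mysql/etc/my.cnf" \
            datadir="/u01/mysqldata/mysqld1" \
            pid="/u00/app/mysql/admin/mysqld1/socket/mysqld1.pid" \
            socket="/u00/app/mysql/admin/mysqld1/socket/mysqld1.sock" \
            user="mysql" group="mysql" \
            port="33001" basedir="/u00/app/mysql/product/mysql-5.5.15" \
        op monitor interval="30s" timeout="30s"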
[Linux-HA] Bug or feature: crm_mon -1n
Hi,

I wonder why no groups are displayed for "crm_mon -1n": the resources are sorted by the node they are running on, but some of them still belong to groups. However, the groups are no longer shown with -n (node view). That display doesn't make much sense with groups, IMHO.

Regards,
Ulrich
Re: [Linux-HA] IPaddr not work correctly on large routing tables on Linux
Hi,

On Fri, Jun 24, 2011 at 08:10:58PM +0200, Lars Ellenberg wrote:

On Fri, Jun 24, 2011 at 10:40:36AM +0200, Dejan Muhamedagic wrote:

Hi,

On Tue, May 17, 2011 at 04:10:14PM +0400, Alexander Polyakov wrote:

IPaddr does not work correctly with large routing tables on Linux. If the routing table is large (e.g. a full view, received from quagga) then the node cannot remove the public IP address. Two processes hang in top: route and grep, removing the IP. Because of this, failover of the node does not work properly.

Just how big is your routing table? What does "time route -n" show?

Oh, a full routing table can get quite large. The BGP table currently has 350k entries ;-)

Besides, there is no point in doing "route -n | grep $IP" first, and then doing "route -n del -host $IP". The grep is wrong anyways, as it may match more than just the intended IP. Testing on non-empty output ([ `x | y` ]) instead of the exit code (x | y >/dev/null) looks a bit strange as well. And if it's not there, route del may complain, but all is still good and shiny. So I'm fine with just dropping that "if route -n | grep $IP; then".

And it has been removed.

Cheers,
Dejan

Possibly IPaddr2 works better for Alexander?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
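[Editor's note] To make the point above concrete, a minimal sketch of the two styles, with $IP standing in for the address being released (illustrative shell, not the exact IPaddr code):

    # old pattern: scans the full routing table through grep (expensive with 350k
    # entries) and the grep may match more than the intended address
    if [ -n "`route -n | grep $IP`" ]; then
        route del -host "$IP"
    fi

    # simpler: just attempt the delete and ignore the complaint if the route is absent
    route del -host "$IP" >/dev/null 2>&1 || true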
Re: [Linux-HA] Stopping heartbeat on secondary node causes primary to fail
On Tue, Aug 09, 2011 at 04:03:40PM -0700, Chris Huber-Lantz wrote:

Thanks for taking the time to post; I'm hoping you could clarify a bit. From the linux-ha documentation on the ucast directive:

"Note that ucast directives which go to the local machine are effectively ignored. This allows the ha.cf directives on all machines to be identical."

This would seem to contradict what you are saying about the active primary needing to see its own heartbeat in order to hold on to its resources. Adding to this is the fact that if we restart heartbeat on the primary, it comes up as normal and, not seeing a heartbeat from the secondary, assumes control of the resources. If the problem were lack of a local heartbeat, I don't understand why the server could come back up like this.

You would need to send some logs, from "everything ok", just before you shut down one node, to when the remaining node shuts down its resources.

This seems to be an haresources-style cluster? Is haresources identical on both nodes? It has to be. I assume ha.cf is reciprocal with regard to the peer address used on the ucast line? I would recommend just putting both IPs as ucast lines in ha.cf, so you can have ha.cf identical on both nodes as well.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
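[Editor's note] Concretely, the quoted documentation implies an ha.cf fragment like the following can be deployed unchanged on both nodes (interface name and addresses are placeholders): since a ucast directive pointing at the local machine is ignored, each node effectively only heartbeats to its peer.

    # ha.cf - identical on both nodes (example addresses)
    ucast eth0 192.168.1.10    # node1's address - ignored on node1 itself
    ucast eth0 192.168.1.11    # node2's address - ignored on node2 itself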
Re: [Linux-HA] Question about max_child_count
On Mon, Aug 08, 2011 at 12:04:49PM +0200, alain.mou...@bull.net wrote:

Hi,

I wonder if the default value of max_child_count (4) has been increased in the latest Pacemaker releases? Or if it is now possible to tune it?

The heartbeat init script does something like this:

    $LRMADMIN -p max-children $LRMD_MAX_CHILDREN

where LRMADMIN is simply lrmadmin, and LRMD_MAX_CHILDREN is a variable supposed to be set in one of the sourced sysconfig or defaults files. Apparently you can only tune it once lrmd is up and running; lrmd does not read any configuration on its own (correct me if I'm wrong).

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
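[Editor's note] A sketch of what that tuning looks like in practice; the sysconfig file name differs by distribution, so treat the path as an assumption:

    # e.g. /etc/sysconfig/heartbeat or /etc/default/heartbeat (distro-specific path)
    LRMD_MAX_CHILDREN=8

    # or change it on a running system, once lrmd is up:
    lrmadmin -p max-children 8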
Re: [Linux-HA] ocf:heartbeat:Xen: shutdown timeout
On Thu, Aug 11, 2011 at 10:56:15AM +0200, Ulrich Windl wrote:

Hi!

Sorry if this has been discussed before, but I think ocf:heartbeat:Xen does not do what the documentation says about the timeout:

<parameter name="shutdown_timeout">
<longdesc lang="en">
The Xen agent will first try an orderly shutdown using xm shutdown. Should this not succeed within this timeout, the agent will escalate to xm destroy, forcibly killing the node. If this is not set, it will default to two-thirds of the stop action timeout. Setting this value to 0 forces an immediate destroy.
</longdesc>

The code to set the timeout is this:

if [ -n "$OCF_RESKEY_shutdown_timeout" ]; then
    timeout=$OCF_RESKEY_shutdown_timeout
elif [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then
    # Allow 2/3 of the action timeout for the orderly shutdown
    # (The origin unit is ms, hence the conversion)
    timeout=$((OCF_RESKEY_CRM_meta_timeout/1500))
else
    timeout=60
fi

The primitive was configured like this:

primitive prm_v02_xen ocf:heartbeat:Xen \
    params xmfile=/etc/xen/vm/v02 \
    op start timeout=300 \
    op stop timeout=300 \
    op monitor interval=1200 timeout=90

So I'd expect 2/3rds of 300s to be 200s. However, syslog says:

Aug 11 10:14:37 h01 Xen[25140]: INFO: Xen domain v02 will be stopped (timeout: 13s)
Aug 11 10:14:50 h01 Xen[25140]: WARNING: Xen domain v02 will be destroyed!

According to the code, that's printed here:

if [ $timeout -gt 0 ]; then
    ocf_log info "Xen domain $dom will be stopped (timeout: ${timeout}s)"

So I guess something is wrong.

There has been a pacemaker bug (or was it an lrmd bug?) that caused the stop action to sometimes be passed an incorrect *_CRM_meta_* environment. 20000 / 1500 happens to end up being 13, so maybe the timeout somehow used some default value of 20 seconds?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
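[Editor's note] A quick sanity check of that arithmetic, assuming (as the RA comment says) that OCF_RESKEY_CRM_meta_timeout is in milliseconds:

    echo $(( 300000 / 1500 ))   # configured stop timeout 300s = 300000 ms -> 200s, as expected
    echo $((  20000 / 1500 ))   # a 20s (20000 ms) value -> 13s, matching the logged "timeout: 13s"

So the observed 13 seconds is consistent with the stop action having seen a 20-second meta timeout rather than the configured 300 seconds.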
Re: [Linux-HA] Problem with kvm virtual machine and cluster
On Thu, Aug 11, 2011 at 04:04:36PM +1000, Andrew Beekhof wrote:

On Wed, Aug 10, 2011 at 11:15 PM, Maloja01 maloj...@arcor.de wrote:

The order constraints do work as I assume, but I guess that you ran into a pitfall: a clone is marked as up if one instance in the cluster has started successfully. The order does not say that the clone on the same node must be up. Use a colocation constraint to get that.

Kind regards,
Fabian

On 08/10/2011 01:43 PM, i...@umbertocarrara.it wrote:

hi, excuse me for my poor English, I use Google to help me with translation and I am a newbie in clustering :-). I'm trying to set up a cluster with three nodes for virtualization. I have used a how-to that I found at http://www.linbit.com/support/ha-kvm.pdf to configure the cluster; the volumes of the VMs are shared from an openfiler cluster over iSCSI, which works well. The VMs start fine on the hosts if I start them outside the cluster. The problem is that the VM starts before libvirt and the open-iscsi initiator; I have set an order rule but it seems it doesn't work. Afterwards, when the services are started, the cluster cannot restart the machine, so the output of crm_mon -1 is:

Last updated: Wed Aug 10 12:40:20 2011
Stack: openais
Current DC: host1 - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
3 Nodes configured, 3 expected votes
2 Resources configured.

Online: [ host1 host2 host3 ]

Clone Set: BackEndClone
    Started: [ host1 host2 host3 ]
Samba (ocf::heartbeat:VirtualDomain) Started [ host1 host2 host3 ]

Failed actions:
    Samba_monitor_0 (node=host1, call=15, rc=1, status=complete): unknown error
    Samba_stop_0 (node=host1, call=16, rc=1, status=complete): unknown error
    Samba_monitor_0 (node=host2, call=12, rc=1, status=complete): unknown error
    Samba_stop_0 (node=host2, call=13, rc=1, status=complete): unknown error
    Samba_monitor_0 (node=host3, call=12, rc=1, status=complete): unknown error
    Samba_stop_0 (node=host3, call=13, rc=1, status=complete): unknown error

this is my cluster config:

root@host1:~# crm configure show
node host1 \
    attributes standby=on
node host2 \
    attributes standby=on
node host3 \
    attributes standby=on
primitive Iscsi lsb:open-iscsi \
    op monitor interval=30
primitive Samba ocf:heartbeat:VirtualDomain \
    params config=/etc/libvirt/qemu/samba.iso.xml \
    meta allow-migrate=true \
    op monitor interval=30
primitive Virsh lsb:libvirt-bin \
    op monitor interval=30
group BackEnd Iscsi Virsh
clone BackEndClone BackEnd \
    meta target-role=Started
colocation SambaOnBackEndClone inf: Samba BackEndClone
order SambaBeforeBackEndClone inf: BackEndClone Samba

I think you want to reverse those to do what their id implies:

colocation SambaOnBackEndClone inf: BackEndClone Samba
order SambaBeforeBackEndClone inf: Samba BackEndClone

property $id=cib-bootstrap-options \
    dc-version=1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b \
    cluster-infrastructure=openais \
    expected-quorum-votes=3 \
    stonith-enabled=false \
    no-quorum-policy=ignore \
    default-action-timeout=100 \
    last-lrm-refresh=1312970592
rsc_defaults $id=rsc-options \
    resource-stickiness=200

my log is:

Aug 10 13:36:34 host1 pengine: [1923]: info: get_failcount: Samba has failed INFINITY times on host1
Aug 10 13:36:34 host1 pengine: [1923]: WARN: common_apply_stickiness: Forcing Samba away from host1 after 100 failures (max=100)
Aug 10 13:36:34 host1 pengine: [1923]: info: get_failcount: Samba has failed INFINITY times on host2
Aug 10 13:36:34 host1 pengine: [1923]: WARN: common_apply_stickiness: Forcing Samba away from host2 after 100 failures (max=100)
Aug 10 13:36:34 host1 pengine: [1923]: info: get_failcount: Samba has failed INFINITY times on host3
Aug 10 13:36:34 host1 pengine: [1923]: WARN: common_apply_stickiness: Forcing Samba away from host3 after 100 failures (max=100)
Aug 10 13:36:34 host1 pengine: [1923]: info: native_merge_weights: BackEndClone: Rolling back scores from Samba
Aug 10 13:36:34 host1 pengine: [1923]: info: native_color: Unmanaged resource Samba allocated to 'nowhere': failed
Aug 10 13:36:34 host1 pengine: [1923]: WARN: native_create_actions: Attempting recovery of resource Samba
Aug 10 13:36:34 host1 pengine: [1923]: notice: LogActions: Leave resource Iscsi:0 (Started host1)
Aug 10 13:36:34 host1 pengine: [1923]: notice: LogActions: Leave resource Virsh:0 (Started host1)
Aug 10 13:36:34 host1 pengine: [1923]: notice: LogActions: Leave resource Iscsi:1 (Started host2)
Aug 10 13:36:34 host1 pengine: [1923]: notice: LogActions: Leave resource Virsh:1 (Started host2)
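[Editor's note] To make the constraint discussion above easier to follow, here is the general meaning of the two crm shell statement types involved - a neutral reference sketch with placeholder resource names, not a recommendation for this particular configuration:

    # Given two resources A and B in crm shell syntax:
    order      o_a_before_b inf: A B    # A must be started before B
    colocation c_a_with_b   inf: A B    # A is placed on the node where B is running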
Re: [Linux-HA] about STONITH in HA
On Fri, Aug 12, 2011 at 08:58:05AM +1000, Andrew Beekhof wrote:

On Thu, Aug 11, 2011 at 9:29 PM, Sam Sun sam@ericsson.com wrote:

Hi All,

This is Sam from the Ericsson IPWorks product maintenance team. We have an urgent problem with our Linux-HA solution. I am not sure if this is the right mailbox, but it would be very much appreciated if anyone could help us. Our product uses SLES 10 SP4 x86_64 with HA version 2.1.4-0.24.9.

I'd contact SUSE - you pay them to give you their full attention :-)

We have a problem with the STONITH implementation. There are only two nodes in the HA cluster. If there is a split-brain situation, will the two HA nodes both shut down their peer at the same time?

Yes.

If we then only let STONITH run on one of the HA nodes, is this a right configuration?

No.

Is there any best practice for a STONITH implementation in an HA cluster which only has two nodes?

I assume you are already aware of http://ourobengr.com/ha

Besides that, you may want to add a random (or node-dependent) delay to the stonith agent action, to increase the chance during a split brain that one node shoots the other before being shot itself. So e.g. you have nodes A and B, and you modify the stonith agent to always sleep(x) on node A when shooting node B, but to not do any sleep on node B when shooting node A. If it is an actual node crash, in the worst case you need x more seconds for the stonith action. If it was a split brain with both nodes still alive, chances are that only A will be shot. Typically the DC before the split brain will have a slight advantage anyways, so both nodes simultaneously and successfully shooting each other should not be that common.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
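[Editor's note] A minimal sketch of the node-dependent delay described above; the node name, delay value, and fence_peer command are placeholders, so this only illustrates the idea rather than a finished stonith plugin:

    # wrapper around the real fencing action
    DELAY=0
    if [ "$(uname -n)" = "nodeA" ]; then
        DELAY=10        # only node A waits before shooting its peer
    fi
    sleep "$DELAY"
    exec fence_peer "$@"   # placeholder for the actual fencing command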
Re: [Linux-HA] Antw: Re: Renaming a running resource: to do, or not to do?
On Fri, Aug 12, 2011 at 08:21:49AM +0200, Ulrich Windl wrote:

Andrew Beekhof and...@beekhof.net wrote on 12.08.2011 at 02:53 in message CAEDLWG2pQijx=vwsqyrkfsl74i1xrjidcqcvjs+cvqwmxnm...@mail.gmail.com:

On Thu, Aug 11, 2011 at 5:37 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

Hi!

Using the crm shell, you cannot rename a running resource. However, I managed to do it via a shadow CIB: I renamed the resource in the shadow CIB, then committed the shadow CIB. From the XML changes, I got the impression that the old primitive is removed and then the new primitive is added. This caused the old resource to be stopped, the new one to be started, and one resource that was a successor in the group to be restarted. There was a temporarily active orphan (the old name) and "Configuration WARNINGs found during PE processing", but that vanished when the states changed (transitions completed). So obviously there is no rename operation for resources. However, when you add more and more resources to your cluster, you might reach the point where some renaming for consistency would be a good idea. In principle that could be done online without taking any resource down, but the LRM seems not to be prepared for that. Are there any technical reasons for that?

The resource name is the equivalent of a primary key in a database table. It's the sole point of comparison when deciding if two resources are the same, therefore rename is not a valid operation to consider. Any implementation would have to use delete + create underneath.

Hi!

In a database you can change the primary key as long as you do it consistently (in a transaction). I think no CRM would work without transactions anyway, so this could be done IMHO. As long as names remain unique between transactions, I see no problem changing the names of running resources. The only thing is that one might allow an active monitor to finish before renaming the resource. Also: I don't want to rename the RAs, I just want to rename the resources.

patches accepted ;-)

What I do:
- maintenance-mode on
- reconfigure all I want
- start/stop/move/migrate/whatever by hand, if I want/need that
- reprobe and cleanup until lrmd and pacemaker all know the new state of the universe
- maintenance-mode off

For simple rename-only operations, as you seem to need, that can easily be scripted.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
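[Editor's note] A rough, untested sketch of how such a rename-only change might be scripted with the crm shell, following the maintenance-mode workflow outlined above; the resource names and temp file path are placeholders:

    # freeze the cluster: nothing is started or stopped while we edit
    crm configure property maintenance-mode=true

    # rename the resource in an offline copy of the configuration
    crm configure show > /tmp/cib.cli
    sed -i 's/\bold_name\b/new_name/g' /tmp/cib.cli
    crm configure load replace /tmp/cib.cli   # delete+create underneath, but resources stay unmanaged

    # let lrmd/pacemaker re-learn the state under the new name
    crm resource cleanup new_name
    crm resource reprobe

    # hand control back to the cluster
    crm configure property maintenance-mode=false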