Re: [Linux-ha-dev] patch: RA conntrackd: Request state info on startup
currently doing another conntrackd project and therefore using the code once again. Found a minor issue: When the active host is fenced and returns to the cluster, it does not request the current connection tracking states. Therefore state information might be lost. This patch fixes that. Any comments?

I'm not sure what you mean by "active host". A node which is running conntrackd, or a node which is running the conntrackd master instance?

Erm, yeah. Sorry for not being precise. I mean the node running the master instance. Successfully tested with Debian squeeze, version 0.9.14.

Looks OK to me. I'll push it to the repository. Thanks.

Dominik
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
[Linux-ha-dev] patch: RA conntrackd: Request state info on startup
Hi people

currently doing another conntrackd project and therefore using the code once again (jippie :)). Found a minor issue: When the active host is fenced and returns to the cluster, it does not request the current connection tracking states. Therefore state information might be lost. This patch fixes that. Any comments?

Successfully tested with debian squeeze version 0.9.14.

Regards
Dominik

conntrackd.patch
Description: Binary data
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
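The patch itself is attached as binary data and not shown in the archive. A minimal sketch of the kind of change being described, assuming the RA's existing OCF_RESKEY_binary/OCF_RESKEY_config parameters and conntrackd's "-n" option (request a resync from the other node); the helper name is made up:

# hypothetical helper, called from conntrackd_start(): when a previously
# fenced node rejoins, pull the peer's current connection tracking state
# instead of waiting for new connections to trickle in
conntrackd_request_state() {
        $OCF_RESKEY_binary -C $OCF_RESKEY_config -n
}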
Re: [Linux-HA] Antw: Re: Forkbomb not initiating failover
Node level failure is detected on the communications layer, i.e. heartbeat or corosync. That software is run with realtime priority, so it keeps working just fine (use tcpdump on the healthy node to verify). So pacemaker on the healthy node does now know that the other node has a problem and therefore does not initiate failover.

We had this discussion back in 2010, maybe you also want to refer to that:
http://oss.clusterlabs.org/pipermail/pacemaker/2010-February/004739.html

Regards
Dominik

On 07/08/2011 03:23 PM, Warnke, Eric E wrote:
If the fork bomb is preventing the system from spawning a health check, it would seem like the most intelligent course of action would be to presume that it failed and act accordingly.
-Eric

On 7/8/11 8:38 AM, Lars Marowsky-Bree l...@suse.de wrote:
On 2011-07-08T14:10:09, Gianluca Cecchi gianluca.cec...@gmail.com wrote:
So that each node has to write to its dedicated part of it and read from the other ones. If one node doesn't update its portion it is then detected by the others and it is fenced after a configurable number of misses... Does pacemaker provide some sort of this configuration?

external/sbd as a fencing mechanism provides this, but that is not the same as a load system health check at all. Though tying into that would make sense, yes.

Regards,
Lars

-- Architect Storage/HA, OPS Engineering, Novell, Inc. SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
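Lars distinguishes sbd-based fencing from a load or system health check. Pacemaker does ship a node-health mechanism that covers part of this; a hedged sketch in crm shell syntax, assuming the ocf:pacemaker:HealthCPU agent and the node-health-strategy property are available in the installed version (resource names are made up):

primitive p_health_cpu ocf:pacemaker:HealthCPU \
        params yellow_limit=50 red_limit=10 \
        op monitor interval=60s
clone cl_health_cpu p_health_cpu
property node-health-strategy=migrate-on-red

Note that a fork bomb can still prevent the health monitor from being spawned at all, which is exactly the limitation discussed above.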
Re: [Linux-HA] Antw: Re: Forkbomb not initiating failover
On 08/29/2011 09:51 AM, Dominik Klein wrote:
Node level failure is detected on the communications layer, ie hearbeat or corosync. That software is run with realtime priority. So it keeps working just fine (use tcpdump on the healthy node to verify). So pacemaker on the healthy node does now know

woops, this was supposed to say "not know"

that the other node has a problem and therefore does not initiate failover. We had this discussion back in 2010, maybe you also want to refer to that:
http://oss.clusterlabs.org/pipermail/pacemaker/2010-February/004739.html

Regards
Dominik
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-ha-dev] Patch: VirtualDomain - fix probe if config is not on shared storage
There did not have to be a negative location constraint up to now, because the cluster took care of that.

Only because it didn't work correctly.

Okay.

Actually, this is a wanted setup. It happened that VM configs were changed in ways that led to a VM not being startable any more. For that case, they wanted to be able to start the old config on the other node.

Please notice _they_ vs. _me_ here :)

Wow! So they can have different configurations at different nodes.

Agreed, wow!

The only issue you may have with this cluster is if the administrator erroneously removes a config on some node, right? And that then, some time afterwards, the cluster does a probe on that node. And then again the cluster wants to fail over this VM to that node. And that at this point in time no other node can run this VM and that it is going to repeatedly try to start and fail. And that start-failure-is-fatal isn't configured. No doubt that this could happen, but what's the probability? And, finally, that doesn't look like a well maintained cluster.

I guess this is something _they_ have to live with then. At first glance, I honestly thought this was a change in the agent that introduced a regression that not only this configuration would hit, but you made me realize that it does not, and that it does improve the agent for sane setups.

My vote goes for your patch, i.e. stop with no config = return SUCCESS.

Thanks
Dominik
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Patch: VirtualDomain - fix probe if config is not on shared storage
On 06/27/2011 11:09 AM, Dejan Muhamedagic wrote:
Hi Dominik,
On Fri, Jun 24, 2011 at 03:50:40PM +0200, Dominik Klein wrote:
Hi Dejan,
this way, the cluster never learns that it can't start a resource on that node.

This resource depends on shared storage. So, the cluster won't try to start it unless the shared storage resource is already running. This is something that needs to be specified using either a negative preference location constraint or an asymmetrical cluster. There's no need for yet another mechanism (the extra parameter) built into the resource agent. It's really overkill.

As requested on IRC, I describe my setup and explain why I think this is a regression.

2 node cluster with a bunch of drbd devices. Each /dev/drbdXX is used as a block device of a VM. The VMs' configuration files are not on shared storage but have to be copied manually. So it happened that during configuration of a VM, the admin forgot to copy the configuration file to node2. The machine's DRBD was configured though. So the cluster decided to promote the VM's DRBD on node2 and then start the master-colocated and ordered VM.

With the agent before the mentioned patch, during probe of a newly configured resource, the cluster would have learned that the VM is not available on one of the nodes (ERR_INSTALLED), so it would never start the resource there. Now it sees NOT_RUNNING on all nodes during probe and may decide to start the VM on a node where it cannot run. That, with the current version of the agent, leads to a failed start, a failed stop during recovery and therefore: an unnecessary stonith operation.

With Dejan's patch, it would still see NOT_RUNNING during probe, but at least the stop would succeed. So the difference to the old version would be that we had an unnecessary failed start on the node that does not have the VM, but it would not harm the node, and I'd be fine with applying that patch. There's a case though that might stop the VM from running (for an amount of time), and that is if start-failure-is-fatal is false. Then we would have $migration-threshold failed start/succeeded stop iterations while the VM's service would not be running.

Of course I do realize that the initial fault is a human one, but the cluster used to protect from this and does not any more, and that's why I think this is a regression. I think the correct way to fix this is to still return ERR_INSTALLED during probe unless the cluster admin configures that the VM's config is on shared storage. Finding out about resource states on different nodes is what the probe was designed to do, was it not? And we work around that in this resource agent just to support certain setups.

Regards
Dominik
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
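The alternative Dejan suggests is to make the placement restriction explicit in the cluster configuration instead of relying on the probe result. A hedged sketch of both variants in crm shell syntax; the resource and node names are made up:

# keep the VM away from the node that lacks its config file
location l_vm1_not_on_node2 vm1 -inf: node2

# or run the cluster asymmetrically and whitelist nodes per resource
property symmetric-cluster=false
location l_vm1_on_node1 vm1 100: node1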
Re: [Linux-ha-dev] Patch: VirtualDomain - fix probe if config is not on shared storage
With the agent before the mentioned patch, during probe of a newly configured resource, the cluster would have learned that the VM is not available on one of the nodes (ERR_INSTALLED), so it would never start the resource there.

This is exactly the problem with shared storage setups, where such an exit code can prevent a resource from ever being started on a node which is otherwise perfectly capable of running that resource.

I see and understand that that, too, is a valid setup and concern. But really, if a resource can _never_ run on a node, then there should be a negative location constraint or the cluster should be set up as asymmetrical. There did not have to be a negative location constraint up to now, because the cluster took care of that.

Now, I understand that in your case, it is actually due to the administrator's fault.

Yes, that's how I noticed the problem with the agent.

This particular setup is a special case of shared storage. The images are on shared storage, but the configurations are local. I think that you really need to make sure that the configurations are present where they need to be. Best would be that the configuration is kept on the storage along with the corresponding VM image. Since you're using a raw device as image, that's obviously not possible. Otherwise, use csync2 or similar to keep files in sync.

Actually, this is a wanted setup. It happened that VM configs were changed in ways that led to a VM not being startable any more. For that case, they wanted to be able to start the old config on the other node. I agree that the cases that led me to finding this change in the agent are cases that could have been solved with better configuration and that your suggestions make sense. Still, I feel that the change introduces a new way of doing things that might affect running and working setups in unintended ways. I refuse to believe that I am the only one doing HA VMs like this (although of course I might be wrong on that, too ...).

Regards
Dominik
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Patch: VirtualDomain - fix probe if config is not on shared storage
I'm not sure my fix is correct. According to
https://github.com/ClusterLabs/resource-agents/commit/96ff8e9ad3d4beca7e063beef156f3b838a798e1#heartbeat/VirtualDomain
this is a regression which was introduced in April '11. So the fix should be the other way around: introduce a parameter that lets the user configure that the config file _is_ on shared storage, and if this is false or unset, return to the old behaviour of returning ERR_INSTALLED.

Regards
Dominik
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
[Linux-ha-dev] Patch: VirtualDomain - fix probe if config is not on shared storage
This fixes the issue described yesterday. Comments?

Regards
Dominik

exporting patch:
# HG changeset patch
# User Dominik Klein dominik.kl...@gmail.com
# Date 1308909599 -7200
# Node ID 2b1615aaca2c90f2f4ab93eb443e5902906fb28a
# Parent  7a11934b142d1daf42a04fbaa0391a3ac47cee4c
RA VirtualDomain: Fix probe if config is not on shared storage

diff -r 7a11934b142d -r 2b1615aaca2c heartbeat/VirtualDomain
--- a/heartbeat/VirtualDomain	Fri Feb 25 12:23:17 2011 +0100
+++ b/heartbeat/VirtualDomain	Fri Jun 24 11:59:59 2011 +0200
@@ -19,9 +19,11 @@
 # Defaults
 OCF_RESKEY_force_stop_default=0
 OCF_RESKEY_hypervisor_default="$(virsh --quiet uri)"
+OCF_RESKEY_config_on_shared_storage_default=1
 
 : ${OCF_RESKEY_force_stop=${OCF_RESKEY_force_stop_default}}
 : ${OCF_RESKEY_hypervisor=${OCF_RESKEY_hypervisor_default}}
+: ${OCF_RESKEY_config_on_shared_storage=${OCF_RESKEY_config_on_shared_storage_default}}
 
 ###
 ## I'd very much suggest to make this RA use bash,
@@ -421,8 +423,8 @@
 	# check if we can read the config file (otherwise we're unable to
 	# deduce $DOMAIN_NAME from it, see below)
 	if [ ! -r $OCF_RESKEY_config ]; then
-		if ocf_is_probe; then
-			ocf_log info "Configuration file $OCF_RESKEY_config not readable during probe."
+		if ocf_is_probe && ocf_is_true $OCF_RESKEY_config_on_shared_storage; then
+			ocf_log info "Configuration file $OCF_RESKEY_config not readable during probe. Assuming it is on shared storage and therefore reporting VM is not running."
 		else
 			ocf_log error "Configuration file $OCF_RESKEY_config does not exist or is not readable."
 			return $OCF_ERR_INSTALLED

exporting patch:
# HG changeset patch
# User Dominik Klein dominik.kl...@gmail.com
# Date 1308911272 -7200
# Node ID 312adf2449eb59dcc41686626b1726428d13227b
# Parent  2b1615aaca2c90f2f4ab93eb443e5902906fb28a
RA VirtualDomain: Add metadata for the new parameter

diff -r 2b1615aaca2c -r 312adf2449eb heartbeat/VirtualDomain
--- a/heartbeat/VirtualDomain	Fri Jun 24 11:59:59 2011 +0200
+++ b/heartbeat/VirtualDomain	Fri Jun 24 12:27:52 2011 +0200
@@ -119,6 +119,16 @@
 <content type="string" default="" />
 </parameter>
 
+<parameter name="config_on_shared_storage" unique="0" required="0">
+<longdesc lang="en">
+If your VMs configuration file is _not_ on shared storage, so that the config
+file not being in place during a probe means that the VM is not installed/runnable
+on that node, set this to 0.
+</longdesc>
+<shortdesc lang="en">Set to 0 if your VMs config file is not on shared storage</shortdesc>
+<content type="boolean" default="1" />
+</parameter>
+
 </parameters>
 
 <actions>
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
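If this patch were applied, using the new parameter from the crm shell might look like the following hedged sketch (the resource name and config path are made up). With config_on_shared_storage=0, a probe on a node that lacks the config file reports ERR_INSTALLED again instead of NOT_RUNNING, so the cluster never tries to start the VM there:

primitive vm1 ocf:heartbeat:VirtualDomain \
        params config="/etc/libvirt/qemu/vm1.xml" config_on_shared_storage="0" \
        op monitor interval="30s" timeout="30s"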
Re: [Linux-ha-dev] Patch: VirtualDomain - fix probe if config is not on shared storage
Hi Dejan, this way, the cluster never learns that it can't start a resource on that node. I don't consider this a solution. Regards Dominik ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
[Linux-ha-dev] VirtualDomain issue
Hi

Code snippet from http://hg.linux-ha.org/agents/raw-file/7a11934b142d/heartbeat/VirtualDomain (which I believe is the current version):

VirtualDomain_Validate_All() {
<snip>
	if [ ! -r $OCF_RESKEY_config ]; then
		if ocf_is_probe; then
			ocf_log info "Configuration file $OCF_RESKEY_config not readable during probe."
		else
			ocf_log error "Configuration file $OCF_RESKEY_config does not exist or is not readable."
			return $OCF_ERR_INSTALLED
		fi
	fi
}
<snip>
VirtualDomain_Validate_All || exit $?
<snip>
if ocf_is_probe && [ ! -r $OCF_RESKEY_config ]; then
	exit $OCF_NOT_RUNNING
fi

So, say one node does not have the config, but the cluster decides to run the VM on that node. The probe returns NOT_RUNNING, so the cluster tries to start the VM. That start returns ERR_INSTALLED, the cluster has to try to recover from the start failure, so it stops it, but that stop op returns ERR_INSTALLED as well, so we need to be stonith'd.

I think this is wrong behaviour. I read the comments about configurations being on shared storage which might not be available at certain points in time and I see the point. But the way this is implemented clearly does not work for everybody. I vote for making this configurable. Unfortunately, due to several reasons, I am not able to contribute this patch myself at the moment.

Regards
Dominik
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-HA] How to start Pacemaker in unmanaged mode ?
Correct me if I'm wrong, but I strongly think the following would work: Since you want to _start_ pacemaker in unmanaged mode, I expect all your nodes to be offline. Then delete cib.xml on all nodes but one. On the one remaining, edit cib.xml and put your configuration (with is_managed="false") there. Then start all nodes.

Don't see why this shouldn't work.

Regards
Dominik

On 05/10/2011 05:08 PM, Alain.Moulle wrote:
I don't think it is authorized at all, we must never write directly in cib.xml. I tried for other needs and it systematically disturbs Pacemaker a lot! We always have to go through some crm commands or similar (crm_attributes, etc.) but they are not taken into account before the 60s are ended.
Alain

Dominik Klein wrote:
Just write it to the xml on all nodes?

On 05/10/2011 01:23 PM, Alain.Moulle wrote:
Sorry I meant directly with is_managed=false of course!
Alain
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems

-- IN-telegence GmbH Oskar-Jäger-Str. 125 50825 Köln Registergericht AG Köln - HRB 34038 USt-ID DE210882245 Geschäftsführende Gesellschafter: Christian Plätke und Holger Jansen
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] How to start Pacemaker in unmanaged mode ?
On 05/11/2011 10:24 AM, Alain.Moulle wrote:
Hi Dominik,
I just have tried again:
- service corosync stop on both nodes node1 and node2
- remove the cib.xml on node2
- vi cib.xml on node1, set the property maintenance-mode=true
  (<nvpair id="cib-bootstrap-options-maintenance-mode" name="maintenance-mode" value="true"/>), wq!
- start Pacemaker on node1 and on node2

On node1: both nodes remain UNCLEAN (offline) ad vitam aeternam.
On node2: crm_mon: "Attempting connection to the cluster." ad vitam aeternam.

Moreover, the added line for maintenance-mode=true has been removed by Pacemaker/corosync in the cib.xml. I tried a long time ago to vi and change some values in the cib.xml, and each time I did that, Pacemaker never started correctly again and I had to reconfigure all the things from scratch so that it would start again...

I have to admit I only did this back in heartbeat times and so it seems something changed regarding this. So ...

Just had a look at a cluster of mine and there are backup copies and .sig files for each cib.xml version. I don't know what pacemaker will do if you just a) removed the history and .sig file, so only cib.xml is in place, or b) replaced the (apparently md5) checksum in .sig.

Worth a shot I think.

Dominik
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
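Once at least one node is up again, the supported route is to set the property through the admin tools rather than editing cib.xml by hand; two hedged equivalents:

crm configure property maintenance-mode=true
# or with the lower-level tool
crm_attribute -t crm_config -n maintenance-mode -v true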
Re: [Linux-HA] How to start Pacemaker in unmanaged mode ?
Just write it to the xml on all nodes? On 05/10/2011 01:23 PM, Alain.Moulle wrote: Sorry I meant directly with is_managed=false of course ! Alain ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] ocf:pacemaker:ping: dampen
It waits $dampen before changes are pushed to the cib, so that occasional ICMP hiccups do not produce an unintended failover. At least that's my understanding.

Regards
Dominik

On 04/29/2011 09:22 AM, Ulrich Windl wrote:
Hi,
I think the description for dampen in OCF:pacemaker:ping (pacemaker-1.1.5-5.5.5 of SLES11 SP1) is too terse:

<parameter name="dampen" unique="0">
<longdesc lang="en">
The time to wait (dampening) further changes occur
</longdesc>
<shortdesc lang="en">Dampening interval</shortdesc>
<content type="integer" default="5s"/>
</parameter>

What does that do?

Regards,
Ulrich
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
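A hedged example of where dampen sits in a typical connectivity check (crm shell syntax; the address, scores and resource names are made up). Attribute updates from the ping monitor are held back for the dampen interval, so a single lost ping round does not immediately trigger the location rule:

primitive p_ping ocf:pacemaker:ping \
        params host_list="192.168.1.254" multiplier="1000" dampen="30s" \
        op monitor interval="10s" timeout="60s"
clone cl_ping p_ping
location l_need_connectivity g_services \
        rule -inf: not_defined pingd or pingd lte 0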
Re: [Linux-HA] ocf:pacemaker:ping: dampen
correcto wow. again! :) ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-ha-dev] New OCF RA: symlink
Am I too paranoid?

I don't think you are. Some non-root user practically being able to remove any file is certainly a valid concern.

Thing is: I needed an RA that configured a cronjob. Florian suggested writing the symlink RA instead, which could manage symlinks. Apparently there was an IRC discussion a couple weeks ago that I was not a part of. So while the symlink RA could also do what I needed, I tried to write that instead of the cronjob RA (which will also come since it will cover some more functions than this one, but that's another story).

So anyway, maybe those involved in the first discussion can comment on this, too, and share thoughts on how to solve things. Maybe they had already addressed these situations.

Regards
Dominik
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] libglue2 dependency missing in cluster-glue
Mornin Dejan, The reason was that libglue2 and cluster-glue were not installed from the clusterlabs repository, as the rest of the packages were, but instead they were pulled from the original opensuse repository in an older version. This is what I found in pacemaker.spec.in in the repository: Requires(pre): cluster-glue = 1.0.6 Which version of glue was that older version? 0.9.1 Whoa. Can't recall ever seeing that thing. rpm -qRp cluster-glue-0.9-2.1.x86_64.rpm /usr/sbin/groupadd /usr/bin/getent /usr/sbin/useradd /bin/sh rpmlib(PayloadFilesHavePrefix) = 4.0-1 rpmlib(CompressedFileNames) = 3.0.4-1 /bin/bash /bin/sh /usr/bin/perl /usr/bin/python libc.so.6()(64bit) libc.so.6(GLIBC_2.2.5)(64bit) libc.so.6(GLIBC_2.4)(64bit) libcurl.so.4()(64bit) libglib-2.0.so.0()(64bit) liblrm.so.2()(64bit) libnetsnmp.so.15()(64bit) libpils.so.2()(64bit) libplumb.so.2()(64bit) libplumbgpl.so.2()(64bit) libstonith.so.1()(64bit) libxml2.so.2()(64bit) rpmlib(PayloadIsLzma) = 4.4.6-1 That's the old package, from opensuse. Here's the new one (106 from clusterlabs' opensuse 11.2 repository): /usr/sbin/groupadd /usr/bin/getent /usr/sbin/useradd /bin/sh /bin/sh /bin/sh /bin/sh rpmlib(PayloadFilesHavePrefix) = 4.0-1 rpmlib(CompressedFileNames) = 3.0.4-1 /bin/bash /bin/sh /usr/bin/env /usr/bin/perl /usr/bin/python libOpenIPMI.so.0()(64bit) libOpenIPMIposix.so.0()(64bit) libOpenIPMIutils.so.0()(64bit) libbz2.so.1()(64bit) libc.so.6()(64bit) libc.so.6(GLIBC_2.2.5)(64bit) libc.so.6(GLIBC_2.4)(64bit) libcrypto.so.0.9.8()(64bit) libcurl.so.4()(64bit) libdl.so.2()(64bit) libglib-2.0.so.0()(64bit) liblrm.so.2()(64bit) libltdl.so.7()(64bit) libm.so.6()(64bit) libnetsnmp.so.15()(64bit) libopenhpi.so.2()(64bit) libpils.so.2()(64bit) libplumb.so.2()(64bit) libplumbgpl.so.2()(64bit) librt.so.1()(64bit) libstonith.so.1()(64bit) libuuid.so.1()(64bit) libxml2.so.2()(64bit) libz.so.1()(64bit) rpmlib(PayloadIsLzma) = 4.4.6-1 I don't see libglue there. Regards Dominik ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] libglue2 dependency missing in cluster-glue
This is what I found in pacemaker.spec.in in the repository: Requires(pre): cluster-glue = 1.0.6 The 1.0.10 rpm from clusterlabs for opensuse 11.2 just says cluster-glue afaict: rpm -qR pacemaker cluster-glue resource-agents python = 2.4 libpacemaker3 = 1.0.10-1.4 libesmtp net-snmp rpmlib(PayloadFilesHavePrefix) = 4.0-1 rpmlib(CompressedFileNames) = 3.0.4-1 /bin/bash /bin/sh /usr/bin/env /usr/bin/python libbz2.so.1()(64bit) libc.so.6()(64bit) libc.so.6(GLIBC_2.2.5)(64bit) libc.so.6(GLIBC_2.3)(64bit) libc.so.6(GLIBC_2.4)(64bit) libccmclient.so.1()(64bit) libcib.so.1()(64bit) libcoroipcc.so.4()(64bit) libcrmcluster.so.1()(64bit) libcrmcommon.so.2()(64bit) libcrypt.so.1()(64bit) libcrypto.so.0.9.8()(64bit) libdl.so.2()(64bit) libesmtp.so.5()(64bit) libgcrypt.so.11()(64bit) libglib-2.0.so.0()(64bit) libgnutls.so.26()(64bit) libgnutls.so.26(GNUTLS_1_4)(64bit) libgpg-error.so.0()(64bit) libhbclient.so.1()(64bit) liblrm.so.2()(64bit) libltdl.so.7()(64bit) libm.so.6()(64bit) libncurses.so.5()(64bit) libnetsnmp.so.15()(64bit) libnetsnmpagent.so.15()(64bit) libnetsnmphelpers.so.15()(64bit) libnetsnmpmibs.so.15()(64bit) libpam.so.0()(64bit) libpam.so.0(LIBPAM_1.0)(64bit) libpe_rules.so.2()(64bit) libpe_status.so.2()(64bit) libpengine.so.3()(64bit) libperl.so()(64bit) libpils.so.2()(64bit) libplumb.so.2()(64bit) libpopt.so.0()(64bit) libpthread.so.0()(64bit) libpthread.so.0(GLIBC_2.2.5)(64bit) librpm.so.0()(64bit) librpmio.so.0()(64bit) librt.so.1()(64bit) libsensors.so.3()(64bit) libstonith.so.1()(64bit) libstonithd.so.0()(64bit) libtransitioner.so.1()(64bit) libwrap.so.0()(64bit) libxml2.so.2()(64bit) libxslt.so.1()(64bit) libz.so.1()(64bit) rpmlib(PayloadIsLzma) = 4.4.6-1 Regards Dominik ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
[Linux-ha-dev] libglue2 dependency missing in cluster-glue
Hi

As some of you might have seen on the pacemaker list, I tried to install a 3 node cluster and there were IPC issues reported by the cib, and therefore the cluster could not start correctly.

The reason was that libglue2 and cluster-glue were not installed from the clusterlabs repository, as the rest of the packages were, but instead they were pulled from the original opensuse repository in an older version. So I went and updated cluster-glue with the version from the clusterlabs repository. Nothing changed though. rpm -qa | grep glue revealed that libglue2 was still the old version while cluster-glue was updated.

Looking at the package dependencies, I think the problem is that cluster-glue does not depend on the package libglue2 (while they do the other way around).

So one error, which I could improve, was that the installation instructions on the clusterlabs site did not mention libglue2 and cluster-glue. They do now, which should prevent this for others who follow those instructions. The dependency thing is up for grabs ;)

Regards
Dominik
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] libglue2 dependency missing in cluster-glue
Hi Dejan

The reason was that libglue2 and cluster-glue were not installed from the clusterlabs repository, as the rest of the packages were, but instead they were pulled from the original opensuse repository in an older version.

This is what I found in pacemaker.spec.in in the repository:
Requires(pre): cluster-glue >= 1.0.6

Which version of glue was that older version?

0.9.1

So you're saying pacemaker depends on cluster-glue >= 1.0.6. Well, that was not installed when I installed pacemaker. And I did not use --nodeps or any such thing. Instead, that old version was installed from the original opensuse repositories.

So I went and updated cluster-glue with the version from the clusterlabs repository. Nothing changed though. rpm -qa | grep glue revealed that libglue2 was still the old version while cluster-glue was updated. Looking at the package dependencies, I think the problem is that cluster-glue does not depend on the package libglue2 (while they do the other way around).

Yes, I guess that that should be fixed.

Regards
Dominik
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
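The fix being asked for is a dependency declaration in the cluster-glue package itself, so that updating cluster-glue pulls in the matching libglue2. A hedged sketch of what that could look like in the spec file; the exact macro usage depends on how the spec is structured:

# hypothetical addition to cluster-glue.spec
Requires: libglue2 = %{version}-%{release}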
Re: [Linux-HA] stonith + APC Masterswitch (AP9225 + AP9616)
You could also try apcmastersnmp. Got that to work with apc devices which did not work with the telnet thing. As long as they didn't change mibs (which I don't know whether they have). Might be worth a shot. Regards Dominik On 02/25/2011 02:24 AM, Avestan wrote: Hello Dejan, As I am trying to use the patch that you pointed, I have got the following: [root@server1 ~]# patch /usr/src/apcmaster.c.patch can't find file to patch at input line 3 Perhaps you should have used the -p or --strip option? The text leading up to this was: -- |--- apcmaster_orig.c 2007-12-21 09:32:27.0 -0600 |+++ apcmaster.c2008-02-19 18:24:48.0 -0600 -- File to patch: [root@server1~]# Could you tell me which file I am trying to patch? The closet file that I can find which may need patching is: /usr/lib/stonith/plugins/stonith2/apcmaster.so Thank you in advance for your help, Dejan. Avestan Dejan Muhamedagic wrote: Hi, On Wed, Feb 23, 2011 at 06:44:02PM -0800, Avestan wrote: Hello everyone, I have been trying to get STONITH work with the MasterSwitch Plus (AP9225 + AP9617) with very little success. I have to mention that I am able to access the MasterSwitch Plus via either serial port or Ethernet port with no issue. But when I try the stonith command, it appears that there is some issue with the login. [root@server1 ~]# stonith -d 1 -t apcmaster -p 192.168.1.50 apc apc -l ** (process:4930): DEBUG: NewPILPluginUniv(0xa03b008) ** (process:4930): DEBUG: PILS: Plugin path = /usr/lib/stonith/plugins:/usr/lib/pils/plugins ** (process:4930): DEBUG: NewPILInterfaceUniv(0xa03c910) ** (process:4930): DEBUG: NewPILPlugintype(0xa03ca68) ** (process:4930): DEBUG: NewPILPlugin(0xa03cdc8) ** (process:4930): DEBUG: NewPILInterface(0xa03cb00) ** (process:4930): DEBUG: NewPILInterface(0xa03cb00:InterfaceMgr/InterfaceMgr)*** user_data: 0x0 *** ** (process:4930): DEBUG: InterfaceManager_plugin_init(0xa03cb00/InterfaceMgr) ** (process:4930): DEBUG: Registering Implementation manager for Interface type 'InterfaceMgr' ** (process:4930): DEBUG: PILS: Looking for InterfaceMgr/generic = [/usr/lib/stonith/plugins/InterfaceMgr/generic.so] ** (process:4930): DEBUG: Plugin file /usr/lib/stonith/plugins/InterfaceMgr/generic.so does not exist ** (process:4930): DEBUG: PILS: Looking for InterfaceMgr/generic = [/usr/lib/pils/plugins/InterfaceMgr/generic.so] ** (process:4930): DEBUG: Plugin path for InterfaceMgr/generic = [/usr/lib/pils/plugins/InterfaceMgr/generic.so] ** (process:4930): DEBUG: PluginType InterfaceMgr already present ** (process:4930): DEBUG: Plugin InterfaceMgr/generic init function: InterfaceMgr_LTX_generic_pil_plugin_init ** (process:4930): DEBUG: NewPILPlugin(0xa03d690) ** (process:4930): DEBUG: Plugin InterfaceMgr/generic loaded and constructed. ** (process:4930): DEBUG: Calling init function in plugin InterfaceMgr/generic. 
** (process:4930): DEBUG: NewPILInterface(0xa03d630) ** (process:4930): DEBUG: NewPILInterface(0xa03d630:InterfaceMgr/stonith2)*** user_data: 0xa03ce80 *** ** (process:4930): DEBUG: Registering Implementation manager for Interface type 'stonith2' ** (process:4930): DEBUG: IfIncrRefCount(1 + 1 ) ** (process:4930): DEBUG: PluginIncrRefCount(0 + 1 ) ** (process:4930): DEBUG: IfIncrRefCount(1 + 100 ) ** (process:4930): DEBUG: PILS: Looking for stonith2/apcmaster = [/usr/lib/stonith/plugins/stonith2/apcmaster.so] ** (process:4930): DEBUG: Plugin path for stonith2/apcmaster = [/usr/lib/stonith/plugins/stonith2/apcmaster.so] ** (process:4930): DEBUG: Creating PluginType for stonith2 ** (process:4930): DEBUG: NewPILPlugintype(0xa03db70) ** (process:4930): DEBUG: Plugin stonith2/apcmaster init function: stonith2_LTX_apcmaster_pil_plugin_init ** (process:4930): DEBUG: NewPILPlugin(0xa03dcf0) ** (process:4930): DEBUG: Plugin stonith2/apcmaster loaded and constructed. ** (process:4930): DEBUG: Calling init function in plugin stonith2/apcmaster. ** (process:4930): DEBUG: NewPILInterface(0xa03dc38) ** (process:4930): DEBUG: NewPILInterface(0xa03dc38:stonith2/apcmaster)*** user_data: 0x26da2c *** ** (process:4930): DEBUG: IfIncrRefCount(101 + 1 ) ** (process:4930): DEBUG: PluginIncrRefCount(0 + 1 ) ** (process:4930): DEBUG: Got '\xff' ** (process:4930): DEBUG: Got '\xfb' ** (process:4930): DEBUG: Got '\u0001' ** (process:4930): DEBUG: Got ' ' ** (process:4930): DEBUG: Got '\u000d' ** (process:4930): DEBUG: Got 'U' ** (process:4930): DEBUG: Got 's' ** (process:4930): DEBUG: Got 'e' ** (process:4930): DEBUG: Got 'r' ** (process:4930): DEBUG: Got ' ' ** (process:4930): DEBUG: Got 'N' ** (process:4930): DEBUG: Got 'a' ** (process:4930): DEBUG: Got 'm' ** (process:4930): DEBUG: Got 'e' ** (process:4930): DEBUG: Got ' ' ** (process:4930): DEBUG: Got ':' ** (process:4930): DEBUG: Got ' ' ** (process:4930): CRITICAL **:
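If the SNMP route works for this hardware, the cluster side is small. A hedged crm shell sketch (address, community and resource name are made up; the plugin's actual parameter names can be listed with "stonith -t apcmastersnmp -n"):

primitive st_apc stonith:apcmastersnmp \
        params ipaddr="192.168.1.50" port="161" community="private" \
        op monitor interval="3600s" timeout="60s"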
Re: [Linux-ha-dev] Feedback on conntrackd RA by Dominik Klein
Thanks for inclusion. While looking through the pushed changes, I spotted two meta-data typos. See trivial patch.

Regards
Dominik

Applied and pushed with two minor edits. Thanks a lot!

Cheers,
Florian

--- conntrackd.orig	2011-02-14 11:43:22.000000000 +0100
+++ conntrackd	2011-02-14 11:43:42.000000000 +0100
@@ -57,7 +57,7 @@
 <longdesc lang="en">Name of the conntrackd executable.
 If conntrackd is installed and available in the default PATH, it is sufficient to configure the name of the binary
 For example my-conntrackd-binary-version-0.9.14
-If conntrackd is installed somehwere else, you may also give a full path
+If conntrackd is installed somewhere else, you may also give a full path
 For example /packages/conntrackd-0.9.14/sbin/conntrackd
 </longdesc>
 <shortdesc lang="en">Name of the conntrackd executable</shortdesc>
@@ -66,7 +66,7 @@
 
 <parameter name="config">
 <longdesc lang="en">Full path to the conntrackd.conf file.
-For example /packages/conntrackd-0.9.4/etc/conntrackd/conntrackd.conf</longdesc>
+For example /packages/conntrackd-0.9.14/etc/conntrackd/conntrackd.conf</longdesc>
 <shortdesc lang="en">Path to conntrackd.conf</shortdesc>
 <content type="string" default="$OCF_RESKEY_config_default"/>
 </parameter>
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Feedback on conntrackd RA by Dominik Klein
Hi Florian it appears that the RA is good to be merged with just a few changes left to be done. Great! * Please fix the initialization to honor $OCF_FUNCTIONS_DIR and ditch the redundant locale initialization. done * Please rename the parameters to follow the precendents set by other RAs (binary instead of conntrackd, config instead of conntrackdconf). done * Please don't require people to set a full path to the conntrackd binary, honoring $PATH is expected. I don't see where I do that. At least code-wise I never did that. Did you mean the meta-data? * Please set defaults the way the other RAs do, rather than with your if [ -z OCF_RESKEY_whatever ] logic. done * Please define the default path to your statefile in relative to ${HA_RSCTMP}. Also, put ${OCF_RESOURCE_INSTANCE} in the filename. done * Actually, rather than managing your statefile manually, you might be able to just use ha_pseudo_resource(). done nice function btw :) * Please revise your timeouts. Is a 240-second minimum timeout on start not a bit excessive? Sure is. Copy and paste leftover. Changed to 30. * Please revise your metadata, specifically your longdescs. The more useful information you provide to users, the better. Recall that that information is readily available to users via the man pages and crm ra info. done Regards Dominik --- conntrackd 2011-02-10 12:23:37.054678924 +0100 +++ conntrackd.fghaas 2011-02-11 09:45:39.721300359 +0100 @@ -4,7 +4,7 @@ # An OCF RA for conntrackd # http://conntrack-tools.netfilter.org/ # -# Copyright (c) 2010 Dominik Klein +# Copyright (c) 2011 Dominik Klein # # This program is free software; you can redistribute it and/or modify # it under the terms of version 2 of the GNU General Public License as @@ -25,11 +25,19 @@ # along with this program; if not, write the Free Software Foundation, # Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA. # + ### # Initialization: -. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs -export LANG=C LANGUAGE=C LC_ALL=C +: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat} +. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs + +### + +OCF_RESKEY_binary_default=/usr/sbin/conntrackd +OCF_RESKEY_config_default=/etc/conntrackd/conntrackd.conf +: ${OCF_RESKEY_binary=${OCF_RESKEY_binary_default}} +: ${OCF_RESKEY_config=${OCF_RESKEY_config_default}} meta_data() { cat END @@ -46,30 +54,30 @@ parameters parameter name=conntrackd -longdesc lang=enFull path to conntrackd executable/longdesc -shortdesc lang=enFull path to conntrackd executable/shortdesc -content type=string default=/usr/sbin/conntrackd/ +longdesc lang=enName of the conntrackd executable. +If conntrackd is installed and available in the default PATH, it is sufficient to configure the name of the binary +For example my-conntrackd-binary-version-0.9.14 +If conntrackd is installed somehwere else, you may also give a full path +For example /packages/conntrackd-0.9.14/sbin/conntrackd +/longdesc +shortdesc lang=enName of the conntrackd executable/shortdesc +content type=string default=$OCF_RESKEY_binary_default/ /parameter -parameter name=conntrackdconf -longdesc lang=enFull path to the conntrackd.conf file./longdesc +parameter name=config +longdesc lang=enFull path to the conntrackd.conf file. 
+For example /packages/conntrackd-0.9.4/etc/conntrackd/conntrackd.conf/longdesc shortdesc lang=enPath to conntrackd.conf/shortdesc -content type=string default=/etc/conntrackd/conntrackd.conf/ -/parameter - -parameter name=statefile -longdesc lang=enFull path to the state file you wish to use./longdesc -shortdesc lang=enFull path to the state file you wish to use./shortdesc -content type=string default=/var/run/conntrackd.master/ +content type=string default=$OCF_RESKEY_config_default/ /parameter /parameters actions -action name=start timeout=240 / -action name=promote timeout=90 / -action name=demote timeout=90 / -action name=notify timeout=90 / -action name=stoptimeout=100 / +action name=start timeout=30 / +action name=promote timeout=30 / +action name=demote timeout=30 / +action name=notify timeout=30 / +action name=stoptimeout=30 / action name=monitor depth=0 timeout=20 interval=20 role=Slave / action name=monitor depth=0 timeout=20 interval=10 role=Master / action name=meta-data timeout=5 / @@ -94,11 +102,7 @@ conntrackd_is_master() { # You can't query conntrackd whether it is master or slave. It can be both at the same time. # This RA creates a statefile during promote and enforces master-max=1 and clone-node-max=1 - if [ -e $STATEFILE ]; then - return $OCF_SUCCESS - else - return $OCF_ERR_GENERIC - fi + ha_pseudo_resource $statefile monitor } conntrackd_set_master_score() { @@ -108,11 +112,11 @@ conntrackd_monitor() { rc=$OCF_NOT_RUNNING # It does not write a PID file, so check
Re: [Linux-ha-dev] Feedback on conntrackd RA by Dominik Klein
Maybe you applied the s/100/$slavescore patch someone sent a couple weeks ago. I used the last version from thread New stateful RA: conntrackd dated october 27th 3:29pm. Anyway, here's my version. Regards Dominik On 02/11/2011 01:36 PM, Florian Haas wrote: On 2011-02-11 09:48, Dominik Klein wrote: Hi Florian it appears that the RA is good to be merged with just a few changes left to be done. Great! [lots of exemplary role-model patch modifications] Regards Dominik Thanks! For some reason the patch does not apply in my checkout. Can you just send me your version? I'll figure it out then. Cheers, Florian #!/bin/bash # # # An OCF RA for conntrackd # http://conntrack-tools.netfilter.org/ # # Copyright (c) 2011 Dominik Klein # # This program is free software; you can redistribute it and/or modify # it under the terms of version 2 of the GNU General Public License as # published by the Free Software Foundation. # # This program is distributed in the hope that it would be useful, but # WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. # # Further, this software is distributed without any warranty that it is # free of the rightful claim of any third person regarding infringement # or the like. Any license provided herein, whether implied or # otherwise, applies only to this software file. Patent licenses, if # any, provided herein do not apply to combinations of this program with # other software, or any other product whatsoever. # # You should have received a copy of the GNU General Public License # along with this program; if not, write the Free Software Foundation, # Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA. # ### # Initialization: : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat} . ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs ### OCF_RESKEY_binary_default=/usr/sbin/conntrackd OCF_RESKEY_config_default=/etc/conntrackd/conntrackd.conf : ${OCF_RESKEY_binary=${OCF_RESKEY_binary_default}} : ${OCF_RESKEY_config=${OCF_RESKEY_config_default}} meta_data() { cat END ?xml version=1.0? !DOCTYPE resource-agent SYSTEM ra-api-1.dtd resource-agent name=conntrackd version1.1/version longdesc lang=en Master/Slave OCF Resource Agent for conntrackd /longdesc shortdesc lang=enThis resource agent manages conntrackd/shortdesc parameters parameter name=conntrackd longdesc lang=enName of the conntrackd executable. If conntrackd is installed and available in the default PATH, it is sufficient to configure the name of the binary For example my-conntrackd-binary-version-0.9.14 If conntrackd is installed somehwere else, you may also give a full path For example /packages/conntrackd-0.9.14/sbin/conntrackd /longdesc shortdesc lang=enName of the conntrackd executable/shortdesc content type=string default=$OCF_RESKEY_binary_default/ /parameter parameter name=config longdesc lang=enFull path to the conntrackd.conf file. 
For example /packages/conntrackd-0.9.4/etc/conntrackd/conntrackd.conf/longdesc shortdesc lang=enPath to conntrackd.conf/shortdesc content type=string default=$OCF_RESKEY_config_default/ /parameter /parameters actions action name=start timeout=30 / action name=promote timeout=30 / action name=demote timeout=30 / action name=notify timeout=30 / action name=stoptimeout=30 / action name=monitor depth=0 timeout=20 interval=20 role=Slave / action name=monitor depth=0 timeout=20 interval=10 role=Master / action name=meta-data timeout=5 / action name=validate-all timeout=30 / /actions /resource-agent END } meta_expect() { local what=$1 whatvar=OCF_RESKEY_CRM_meta_${1//-/_} op=$2 expect=$3 local val=${!whatvar} if [[ -n $val ]]; then # [, not [[, or it won't work ;) [ $val $op $expect ] return fi ocf_log err meta parameter misconfigured, expected $what $op $expect, but found ${val:-unset}. exit $OCF_ERR_CONFIGURED } conntrackd_is_master() { # You can't query conntrackd whether it is master or slave. It can be both at the same time. # This RA creates a statefile during promote and enforces master-max=1 and clone-node-max=1 ha_pseudo_resource $statefile monitor } conntrackd_set_master_score() { ${HA_SBIN_DIR}/crm_master -Q -l reboot -v $1 } conntrackd_monitor() { rc=$OCF_NOT_RUNNING # It does not write a PID file, so check with pgrep pgrep -f $OCF_RESKEY_binary rc=$OCF_SUCCESS if [ $rc -eq $OCF_SUCCESS ]; then # conntrackd is running # now see if it acceppts queries if ! $OCF_RESKEY_binary -C $OCF_RESKEY_config -s /dev/null 21; then rc=$OCF_ERR_GENERIC ocf_log err conntrackd is running but not responding
Re: [Linux-ha-dev] Feedback on conntrackd RA by Dominik Klein
Not yet. That's why I wrote soon_-ish_ ;) Any release coming up you want to include this in? any news on this? Cheers, Florian ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] Report on conntrackd RA
Hi

Thanks for testing and feedback.

On 01/27/2011 01:37 PM, Marjan Blatnik wrote:
Conntrackd RA from Dominik Klein works. We can now successfully migrate/fail from one node to another one. At the beginning, we had problems with failing. After reboot/fail, the slave was not synced with the master. After some debugging I found that conntrackd must not be started at boot time, but only by pacemaker.

Like any other program managed by the cluster.

Regards
Dominik

My mistake. After disabling the conntrackd boot script, failing works perfectly. If conntrackd on the slave is started by the init script, then the master does not issue a conntrackd notify with OCF_RESKEY_CRM_meta_notify_type=post and OCF_RESKEY_CRM_meta_notify_operation=start, and does not send a bulk update to the slave. The master does issue a conntrackd notify with OCF_RESKEY_CRM_meta_notify_type set to pre, but since conntrackd on the slave is already running, there is no post phase, which sends the bulk update to the slave. OCF_RESKEY_CRM_meta_notify_type could be ignored and the bulk update sent twice, but it's better to have conntrackd controlled only by pacemaker.
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
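The practical consequence of the report above: when the cluster manages conntrackd, the distribution's boot script has to be disabled so that only pacemaker starts the daemon. A hedged example; the exact command depends on the distribution:

# Debian/Ubuntu
update-rc.d -f conntrackd remove
# RHEL/CentOS/SLES
chkconfig conntrackd off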
Re: [Linux-ha-dev] Feedback on conntrackd RA by Dominik Klein
Just now found this thread. I will include the suggested changes and post the new RA soon-ish.

Dominik

On 01/21/2011 08:26 AM, Florian Haas wrote:
On 01/18/2011 04:21 PM, Florian Haas wrote:
Our site will shortly be deploying a new HA firewall based on Linux, iptables, pacemaker and conntrackd. conntrackd[1] is used to maintain connection state of active connections across the two firewalls, allowing us to fail over from one firewall to the other without dropping any connections. In order to achieve this with pacemaker we needed to find a resource agent for conntrackd. Looking at the mailing list we found a couple of options, although we only fully evaluated the RA produced by Dominik Klein as it appears to be more feature complete than the alternative. For a full description of his RA please see his original thread[2]. So far throughout testing we have been very pleased with it. We can successfully fail between our nodes and the RA correctly handles the synchronisation steps required in the background.

Dominik, it appears that the RA is good to be merged with just a few changes left to be done.

* Please fix the initialization to honor $OCF_FUNCTIONS_DIR and ditch the redundant locale initialization.
* Please rename the parameters to follow the precedents set by other RAs (binary instead of conntrackd, config instead of conntrackdconf).
* Please don't require people to set a full path to the conntrackd binary, honoring $PATH is expected.
* Please set defaults the way the other RAs do, rather than with your "if [ -z OCF_RESKEY_whatever ]" logic.
* Please define the default path to your statefile relative to ${HA_RSCTMP}. Also, put ${OCF_RESOURCE_INSTANCE} in the filename.
* Actually, rather than managing your statefile manually, you might be able to just use ha_pseudo_resource().
* Please revise your timeouts. Is a 240-second minimum timeout on start not a bit excessive?
* Please revise your metadata, specifically your longdescs. The more useful information you provide to users, the better. Recall that that information is readily available to users via the man pages and crm ra info.

Thanks!

Cheers,
Florian

-- IN-telegence GmbH Oskar-Jäger-Str. 125 50825 Köln Registergericht AG Köln - HRB 34038 USt-ID DE210882245 Geschäftsführende Gesellschafter: Christian Plätke und Holger Jansen
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
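For readers not familiar with the conventions Florian refers to, this is roughly the initialization and defaults pattern used by the other agents in the repository; a hedged sketch, not the merged RA (the default values shown are illustrative):

# honor OCF_FUNCTIONS_DIR when sourcing the OCF shell functions
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs

# per-parameter defaults, set the standard way
OCF_RESKEY_binary_default="conntrackd"
OCF_RESKEY_config_default="/etc/conntrackd/conntrackd.conf"
: ${OCF_RESKEY_binary=${OCF_RESKEY_binary_default}}
: ${OCF_RESKEY_config=${OCF_RESKEY_config_default}}

# statefile under the cluster temp dir, unique per resource instance
statefile="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.master"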
Re: [Linux-ha-dev] Feedback on conntrackd RA by Dominik Klein
Or, put differently: is us tracking the supposed state really necessary, or can we inquire it from the service somehow?

From the submitted RA:
# You can't query conntrackd whether it is master or slave. It can be both at the same time.
# This RA creates a statefile during promote and enforces master-max=1 and clone-node-max=1

Knowing Dominik I think it's safe to assume he's done his homework on this, and hasn't put in this comment without careful consideration.

If I knew a way to query the state, believe me, I would use it. I totally understand this seems ugly the way it is and I agree 100%. However, having a master/slave RA is what the cluster needs imho to fully support conntrackd. Encouraging people to start conntrackd by init and then have the RA just execute commands for state-shipping seemed and seems odd to me (that's what the first RA did).

But I'm sure he won't mind if you manage to convince him otherwise.

Sure I won't. Maybe a newer version (if one exists) includes this. I'll have another look.

Regards
Dominik
___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] New stateful RA: conntrackd
Hi everybody So I updated my RA according to Florian's comments on Jonathan Petersson's conntrackd RA. I also contacted him in order to merge our RAs, no reply there yet. Once we talked, you will get an update by one of us. Regards Dominik #!/bin/bash # # # An OCF RA for conntrackd # http://conntrack-tools.netfilter.org/ # # Copyright (c) 2010 Dominik Klein # # This program is free software; you can redistribute it and/or modify # it under the terms of version 2 of the GNU General Public License as # published by the Free Software Foundation. # # This program is distributed in the hope that it would be useful, but # WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. # # Further, this software is distributed without any warranty that it is # free of the rightful claim of any third person regarding infringement # or the like. Any license provided herein, whether implied or # otherwise, applies only to this software file. Patent licenses, if # any, provided herein do not apply to combinations of this program with # other software, or any other product whatsoever. # # You should have received a copy of the GNU General Public License # along with this program; if not, write the Free Software Foundation, # Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA. # ### # Initialization: . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs export LANG=C LANGUAGE=C LC_ALL=C meta_data() { cat END ?xml version=1.0? !DOCTYPE resource-agent SYSTEM ra-api-1.dtd resource-agent name=conntrackd version1.1/version longdesc lang=en Master/Slave OCF Resource Agent for conntrackd /longdesc shortdesc lang=enThis resource agent manages conntrackd/shortdesc parameters parameter name=conntrackd longdesc lang=enFull path to conntrackd executable/longdesc shortdesc lang=enFull path to conntrackd executable/shortdesc content type=string default=/usr/sbin/conntrackd/ /parameter parameter name=conntrackdconf longdesc lang=enFull path to the conntrackd.conf file./longdesc shortdesc lang=enPath to conntrackd.conf/shortdesc content type=string default=/etc/conntrackd/conntrackd.conf/ /parameter parameter name=statefile longdesc lang=enFull path to the state file you wish to use./longdesc shortdesc lang=enFull path to the state file you wish to use./shortdesc content type=string default=/var/run/conntrackd.master/ /parameter /parameters actions action name=start timeout=240 / action name=promote timeout=90 / action name=demote timeout=90 / action name=notify timeout=90 / action name=stoptimeout=100 / action name=monitor depth=0 timeout=20 interval=20 role=Slave / action name=monitor depth=0 timeout=20 interval=10 role=Master / action name=meta-data timeout=5 / action name=validate-all timeout=30 / /actions /resource-agent END } meta_expect() { local what=$1 whatvar=OCF_RESKEY_CRM_meta_${1//-/_} op=$2 expect=$3 local val=${!whatvar} if [[ -n $val ]]; then # [, not [[, or it won't work ;) [ $val $op $expect ] return fi ocf_log err meta parameter misconfigured, expected $what $op $expect, but found ${val:-unset}. exit $OCF_ERR_CONFIGURED } conntrackd_is_master() { # You can't query conntrackd whether it is master or slave. It can be both at the same time. 
# This RA creates a statefile during promote and enforces master-max=1 and clone-node-max=1 if [ -e $STATEFILE ]; then return $OCF_SUCCESS else return $OCF_ERR_GENERIC fi } conntrackd_set_master_score() { ${HA_SBIN_DIR}/crm_master -Q -l reboot -v $1 } conntrackd_monitor() { rc=$OCF_NOT_RUNNING # It does not write a PID file, so check with pgrep pgrep -f $CONNTRACKD rc=$OCF_SUCCESS if [ $rc = $OCF_SUCCESS ]; then # conntrackd is running # now see if it acceppts queries if ! ($CONNTRACKD -C $CONNTRACKD_CONF -s /dev/null 21); then rc=$OCF_ERR_GENERIC ocf_log err conntrackd is running but not responding to queries fi if conntrackd_is_master; then rc=$OCF_RUNNING_MASTER # Restore master setting on probes if [ $OCF_RESKEY_CRM_meta_interval -eq 0 ]; then conntrackd_set_master_score $master_score fi else # Restore master setting on probes if [ $OCF_RESKEY_CRM_meta_interval -eq 0 ]; then conntrackd_set_master_score $slave_score fi fi fi return $rc } conntrackd_start() { rc=$OCF_ERR_GENERIC # Keep
[Linux-ha-dev] New stateful RA: conntrackd
Hi everybody,

I wrote a master/slave RA to manage conntrackd, the connection tracking daemon from the netfilter project. Conntrackd is used to replicate connection state between highly available stateful firewalls.

Conntrackd replicates data using multicast. Basically it sends state information about connections written to its kernel's connection tracking table. Replication slaves write these updates to an external cache. When a firewall is to take over the master role, it commits the external cache to the kernel and so knows the connections that were previously running through the old master system, and clients can continue working without having to open a new connection.

While there has been an RA for conntrackd (at least I found something that looked like it in a pastebin using google), this one was not able to deal with failback, which is a thing I needed, and was not yet included in the repository. I hope this one will be included.

The main challenge in this RA was the failback part. Say one system goes down completely. Then it loses the kernel connection tracking table and the external cache. Once it comes back, it will receive updates for new connections that are initiated through the master, but it will neither be sent the complete tracking table of the current master, nor can it request this (that's how I understand and tested conntrackd works, please correct me if I'm wrong :)). This may be acceptable for short-lived connections and configurations where there is no preferred master system, but it does become a problem if you have either of those.

So my approach is to send a so-called bulk update in two situations:

a) in the notify pre promote call, if the local machine is not the machine to be promoted. This part is responsible for sending the update to a preferred master that had previously failed (failback).

b) in the notify post start call, if the local machine is the master. This part is responsible for sending the update to a previously failed machine that re-joins the cluster but is not to be promoted right away.

For now I limited the RA to deal with only 2 clones and 1 master since this is the only testbed I have and I am not 100% sure what happens to the new master in situation a) if there are multiple slaves.

Configuration could look like this, notify=true is important:

primitive conntrackd ocf:intelegence:conntrackd \
        op monitor interval=10 timeout=10 \
        op monitor interval=11 role=Master timeout=10
primitive ip-extern ocf:heartbeat:IPaddr2 \
        params ip=10.2.50.237 cidr_netmask=24 \
        op monitor interval=10 timeout=10
primitive ip-intern ocf:heartbeat:IPaddr2 \
        params ip=10.2.52.3 cidr_netmask=24 \
        op monitor interval=10 timeout=10
ms ms-conntrackd conntrackd \
        meta target-role=Started globally-unique=false notify=true
colocation ip-intern-extern inf: ip-extern ip-intern
colocation ips-on-conntrackd-master inf: ip-intern ms-conntrackd:Master
order ips-after-conntrackd inf: ms-conntrackd:promote ip-intern:start

Please review and test the RA, post comments and questions. Maybe it can be included in the repository.

Regards
Dominik

ps. yes, some parts are from linbit's drbd RA and some parts may also be from Andrew's Stateful RA. Hope that's okay.

-- IN-telegence GmbH Oskar-Jäger-Str.
125 50825 Köln Registergericht AG Köln - HRB 34038 USt-ID DE210882245 Geschäftsführende Gesellschafter: Christian Plätke und Holger Jansen #!/bin/bash # # # An OCF RA for conntrackd # http://conntrack-tools.netfilter.org/ # # Copyright (c) 2010 Dominik Klein # # This program is free software; you can redistribute it and/or modify # it under the terms of version 2 of the GNU General Public License as # published by the Free Software Foundation. # # This program is distributed in the hope that it would be useful, but # WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. # # Further, this software is distributed without any warranty that it is # free of the rightful claim of any third person regarding infringement # or the like. Any license provided herein, whether implied or # otherwise, applies only to this software file. Patent licenses, if # any, provided herein do not apply to combinations of this program with # other software, or any other product whatsoever. # # You should have received a copy of the GNU General Public License # along with this program; if not, write the Free Software Foundation, # Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA. # ### # Initialization: . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs export LANG=C LANGUAGE=C LC_ALL=C meta_data() { cat END ?xml version=1.0? !DOCTYPE resource-agent SYSTEM ra-api-1.dtd resource-agent name=conntrackd version1.1/version longdesc lang=en Master/Slave OCF Resource Agent for conntrackd /longdesc shortdesc lang
Re: [Linux-HA] Need suggestion on STONITH device
On 04/01/2010 08:54 PM, Tony Gan wrote: Thank you for your reply, Dominik. I think UPS or PDU in this case is a better solution than a lights-out device, since they have separate power supply. They do? They _are_ the power supply for the node. So if the PDU supply is off, the node is off. I have not seen a PDU with multiple inputs yet (but there may be such device, I am no expert on that). Regards Dominik And I don't think we need to manage UPS or PDU's failure by our self, the manufacturer should take responsibility of this. Am I correct? But yes, probably need additional budgets for this. Anyway, again, thanks for your advice. I'm going to do some research on them. On Thu, Apr 1, 2010 at 6:38 AM, Dominik Klein d...@in-telegence.net wrote: Tony Gan wrote: Hi, For a two-node cluster, what are the best STONITH devices? Currently I am using Dell's iDrac for STONITH device. It works pretty well. However the biggest problem for iDrac or any other lights-out devices is that they share power supply with hosts machines. Once an active machine lost its power completely, you want to fail-over to the backup-node in your cluster. But with iDrac as your STONITH device you can not, because the STONITH resource on backup node will run into error (fail to connect to STONITH device, it's out of power too) , and refuse to start any resources. I was wondering what kind of STONITH devices everybody is using to solve this problem. And how much are they? Actually Pacemaker's page have a link talking about this: http://www.clusterlabs.org/doc/crm_fencing.html It suggests UPS (Uninterruptible Power Supply) as well as PDU (Power Distribution Unit). Anybody used them before? How well are they integrated with Heartbeat? What are the pros and cons? Hi I am using APC PDUs for my clusters. The setup is like: power supply circuit 1 - pdu 1 - node 1 power supply circuit 2 - pdu 2 - node 2 If a node fails, the corresponding pdu usually is accessible and manageable. However, if a pdu fails (and they probably can fail in ways we cannot really imagine (to quote Dejan)) that renders the same problem as yours. The node is down, the stonith device is down, so no resource takeover. But imho, this is not resolvable. At least I do not know of a way how to. If a PDU or UPS fails (node down and power device down), then the resources for the failed node will not be recovered since the failed node cannot be shot. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Need suggestion on STONITH device
Tony Gan wrote: Hi, For a two-node cluster, what are the best STONITH devices? Currently I am using Dell's iDrac for STONITH device. It works pretty well. However the biggest problem for iDrac or any other lights-out devices is that they share power supply with hosts machines. Once an active machine lost its power completely, you want to fail-over to the backup-node in your cluster. But with iDrac as your STONITH device you can not, because the STONITH resource on backup node will run into error (fail to connect to STONITH device, it's out of power too) , and refuse to start any resources. I was wondering what kind of STONITH devices everybody is using to solve this problem. And how much are they? Actually Pacemaker's page have a link talking about this: http://www.clusterlabs.org/doc/crm_fencing.html It suggests UPS (Uninterruptible Power Supply) as well as PDU (Power Distribution Unit). Anybody used them before? How well are they integrated with Heartbeat? What are the pros and cons? Hi I am using APC PDUs for my clusters. The setup is like: power supply circuit 1 - pdu 1 - node 1 power supply circuit 2 - pdu 2 - node 2 If a node fails, the corresponding pdu usually is accessible and manageable. However, if a pdu fails (and they probably can fail in ways we cannot really imagine (to quote Dejan)) that renders the same problem as yours. The node is down, the stonith device is down, so no resource takeover. But imho, this is not resolvable. At least I do not know of a way how to. If a PDU or UPS fails (node down and power device down), then the resources for the failed node will not be recovered since the failed node cannot be shot. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] messages from existing hearbeat on the same lan
Aclhk Aclhk wrote: On the same lan, there are already two heartbeat node 136pri and 137sec. I setup another 2 nodes with heartbeat. they keep receiving the following messages: heartbeat[9931]: 2010/01/19_10:53:01 WARN: string2msg_ll: node [136pri] failed authentication heartbeat[9931]: 2010/01/19_10:53:02 WARN: Invalid authentication type [1] in message! heartbeat[9931]: 2010/01/19_10:53:02 WARN: string2msg_ll: node [137sec] failed authentication heartbeat[9931]: 2010/01/19_10:53:02 WARN: Invalid authentication type [1] in message! ha.cf debugfile /var/log/ha-debug logfile /var/log/ha-log logfacility local0 bcast eth0 keepalive 5 warntime 10 deadtime 120 initdead 120 auto_failback off node 140openfiler1 node 141openfiler2 bcast for all nodes are same, that is eth0 pls advise how to avoid the messages. Use mcast or ucast instead of bcast? ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
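For illustration, the change boils down to a couple of ha.cf lines; the peer address and multicast group below are placeholders you would replace with your own:

# instead of:  bcast eth0
# unicast to the other node of *this* cluster (each node lists its peer)
ucast eth0 10.0.0.2
# or keep multicast but pick a group the other cluster does not use
# syntax: mcast dev mcast-group udp-port ttl loop
mcast eth0 239.0.0.43 694 1 0

Either way the two clusters stop receiving each other's packets, and the authentication warnings disappear.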
Re: [Linux-ha-dev] ulimit in ocf scripts
Andrew Beekhof wrote: On Tue, Jan 12, 2010 at 10:43 AM, Raoul Bhatia [IPAX] r.bha...@ipax.at wrote: On 01/12/2010 10:39 AM, Florian Haas wrote: Why not simply set that for root at boot? (it rhymes too :) because i do not like the idea that each and every process gets elevated limits by default. i think that there *should* be a generic way to configure ulimits an a per resource basis. I'm confident Dejan would be happy to accept a patch in which you add such a parameter to each resource agent where it makes sense. of course this would be possible. but i *think* it is more helpful to add this to e.g. the cib/lrmd/you name it. so before i/we implement the ulimit stuff *inside* lots of different RAs, i'd like to hear beekhof's or lars' comments. If you want a configurable per-resource limit - thats a resource parameter. Why would we want to implement another mechanism? Of course this would be a resource parameter. I think what he meant to say was that he does not want to have the change inside every RA executing the ulimit command but to have some cluster component (probably lrmd) do that. Regards Dominik ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
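A rough sketch of what the per-RA variant under discussion could look like; the parameter name max_open_files is purely illustrative and not an existing resource parameter:

# in the RA's start action, before launching the daemon
if [ -n "$OCF_RESKEY_max_open_files" ]; then
    if ! ulimit -n "$OCF_RESKEY_max_open_files"; then
        ocf_log err "Could not set ulimit -n $OCF_RESKEY_max_open_files"
        return $OCF_ERR_GENERIC
    fi
fi

Doing it in lrmd instead would mean the limit applies to a resource's operations without touching each individual RA, which is the trade-off discussed above.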
Re: [Linux-ha-dev] [PATCH]Support of stop escalation for mysql-RA.
I'd suggest an approach like Florian's from the Virtualdomain RA. Here's a quote, guess you get the idea. shutdown_timeout=$((($OCF_RESKEY_CRM_meta_timeout/1000)-5)) Regards Dominik Dejan Muhamedagic wrote: Hi Hideo-san, On Mon, Nov 30, 2009 at 11:00:05AM +0900, renayama19661...@ybb.ne.jp wrote: Hi, We discovered a problem in an test of mysql. It is the problem that mysql cannot stop. This problem seems to occur at the time of diskfull and high CPU load. We included an escalation stop like pgsql. The problem is broken off by this revision, and a stop succeeds. Please commit this patch in a development version. Many thanks for the patch. I'm just not sure about the default escalate time. You set it to 30 seconds, perhaps it should be set to something longer. Otherwise some cluster configurations where the stop operation takes longer may have problems. I have no idea which value should we use, but I would tend to make it longer rather than shorter. ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
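Spelled out, the suggested approach inside the stop action looks roughly like this. mysql_status stands in for the status check the RA already has, and the config/pid parameter names are assumed to match the current mysql RA:

# leave a safety margin below the stop operation's own timeout
shutdown_timeout=$((($OCF_RESKEY_CRM_meta_timeout/1000)-5))

mysqladmin --defaults-file=$OCF_RESKEY_config shutdown &

count=0
while [ $count -lt $shutdown_timeout ]; do
    mysql_status || return $OCF_SUCCESS      # stopped gracefully in time
    sleep 1
    count=$((count+1))
done

# escalation: the graceful shutdown took too long
ocf_log warn "MySQL did not stop within ${shutdown_timeout}s, sending SIGKILL"
kill -KILL $(cat $OCF_RESKEY_pid)

The nice property is that the escalation point scales with whatever stop timeout the cluster configuration actually uses, instead of relying on a fixed default.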
Re: [Linux-HA] heartbeat - execute a script on a running node when the other node is back?
Tomasz Chmielewski wrote: Dejan Muhamedagic wrote: Hi, On Sun, Nov 15, 2009 at 09:09:53PM +0100, Tomasz Chmielewski wrote: I have two nodes, node_1 and node_2. node_2 was down, but is now up. How can I execute a custom script on node_1 when it detects that node_2 is back? That's not possible. What would you want to with that script? I have two PostgreSQL servers running; pgpool-ii is started by Heartbeat to distribute the load (reads) among two servers and to send writes to both servers. When one PostgreSQL server fails, the setup will still work fine. When the failed PostgreSQL instance is back, the data should be first synchronized from the running PostgreSQL server to a server which was failed a while ago. It is best if such a script could be started by Heartbeat running on the active node, as soon as it detects that the other node is back. If you need such thing - I'd personally be most comfortable with not starting the cluster at boot time. Then you can do whatever you need to do and then - when you _know_ everything is right, the script is done etc. - start the cluster software. Just my personal preference. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Restrict resources to specific nodes only
Kenneth Simbron wrote: Hi, Is there a way to restrict some resources to work only on specific nodes and other resources on another nodes? http://clusterlabs.org/mediawiki/images/f/fb/Configuration_Explained.pdf Read up on location constraints. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
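In crm shell syntax such constraints look like this (resource and node names invented for the example):

# resource db may only run on node-a or node-b
location db-on-a   db 200:  node-a
location db-on-b   db 100:  node-b
location db-not-c  db -inf: node-c

# resource web is banned from node-a
location web-not-a web -inf: node-a

The -inf (minus INFINITY) rules are what actually forbid a node; positive scores only express a preference.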
Re: [Linux-ha-dev] Monitor operation for the Filesystem RA
Dejan Muhamedagic wrote: Hi Florian, On Wed, Sep 16, 2009 at 08:25:30AM +0200, Florian Haas wrote: Lars, Dejan, as discussed on #linux-ha yesterday, I've pushed a small changeset to the Filesystem RA that implements a monitor operation which checks whether I/O on the mounted filesystem is in fact possible. Any suggestions for improvement would be most welcome. IMO, the monitor operation is now difficult to understand. I don't mean the code, I didn't take a look at the code yet, but the usage. Also, as soon as you set the statusfile_prefix parameter, the 0 depth monitor changes behaviour. I don't find that good. The basic monitor operation should remain the same and just test if the filesystem is mounted as it always used to. I agree. The new parameter should influence only the monitor operations of higher (deeper :) depth. So, I'd propose to have two depths, say 10 and 20, of which the first would be just the read test and the second read-write. Why not 1 and 2? Then we'd have 0 = old behaviour 1 = read 2 = read/write Finally, the statusfile_prefix should be optional for deeper monitor operations and default to .${OCF_RESOURCE_INSTANCE}. If OCF_RESOURCE_INSTANCE doesn't contain the clone instance, then we should append the clone instance number (I suppose that it's available somewhere). As fgh said, when you want to monitor a readonly fs, you'd have to know the clone instance number for creating the file to read from. Not a good idea imho. Or you'll have several files around which would be even more ugly when you think about a larger cluster. Why do we have to make the name configurable at all? Why not just give it a generic name and only let the user configure OCF_CHECK_LEVEL for each monitor? That said, I have not dealt with cluster filesystems yet. Was the hostname-idea to avoid having multiple monitor instances trying to write to one file and maybe run into locking/timeout issues? Regards Dominik I hope that this way the usage would be more straightforward. At least it looks so to me. Do I win the prize for the longest changeset description or what? ;) We need good documentation. I think it's great to write such descriptions :) Cheers, Dejan Cheers, Florian ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
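For readers wondering how a deeper monitor is selected at all: the depth of a monitor operation is passed to the RA as OCF_CHECK_LEVEL, configured per operation. In crm shell syntax that would look roughly like the following (resource name, device and intervals are just examples, and the 10/20 values follow the proposal above):

primitive fs ocf:heartbeat:Filesystem \
    params device=/dev/drbd0 directory=/data fstype=ext3 \
    op monitor interval=20s timeout=40s \
    op monitor interval=60s timeout=60s OCF_CHECK_LEVEL=10 \
    op monitor interval=120s timeout=60s OCF_CHECK_LEVEL=20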
Re: [Linux-HA] how to get group members
Ivan Gromov wrote: Hi, all How to get group members? I use crm_resource -x -t group -r group_Name. Can I get members without xml part? What about crm configure show group-name ? Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Constraints works for one resource but not for another
Tobias Appel wrote: Hi, I have a very weird error with heartbeat version 2.14. I have two IPMI resources for my two nodes. The configuration is posted here: http://pastebin.com/m52c1809c node1 is named nagios1 node2 is named nagios2 now I have ipmi_nagios1 (which should run on nagios2 to shutdown nagios1) and ipmi_nagios2 (which should run on nagios1 to shutdown nagios2). It's confusing I know. Now I set up to constraints which force with score infinity a resource to only run on their designated node. For the resource ipmi_nagios2 it works without a problem. It only runs on nagios1 and is never started on nagios2. But the other resource which is identically configured (just the hostname differs) does not work - heartbeat always wants to start it on nagios1 and very seldom starts it on nagios2. Just now it failed to start on nagios1 and I hit clean up resource, waited a bit, failed again and after 3 times the cluster went havoc and turned off one of the nodes! I even tried to set a constraint via the UI - it's then labeled cli-constraint-name but even with this as well heartbeat still tried to start it on the wrong node! Now I'm really at a loss, maybe my configuration is wrong, or maybe it really is a bug in heartbeat. Here is the link to the configuration again: http://pastebin.com/m52c1809c I honestly don't know what to do anymore. I have to stop the ipmi service at the moment because otherwise it might randomly turn off one of the nodes, but without it we don't have any fencing so it's a quite delicate situation at the moment. Any input is greatly appreciated. Regards, Tobias The constraints look okay, but without logs, we cannot say why it does not do what you want. Also: look at the stonith device configuration: Is it okay for both primitives to have the same ip configured? I'd guess that will not successfully start the resource!? Maybe that's it already. I'd guess there was some failure before which brought up this situation (probably stonith start fail and stop fail)? It shouldn't turn off nodes at random. There's usually a pretty good reason when the cluster does this. Btw: the better way to make sure, a particular resource is only started on one node but never on the other is usually to configure -INFINITY for the other node instead of INFINITY for the node you want it to run on. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
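For reference, the -INFINITY variant Dominik mentions would be a constraint like the one below (the ids are made up, resource and node names are taken from the post); the mirror-image rule with ipmi_nagios2 and nagios2 covers the other resource:

<rsc_location id="loc-ipmi-nagios1" rsc="ipmi_nagios1">
  <rule id="loc-ipmi-nagios1-rule" score="-INFINITY">
    <expression id="loc-ipmi-nagios1-expr" attribute="#uname" operation="eq" value="nagios1"/>
  </rule>
</rsc_location>

This keeps each stonith resource away from the node it is supposed to shoot, instead of merely preferring the other node.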
Re: [Linux-HA] Pacemaker 1.4 HBv2 1.99 // About quorum choice (contd.)
Alain.Moulle wrote: Hello Andrew, Could you explain why this functionality is no longer available (configuration lines remain in ha.cf)? ipfail was replaced by pingd in v2. That was in the very first version of v2 afaik. And how should we proceed to avoid split-brain cases in a two-node cluster in case of problems on the heartbeat network? Make that network networks (plural) to reduce the chance of getting into a split-brain situation, and get and configure stonith devices to protect your data in case it happens anyway. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
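"Networks (plural)" just means listing more than one communication path in ha.cf, for instance (interface names and the peer address are placeholders):

bcast eth0
ucast eth1 10.0.1.2
# a serial cable works as an additional, network-independent path
serial /dev/ttyS0
baud 19200

Heartbeat sends its packets over all configured paths, so a single failed link no longer causes a split brain on its own.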
Re: [Linux-HA] Command to see if a resource is started or not
Tobias Appel wrote: Hi, I need a command to see if a resource is started or not. Somehow my IPMI resource does not always start, especially on one node (for example if I reboot the node, or have a failover). There is no error and nothing, it just does nothing at all. Usually I have to clean up the resource and then it comes back by itself. This is not really a problem since this only occurs after a failover or reboot and when that happens, somebody usually takes a look at the cluster anyway. But some people forget to start it again, and when we do maintenance we have to turn it off on purpose since it would go wreck havoc and turn off one of the nodes. So all I need is a command line tool to check wether a resource is currently started or not. I tried to check the resources with the failcount command, but it's always 0. And the crm_resource command is used to configure a resource but does not seem to give me the status of a resource. I know I can use crm_mon but I would rather have a small command since I could include this in our monitoring tool (nagios). crm resource status resid Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Command to see if a resource is started or not
Tobias Appel wrote: On 08/05/2009 10:30 AM, Dominik Klein wrote: Tobias Appel wrote: So all I need is a command line tool to check wether a resource is currently started or not. I tried to check the resources with the failcount command, but it's always 0. And the crm_resource command is used to configure a resource but does not seem to give me the status of a resource. I know I can use crm_mon but I would rather have a small command since I could include this in our monitoring tool (nagios). crm resource statusresid Regards Dominik Thanks for the fast reply Dominik, I forgot to mention that I still run Heartbeat version 2.1.4. It seems crm_resource does not respond to the status flag. Or am I too stupid? It is not crm_resource, I meant crm resource (notice the blank). But the crm command is not in 2.1.4 Try crm_resource -W -r resid Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
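Since the goal was a nagios check, here is a minimal wrapper around the 2.1.4 command above. The exact wording crm_resource prints may vary between versions, so the string match is an assumption to verify locally:

#!/bin/bash
# usage: check_crm_resource <resource-id>
out=$(crm_resource -W -r "$1" 2>&1)
if echo "$out" | grep -q "is running on"; then
    echo "OK - $out"
    exit 0
else
    echo "CRITICAL - $out"
    exit 2
fi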
Re: [Linux-HA] Pacemaker 1.4 HBv2 1.99 // About quorum choice (contd.)
Alain.Moulle wrote: Thanks Andrew, 1. So my understanding is that in a more than 2 nodes cluster , if two nodes are failed, the have_quorum is set to 0 by the cluster soft and the behavior is choosen by the administrator with the no-quorum-policy parameter. So the question is now : what is the best choice for no-quorum-policy value ? My feeling is that ignore would be the best choice if all services can run without problems on the remaining healthy nodes. That's not the only case this can happen. If you run into split-brain, each node may be healthy but the network connections may be broken. With ignore, you will end up with resources running multiple times. That's a problem sometimes ;) Don't use ignore in 2 node clusters. suicide or stop : my understanding is that it will kill the remaining healthy nodes or stop the services running on them, so it does not sound good for me ... freeze : don't see the difference between freeze and ignore ... ? Am I right ? 2. and what about the quorum policy in a two-nodes cluster ? You need working stonith and policy=ignore, as no node can have 50% on its own. When the connection is lost, one node will shoot the other. The cluster software should not be started at boot time, otherwise you will end up in a stonith death match. There was quite a nice explanation on the pacemaker list some time ago. Look for STONITH Deathmatch Explained in the archives. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
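Keeping the cluster software out of the boot sequence is just a matter of disabling the init script and starting it manually once an admin has looked at the fenced node; on a chkconfig-based distribution for example (use update-rc.d on Debian):

chkconfig heartbeat off
# after checking the node following a fence or reboot:
service heartbeat start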
Re: [Linux-HA] updating cib without status attributes
You can try to compose the output of cibadmin -Q -o crm_config|resources|constraints to something usable for you. looks like I have to run the command once for each type and then concatenate the results. That's sort of what I meant to say. Sorry for being unclear. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
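A simple way to compose those sections into one document, as a sketch (the surrounding <configuration> element is added by hand here and may or may not be what your tooling expects):

{
    echo '<configuration>'
    for section in crm_config resources constraints; do   # add "nodes" if you need it
        cibadmin -Q -o $section
    done
    echo '</configuration>'
} > running-config.xml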
Re: [Linux-HA] Adding to a group without downtime
Gavin Hamill wrote: Hi :) I'm using the Lenny packages http://people.debian.org/~madkiss/ha/ and have been enjoying success with pacemaker + heartbeat (I've used a heartbeat v1 config for years without problems). I have a few IPaddr2 primitives in groups, but I'd like to understand how I can add a new primitive into an existing group without stopping/starting. At the moment I have to: # primitive failover-ip-186 ocf:heartbeat:IPaddr2 params ip=10.8.2.186 op monitor interval=10s # show www-frontends group www-frontends failover-ip-184 failover-ip-185 # delete www-frontends # group www-frontends failover-ip-184 failover-ip-185 failover-ip-186 # commit But that takes failover-ip-184 and failover-ip-185 down for a couple of seconds. Is there a way to add a new primitive to a group with zero downtime? I tried using 'edit www-frontends' to stick failover-ip-186 on the end of the line but it complains loudly Call cib_modify failed (-47): Update does not conform to the configured schema/DTD Which version? Can you post the actual in- and output? Afaik, that's supposed to work. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] updating cib without status attributes
Is there a query/config dump setting that will dump the running config to the command line without the status attributes? cibadmin -Q -o configuration ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] updating cib without status attributes
Dave Augustus wrote: On Mon, 2009-07-27 at 15:09 +0200, Dominik Klein wrote: Is there a query/config dump setting that will dump the running config to the command line without the status attributes? cibadmin -Q -o configuration What a quick reply! However I get: Call cib_query failed (-48): Invalid CIB section specified Might be quering configuration was only implemented later. Honestly no idea. You can try to compose the output of cibadmin -Q -o crm_config|resources|constraints to something usable for you. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Stonith with APC Smart UPS 1000 +Network ManagementCard
Ehlers, Kolja wrote: Yeah it supports SSH but if I log in using SSH there is just a menu to configure the card. Since I can enter only 2 digits at that prompt 1- Control 2- Diagnostics 3- Configuration 4- Detailed Status 5- About UPS ESC- Back, ENTER- Refresh, CTRL-L- Event Log I think it will not accept commands to shut down a node. But maybe someone knows better? That looks about like what the cfg menu of my AP7920 looks like. I've never seen a cmdline interface. Try activating snmp and manage outlets via snmp. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
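Once SNMP is enabled on the management card, an outlet can be switched with a single snmpset against the card's PowerNet MIB. The control OID differs between card generations and is therefore only a placeholder below; look it up in the MIB that ships with the card:

# reboot outlet 3, assuming SNMP v1 and a write community of "private"
snmpset -v1 -c private <card-ip> <outlet-control-oid>.3 i 3

The apcmastersnmp stonith plugin mentioned in the follow-up does essentially this for you.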
Re: [Linux-HA] Stonith with APC Smart UPS1000 +Network ManagementCard
The stonith plugins are part of the heartbeat package iirc. I think there were reports in the past that those were built without snmp support and thus the apcmastersnmp plugin was not included. Maybe you have to build it yourself. I have no idea whether it will work for your device. But if it is just a different mib, it should not be too hard to modify. Regards Dominik Ehlers, Kolja wrote: Hello, do you think that the apcmastersnmp plugin will work? Dejan noted that the MIB probably will not fit. I have installed the pacemaker-mgmt-1.99.1-2.2.i586.rpm package and also am monitoring my cluster through snmp but the apcmastersnmp plugin is not installed. Is it in one of those packages: pacemaker-mgmt-client-1.99.2-1.1.i586.rpm pacemaker-mgmt-devel-1.99.2-1.1.i586.rpm -Ursprüngliche Nachricht- Von: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] Im Auftrag von Dominik Klein Gesendet: Freitag, 10. Juli 2009 08:27 An: General Linux-HA mailing list Betreff: Re: [Linux-HA] Stonith with APC Smart UPS1000 +Network ManagementCard Ehlers, Kolja wrote: Yeah it supports SSH but if I log in using SSH there is just a menu to configure the card. Since I can enter only 2 digits at that prompt 1- Control 2- Diagnostics 3- Configuration 4- Detailed Status 5- About UPS ESC- Back, ENTER- Refresh, CTRL-L- Event Log I think it will not accept commands to shut down a node. But maybe someone knows better? That looks about like what the cfg menu of my AP7920 looks like. I've never seen a cmdline interface. Try activating snmp and manage outlets via snmp. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems Geschäftsführung: Dr. Michael Fischer, Reinhard Eisebitt Amtsgericht Köln HRB 32356 Steuer-Nr.: 217/5717/0536 Ust.Id.-Nr.: DE 204051920 -- This email transmission and any documents, files or previous email messages attached to it may contain information that is confidential or legally privileged. If you are not the intended recipient or a person responsible for delivering this transmission to the intended recipient, you are hereby notified that any disclosure, copying, printing, distribution or use of this transmission is strictly prohibited. If you have received this transmission in error, please immediately notify the sender by telephone or return email and delete the original transmission and its attachments without reading or saving in any manner. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Resource set question
Steinhauer Juergen wrote: Hi guys! In my cluster setup, I have 6 IP addresses which should be started in parallel for speed purpose, and two apps, depending on the six addresses. What would be the best way to configure this? Putting all IPs in a group will start them one after another. Bad. Set ordered=false for the ip group. That will start them in parallel. I think. Then specify a resource order constraint to start your app group after the ip group and a colocation constraint to have the apps on the same node as the ips. Regards Dominik A colocation with a set including the IPs (sequential=false) and the two apps will not migrate the apps (and the remaining IPs), if one IP should fail... I'm happy about every proposal. Regards ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
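Put together, the suggestion could look like this in crm shell syntax (resource names invented):

group ips ip1 ip2 ip3 ip4 ip5 ip6 \
    meta ordered=false
group apps app1 app2
colocation apps-with-ips inf: apps ips
order apps-after-ips inf: ips apps

With ordered=false the group only provides colocation, so the six addresses start in parallel, while the order and colocation constraints tie the application group to them as a whole.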
Re: [Linux-HA] Master-slave, stopping a slave.
c smith wrote: Hi- I currently implement DRBD with Pacemaker. The DRBD resource is configured as a multi-state Master-slave resource in which node1 is the default master and node2 is the default slave. I am putting together a backup system that will run some automated scheduled tasks on node2 (assuming both nodes are online). Such tasks include disconnecting the DRBD resource temporarily and promoting it to primary/standalone. I'm wondering, is there a way to manually stop the child clone on that node before starting the backup, then restarting it after? Something along the lines of `crm resource ms-drbd:1 stop`. I have yet to test the system as it is but, from what I understand, if I manually disconnect the resource, Pacemaker's monitor functions will not be happy and likely throw monitor errors and/or try to restart and reconnect the resource. The Pacemaker's user guide does not go into detail regarding managing multi-state child resources. Is it possible to do without stopping the entire m-s resource? It has been stated that is it not intended to mess with the internals of child-management. Never to use child-instance-ids in constraints and so on. They will just cause you unwanted trouble. I'd probably just unmanage the drbd resource, then the cluster will not monitor it. Then do whatever you need to do and afterwards, let the cluster manage drbd again. Look into the is-managed meta attribute. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Master-slave, stopping a slave.
c smith wrote: Dominik- Thanks for the reply.. I'm aware that the documents advise against it, but surely there must be a way. I was just looking at the new DRBD 8.3.2. It includes a fencing handler script that, upon failure of a DRBD master, adds a -INFINITY location constraint into the CIB that prevents the MS resource from being 'Master' on the failed node until resync/uptodate (at which point its deleted from the CIB) Pasted from ./drbd-8.3.2/scripts/crm-fence-peer.sh: new_constraint=\ rsc_location rsc=\$master_id\ id=\$id_prefix-$master_id\ rule role=\Master\ score=\-INFINITY\ id=\$id_prefix-rule-$master_id\ expression attribute=\$fencing_attribute\ operation=\ne\ value=\$fencing_value\ id=\$id_prefix-expr-$master_id\/ /rule /rsc_location This constraint, when in place, succesfully stops the child from running on that node. Wondering if it is possible to do the same with the slave resource. I will also try out unmanage'ing the resource and see if that will accomplish similar. It will not stop the slave. It will unmanage the entire m/s drbd resource, meaning the cluster will no longer monitor its status. So if you disconnect drbd and do your backup business of whatever kind, the cluster will not notice. Then you can re-connect drbd and let pacemaker manage the resource again. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
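A possible backup sequence on the slave along those lines, assuming the multi-state resource is called ms-drbd and the DRBD resource is r0. Note that if the backup writes to (or even just mounts) the promoted copy, the data diverges and DRBD may refuse the later reconnect unless the local changes are discarded:

crm resource unmanage ms-drbd

drbdadm disconnect r0          # leave the replication link
drbdadm primary r0             # promote the standalone copy
# ... run the backup against the local device ...
drbdadm secondary r0
drbdadm connect r0             # may need: drbdadm -- --discard-my-data connect r0

crm resource manage ms-drbd

While the resource is unmanaged the cluster will not act on its state, so none of this triggers recovery.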
Re: [Linux-HA] Add resource to a group
David Hoskinson wrote: I tested the lsb for amavis and it passed all the tests. I was seeing timeout errors in the logs saying over 20s. By specifying only timeout instead of both interval and timeout in my start op is that what you mean... Such as Primitive amavisd lsb:amavisd op monitor timeout=45s You're supposed to add a _start_ op, not a _monitor_ op. Like primitive ... op start timeout=45s Regards Dominik Thanks for the help On 6/25/09 8:44 AM, David Hoskinson dhoskin...@eng.uiowa.edu wrote: Hmmm I understand I will try this in a bit, thanks for the tip On 6/25/09 1:20 AM, Dominik Klein d...@in-telegence.net wrote: David Hoskinson wrote: Thanks Got it going again. However my amavisd service fails with a unknown exec error. Its the only one that won't work, and isn't related to the group question. I have it setup the same as postfix, dovecot, etc. Primitive amavisd lsb:amavisd op monitor interval=30s timeout=30s Is that amavisd script LSB compliant? See http://www.linux-ha.org/LSBResourceAgent for how to check on that. Just wondering if its taking too long to start or what... If you have any ideas that would be great. If start takes longer than the default operation timeout (20s by default), then you should see that in the logs. You can work around that by specifying a start op with timeout=whatever amount of time you need. Make sure you do NOT set an interval for the start op. Seems to be a common mistake. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
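So the complete primitive would read as follows, with 45s standing in for however long amavisd really needs to start:

primitive amavisd lsb:amavisd \
    op start timeout=45s \
    op monitor interval=30s timeout=30s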
Re: [Linux-HA] Add resource to a group
David Hoskinson wrote: Thanks Got it going again. However my amavisd service fails with a unknown exec error. Its the only one that won't work, and isn't related to the group question. I have it setup the same as postfix, dovecot, etc. Primitive amavisd lsb:amavisd op monitor interval=30s timeout=30s Is that amavisd script LSB compliant? See http://www.linux-ha.org/LSBResourceAgent for how to check on that. Just wondering if its taking too long to start or what... If you have any ideas that would be great. If start takes longer than the default operation timeout (20s by default), then you should see that in the logs. You can work around that by specifying a start op with timeout=whatever amount of time you need. Make sure you do NOT set an interval for the start op. Seems to be a common mistake. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Failover problem
The default value for stonith-enabled is true. If you however do not have a stonith device, that will give you an endless loop of unsuccessfully trying to shoot the other node before doing anything else to the resources the dead node was running. try crm configure property stonith-enabled=false crm configure property no-quorum-policy=ignore Be warned though! Your cluster should not go into production like this! David Hoskinson wrote: Im sorry this is maybe where my knowledge is lacking. I don't have the hardware for a third node, but I understand your reasoning Don't understand how to add stonith and haven't found a good document for that... I also get No STONITH resources have been defined when I do a crm_verify -LV Don't know how to set quorom policy to ignore. Which of the last 2 would you suggest, and where to look for info on how to do it. thanks On 6/24/09 3:26 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: On Wed, Jun 24, 2009 at 02:05:46PM -0500, David Hoskinson wrote: System running 2.99 heartbeat and pacemaker 1.04. Running fine in master slave mode. However if I shut down the slave server, all the services stop on the master until the slave comes back up, does the election and once again starts the services on the master. This doesn't seem to be the way it should be. Same thing if I shut the master down. Services go off line until master is back up. Two node cluster, one vote down, 50% is NOT majority - single node has no quorum. Quorum policy probably says: no quorum - stop. You need to - add more nodes (just to have a real quorum), and/or - add stonith, and/or - set quorum policy to ignore. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Monitoring resources
Koen Verwimp wrote: Hi! I have defined a resources called rg_alfresco_ip . This resource consists of a OCF script (AlfrescoIP). This is script is a copy of IPAddr but with a customized status/monitoring procedure. group id= rg_alfresco_ip primitive class=ocf type=AlfrescoIP provider=heartbeat id=ip_AS meta_attributes attributes nvpair id=Alfresco_ip_meta2 name=resource_failure_stickiness value=-INFINITY/ /attributes /meta_attributes instance_attributes id=ip_alfresco_attributes attributes nvpair id=IPaddr2_AS_pair1 name=ip value=192.168.103.52/ nvpair id=IPaddr2_AS_pair2 name=nic value=eth0/ nvpair id=IPaddr2_AS_pair3 name=iflabel value=VIP_AS/ nvpair id=IPaddr2_AS_pair4 name=broadcast value=192.68.103.255/ nvpair id=IPaddr2_AS_pair5 name=cird_netmask value=255.255.255.0/ /attributes /instance_attributes operations op id=alfresco_ip_op1 name=start timeout=60s prereq=fencing on_fail=restart/ op id=alfresco_ip_op2 name=stop timeout=60s on_fail=fence/ op id=alfresco_ip_op3 start_delay=30s name=monitor interval=30s timeout=10s on_fail=restart/ /operations /primitive /group Because I have configured a constraint (see below), rg_alfresco_ip will be started on his default node (DocuCluster03). rule id=prefered_location_alfresco score=100 expression attribute=#uname id=prefered_location_alfresco_expr operation=eq value=DocuCluster03 / /rule You can see the monitor operation defined the resource group above. If the monitor status returns a fail code, the on_fail procedure is set on “restart”. Because the resource_failure_stickiness is equal to –INFINITY, the resource will be restarted/fail over on my second server (DocuCluster04). Problem: If rg_alfresco_ip also fails on the second node (DocuCluster04), it don’t try to fail_over on node1 (DocuCluster03) again. Anyone an idea why Heartbeat doesn’t try to fail over back on node1 (DocuCluster03)? The score for both nodes is negative, ie the resource cannot run there. End of story. Do I have to reset some error counter on node1? Yes. Try crm_failcount --help And - since you're speaking of resource failure stickiness - please consider upgrading to pacemaker. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
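The failcount reset looks like this on the command line (node and resource names taken from the configuration above; check crm_failcount --help for the exact options of your version):

# show the current value
crm_failcount -G -U DocuCluster03 -r ip_AS
# reset it so node1 becomes eligible again
crm_failcount -D -U DocuCluster03 -r ip_AS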
Re: [Linux-HA] Problems With SLES11 + DRBD
darren.mans...@opengi.co.uk wrote: Hello everyone. Long post, sorry. I've been trying to get SLES11 with Pacemaker 1.0 / OpenAIS working for most of this week without success so far. I thought I may as well bundle my problems into one mail to see if anyone can offer any advice. Goal: I'm trying to get a 2 node Active/Passive cluster working with DRBD replication, an ext3 FS on top of DRBD and a virtual IP. I want the active node to have a mounted FS that I can serve requests from using ProFTPD or another FTP daemon. If the active node fails I want the cluster to migrate all 4 resources (DRBD, FS, ProFTPD, Virtual IP) across to the other node. I don't have any STONITH devices at the moment. Approach: We are going with SLES11 with Pacemaker 1.0.3 and OpenAIS 0.80.3, after already using SLES10SP2 with Heartbeat 2.1.4 and ldirectord in a live running 2-node Active/Active cluster. We are using LVM under DRBD for future disk expansion. Problem1 - Using DRBD OCF RA: I wanted to use the latest and greatest for the approaches, so tried the DRBD OCF RA following this howto: http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 . The configuration works and I can manually migrate resources but if I just reboot the node that has the drbd resource on it I see the resource gets migrated to the other node for about 2 seconds then is stopped: Normal operation: Last updated: Fri May 1 16:33:00 2009 Current DC: gihub2 - partition with quorum And this is your reason. The no-quorum-policy default is stop (you even configured it, see below), which means do not run any resources if you do not have qorum. The node is alone, so it does not have quorum. If you want it to run things anyway, set no-quorum-policy to ignore. That would be the old heartbeat behaviour. Version: 1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a 2 Nodes configured, 2 expected votes 1 Resources configured. Online: [ gihub1 gihub2 ] drbd0 (ocf::heartbeat:drbd): Started gihub1 Reboot gihub1: Last updated: Fri May 1 16:35:34 2009 Current DC: gihub2 - partition with quorum Version: 1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a 2 Nodes configured, 2 expected votes 1 Resources configured. Online: [ gihub2 ] OFFLINE: [ gihub1 ] drbd0 (ocf::heartbeat:drbd): Started gihub2 Then after a couple of seconds: Last updated: Fri May 1 16:37:11 2009 Current DC: gihub2 - partition WITHOUT quorum Version: 1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a 2 Nodes configured, 2 expected votes 1 Resources configured. Online: [ gihub2 ] OFFLINE: [ gihub1 ] /var/log/messages says: May 1 16:46:33 gihub2 openais[5362]: [TOTEM] The token was lost in the OPERATIONAL state. May 1 16:46:33 gihub2 openais[5362]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes). May 1 16:46:33 gihub2 openais[5362]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). May 1 16:46:33 gihub2 openais[5362]: [TOTEM] entering GATHER state from 2. May 1 16:46:36 gihub2 kernel: drbd0: conn( WFConnection - Disconnecting ) May 1 16:46:36 gihub2 kernel: drbd0: Discarding network configuration. 
May 1 16:46:36 gihub2 kernel: drbd0: Connection closed May 1 16:46:36 gihub2 kernel: drbd0: conn( Disconnecting - StandAlone ) May 1 16:46:36 gihub2 kernel: drbd0: receiver terminated May 1 16:46:36 gihub2 kernel: drbd0: Terminating receiver thread May 1 16:46:36 gihub2 kernel: drbd0: disk( UpToDate - Diskless ) May 1 16:46:36 gihub2 kernel: drbd0: drbd_bm_resize called with capacity == 0 May 1 16:46:36 gihub2 kernel: drbd0: worker terminated May 1 16:46:36 gihub2 kernel: drbd0: Terminating worker thread May 1 16:46:36 gihub2 openais[5362]: [TOTEM] entering GATHER state from 0. May 1 16:46:36 gihub2 openais[5362]: [TOTEM] Creating commit token because I am the rep. May 1 16:46:36 gihub2 openais[5362]: [TOTEM] Saving state aru 6b high seq received 6b May 1 16:46:36 gihub2 lrmd: [5370]: info: rsc:drbd0: stop May 1 16:46:36 gihub2 cib: [5369]: notice: ais_dispatch: Membership 400: quorum lost May 1 16:46:36 gihub2 openais[5362]: [TOTEM] Storing new sequence id for ring 190 May 1 16:46:36 gihub2 openais[5362]: [TOTEM] entering COMMIT state. May 1 16:46:36 gihub2 openais[5362]: [TOTEM] entering RECOVERY state. May 1 16:46:36 gihub2 openais[5362]: [TOTEM] position [0] member 2.21.4.41: May 1 16:46:36 gihub2 openais[5362]: [TOTEM] previous ring seq 396 rep 2.21.4.40 May 1 16:46:36 gihub2 openais[5362]: [TOTEM] aru 6b high delivered 6b received flag 1 May 1 16:46:36 gihub2 openais[5362]: [TOTEM] Did not need to originate any messages in recovery. May 1 16:46:36 gihub2 openais[5362]: [TOTEM] Sending
Re: [Linux-HA] Problems With SLES11 + DRBD
Dominik Klein wrote: darren.mans...@opengi.co.uk wrote: Hello everyone. Long post, sorry. I've been trying to get SLES11 with Pacemaker 1.0 / OpenAIS working for most of this week without success so far. I thought I may as well bundle my problems into one mail to see if anyone can offer any advice. Goal: I'm trying to get a 2 node Active/Passive cluster working with DRBD replication, an ext3 FS on top of DRBD and a virtual IP. I want the active node to have a mounted FS that I can serve requests from using ProFTPD or another FTP daemon. If the active node fails I want the cluster to migrate all 4 resources (DRBD, FS, ProFTPD, Virtual IP) across to the other node. I don't have any STONITH devices at the moment. Approach: We are going with SLES11 with Pacemaker 1.0.3 and OpenAIS 0.80.3, after already using SLES10SP2 with Heartbeat 2.1.4 and ldirectord in a live running 2-node Active/Active cluster. We are using LVM under DRBD for future disk expansion. Problem1 - Using DRBD OCF RA: I wanted to use the latest and greatest for the approaches, so tried the DRBD OCF RA following this howto: http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 . The configuration works and I can manually migrate resources but if I just reboot the node that has the drbd resource on it I see the resource gets migrated to the other node for about 2 seconds then is stopped: Normal operation: Last updated: Fri May 1 16:33:00 2009 Current DC: gihub2 - partition with quorum And this is your reason. Bla. (read below) The no-quorum-policy default is stop (you even configured it, see below), which means do not run any resources if you do not have qorum. The node is alone, so it does not have quorum. If you want it to run things anyway, set no-quorum-policy to ignore. That would be the old heartbeat behaviour. Version: 1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a 2 Nodes configured, 2 expected votes 1 Resources configured. Online: [ gihub1 gihub2 ] drbd0 (ocf::heartbeat:drbd): Started gihub1 Reboot gihub1: Last updated: Fri May 1 16:35:34 2009 Current DC: gihub2 - partition with quorum Version: 1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a 2 Nodes configured, 2 expected votes 1 Resources configured. Online: [ gihub2 ] OFFLINE: [ gihub1 ] drbd0 (ocf::heartbeat:drbd): Started gihub2 Then after a couple of seconds: Last updated: Fri May 1 16:37:11 2009 Current DC: gihub2 - partition WITHOUT quorum Here you are without quorum. Sorry. Regards Dominik Version: 1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a 2 Nodes configured, 2 expected votes 1 Resources configured. Online: [ gihub2 ] OFFLINE: [ gihub1 ] /var/log/messages says: May 1 16:46:33 gihub2 openais[5362]: [TOTEM] The token was lost in the OPERATIONAL state. May 1 16:46:33 gihub2 openais[5362]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes). May 1 16:46:33 gihub2 openais[5362]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). May 1 16:46:33 gihub2 openais[5362]: [TOTEM] entering GATHER state from 2. May 1 16:46:36 gihub2 kernel: drbd0: conn( WFConnection - Disconnecting ) May 1 16:46:36 gihub2 kernel: drbd0: Discarding network configuration. 
May 1 16:46:36 gihub2 kernel: drbd0: Connection closed May 1 16:46:36 gihub2 kernel: drbd0: conn( Disconnecting - StandAlone ) May 1 16:46:36 gihub2 kernel: drbd0: receiver terminated May 1 16:46:36 gihub2 kernel: drbd0: Terminating receiver thread May 1 16:46:36 gihub2 kernel: drbd0: disk( UpToDate - Diskless ) May 1 16:46:36 gihub2 kernel: drbd0: drbd_bm_resize called with capacity == 0 May 1 16:46:36 gihub2 kernel: drbd0: worker terminated May 1 16:46:36 gihub2 kernel: drbd0: Terminating worker thread May 1 16:46:36 gihub2 openais[5362]: [TOTEM] entering GATHER state from 0. May 1 16:46:36 gihub2 openais[5362]: [TOTEM] Creating commit token because I am the rep. May 1 16:46:36 gihub2 openais[5362]: [TOTEM] Saving state aru 6b high seq received 6b May 1 16:46:36 gihub2 lrmd: [5370]: info: rsc:drbd0: stop May 1 16:46:36 gihub2 cib: [5369]: notice: ais_dispatch: Membership 400: quorum lost May 1 16:46:36 gihub2 openais[5362]: [TOTEM] Storing new sequence id for ring 190 May 1 16:46:36 gihub2 openais[5362]: [TOTEM] entering COMMIT state. May 1 16:46:36 gihub2 openais[5362]: [TOTEM] entering RECOVERY state. May 1 16:46:36 gihub2 openais[5362]: [TOTEM] position [0] member 2.21.4.41: May 1 16:46:36 gihub2 openais[5362]: [TOTEM] previous ring seq 396 rep 2.21.4.40 May 1 16:46:36 gihub2 openais[5362]: [TOTEM] aru 6b high delivered 6b received flag 1 May 1 16:46:36 gihub2 openais[5362]: [TOTEM] Did not need to originate any messages in recovery. May 1 16:46:36 gihub2 openais
[Linux-ha-dev] Patch: RA mysql
Trivial. See attached patch. Regards Dominik

exporting patch:

# HG changeset patch
# User Dominik Klein d...@in-telegence.net
# Date 1240578752 -7200
# Node ID 2d97904c385cc9b4779286001611bd748f48589d
# Parent 60cc2d6eee88ff6c2dedf7b539b9ee018efda6da
Low: RA mysql: Correctly remove eventually remaining socket

diff -r 60cc2d6eee88 -r 2d97904c385c resources/OCF/mysql
--- a/resources/OCF/mysql	Fri Apr 24 08:38:48 2009 +0200
+++ b/resources/OCF/mysql	Fri Apr 24 15:12:32 2009 +0200
@@ -419,7 +419,7 @@
     ocf_log info "MySQL stopped";
     rm -f /var/lock/subsys/mysqld
-    rm -f $OCF_RESKEY_datadir/mysql.sock
+    rm -f $OCF_RESKEY_socket
     return $OCF_SUCCESS
 }

___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-HA] Assymetric Clustering
fsalas wrote: Hi, I'm quite new to clustering and HeartBeat, but as far as I know, a very nice packages. Well, here is my problem, I'm willing to setup a cluster for an small enterprise that will have several services located in virtual machines, to make it simpler, let's say we have four nodes in the cluster, in two nodes will run service A and in the other two nodes will run service B. I would like to make it in one cluster and not in two miniclusters because of later migration posibilities, easier administration,etc. I'm working with Ubuntu server 8.10 , with heartbeat-2 distrib packages. Now, Ive setup the first two nodes, with service A with no problem , service A is lsb, in my test is simply an apt-proxy with a virtual IP and a drbd for shared storage. After learning how to do it, it works flawesly, crm_verify didn't complain , all just fine. I've setup location rules to only let service A to run in these two nodes Then I decided to add the other two nodes to the cluster, to continue with service B, and here is the problem, even if these new nodes aren't allowed to run service A, it seems that the CRM tells the LRM to monitor service A on these nodes, as apt-proxy and drbd are not even installed there, it complains with errors with drbd, and failed on apt-proxy. AM I missing something here, or those test shouldn't be there, as location forbids those nodes for running this services.:,( I would really appreciate any light you can bring on this, as Im struggling with it for the last days. thanx in advance, and my apologies if my english is not as good as it should! :-D We can only guess if you dont share configuration files and logs, but I guess what you see is the probe operations returning not installed. A probe is run on every node to find out in which state the resource is there before doing anything to the resource. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Restart a service without the dependent services restarting?
Noah Miller wrote: Hi - Is it possible to restart a clustered service (v2 cluster) without its dependent services also stopping and starting? When the constraint score is advisory (0), dependencies should not be restarted, but then they are not really dependencies in the sense of the word. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
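In crm shell terms that is the difference between a mandatory and an advisory score on the order constraint (names invented; only one of the two variants would actually be configured):

# mandatory: restarting ip forces web to restart as well
order web-after-ip inf: ip web

# advisory: the order is only honoured when both resources are being
# started or stopped anyway; restarting ip alone leaves web running
order web-after-ip 0: ip web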
Re: [Linux-HA] Re: Stopping the Heartbeat daemon does not stop the DRBD Daemon
Joe Bill wrote: Stopping the Heartbeat daemon (service heartbeat stop) does not stop the DRBD daemon even if it is one of the resources. - Heartbeat and DRBD are 2 different products/packages - Like most services, DRBD doesn't need Heartbeat to run. You can set up and run DRBD volumes without Heartbeat installed, or any cluster supervisor. - The DRBD daemons provide the communication interface for each network volume and are therefor an integral part of the volume management. Without the DRBD daemons, you (manually) and Heartbeat (automagically) could not handle the DRBD volumes. Just to avoid confusion: There is no such thing as a DRBD daemon. DRBD is a kernel module. - If you look carefully at your startup, DRBD daemons start whether or not Heartbeat is started. That depends on your setup. Maybe in yours it does and it should. In others it does not and it should not. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Re: Re: Re: Stopping the Heartbeat daemon does not stop the DRBD Daemon
- The DRBD daemons provide the communication interface for each network volume and are therefor an integral part of the volume management. Without the DRBD daemons, you (manually) and Heartbeat (automagically) could not handle the DRBD volumes. Just to avoid confusion: There is no such thing as a DRBD daemon. DRBD is a kernel module. Now I'm the one confused. What are these processes that show up when I ps -ef ? root..25621..0..2008..?00:00:00 [drbd7_worker] root.175581..0..2008..?00:00:00 [drbd7_receiver] root.246471..0.Jan02..?00:00:27 [drbd7_asender] Doesn't the '1'---^ here, mean 'root' detached ? Those are the kernel threads (indicated by the enclosing []) Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] HA Books
darren.mans...@opengi.co.uk wrote: Hi. Can anyone recommend any good books about HA with regards to the latest incarnations such as Pacemaker etc? I understand enough about the CRM and heartbeat 2 to get by but lots of the stuff on this list still goes over my head. Thanks. Darren Mansell There was a discussion about books in january. Have a look in the archives. Bottom line: There is one (1) book. It is in german and it is (no offense Michael) somewhat outdated at least in some parts. Iirc, it was written with version 2.1.3 I'd personally just go through the pdf docs from the clusterlabs site. They cover a lot. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Stopping the Heartbeat daemon does not stop the DRBD Daemon
Jerome Yanga wrote: Stopping the Heartbeat daemon (service heartbeat stop) does not stop the DRBD daemon even if it is one of the resources. # service heartbeat stop Stopping High-Availability services: [ OK ] # service drbd status drbd driver loaded OK; device status: version: 8.2.7 (api:88/proto:86-88) GIT-hash: 61b7f4c2fc34fe3d2acf7be6bcc1fc2684708a7d build by r...@nomen.esri.com, 2009-03-24 08:29:57 m:res csst ds p mounted fstype 0:r0 Unconfigured It stops your drbd resource (device). It just does not unload the module. That is the expected behaviour. Regards Dominik Running the command below stops the DRBD daemon. Service drbd stop Applications Installed: === drbd-8.2.7-3 heartbeat-2.99.2-6.1 pacemaker-1.0.2-11.1 CIB.xml: # crm configure show primitive fs0 ocf:heartbeat:Filesystem \ params fstype=ext3 directory=/data device=/dev/drbd0 primitive VIP ocf:heartbeat:IPaddr \ params ip=10.50.26.250 \ op monitor interval=5s timeout=5s primitive drbd0 ocf:heartbeat:drbd \ params drbd_resource=r0 \ op monitor interval=59s role=Master timeout=30s \ op monitor interval=60s role=Slave timeout=30s group DRBD_Group fs0 VIP \ meta collocated=true ordered=true migration-threshold=1 failure-timeout=10s resource-stickiness=10 ms ms-drbd0 drbd0 \ meta clone-max=2 notify=true globally-unique=false target-role=Started colocation DRBD_Group-on-ms-drbd0 inf: DRBD_Group ms-drbd0:Master order ms-drbd0-before-DRBD_Group inf: ms-drbd0:promote DRBD_Group:start Help. Regards, jerome ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] pingd/pacemaker
I know. But this attrbiut does not exist in my setup. pacemaker verison 1.0.1-1. Is this a feature of 1.0.2? 1.0.1 is 4 months old. The RA was updated with those features 3 months ago. So basically, yes. You could still update the single RA from the mercurial repository though. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] showscores.sh for pacemaker 1.0.2
So here's an update. Michael Schwartzkopf pointed out a bug regarding groups. That has been fixed now and the appropriate values should be shown. Thanks! There's not been a lot of feedback, is it because nobody uses the script or does it just work for you? Regards Dominik Dominik Klein wrote: Hi I made the necessary changes to the showscores script to work with pacemaker 1.0.2. Please test and report problems. Has been reported to work by some people and should go into the repository soon. Still, I'd like more people to test and confirm. Important changes: * correctly fetch stickiness and migration-threshold for complex resources (master and clone) * adjust column-width according to the length of resources' and nodes' names Regards Dominik showscores.sh Description: Bourne shell script ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Heartbeat v2 stickiness, score and more
florian.engelm...@bt.com wrote: Hello, I spent the whole afternoon searching for good heartbeat v2 documentation, but it looks like this is somewhat difficult to find. Maybe someone in here can help me? Anyway I have a short question about stickiness. I only know Sun Cluster, but I have to build up knowledge about heartbeat clusters since we are running two Debian heartbeat clusters now. Those two are failover clusters providing web services, nagios and vserver virtual hosts. Let's say resource_group_a is running on node1 and resource_group_b on node2. If I reboot node2, resource_group_b will switch to node1. But once node2 is up again, resource_group_b will switch back to node2. That is what I don't want the cluster to do. No switchback... How can I do that? crm_attribute --type crm_config --attr-name default-resource-stickiness --attr-value INFINITY Old versions might need _ instead of - in the attr-name. If yours does, take that as a hint that you should upgrade your cluster software. And which command is used to switch one resource group to another node (not marking any node as standby)? crm_resource Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
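A hedged sketch of both commands mentioned above, using the resource name from the question (exact crm_resource options can differ between heartbeat 2.x releases, so check crm_resource --help on your version):
crm_attribute --type crm_config --attr-name default-resource-stickiness --attr-value INFINITY
# move resource_group_b away from its current node (optionally add -H <node> to pick the target):
crm_resource -M -r resource_group_b
# later, remove the constraint that -M inserted so the cluster may place the group freely again:
crm_resource -U -r resource_group_b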
Re: [Linux-HA] pingd/pacemaker
Michael Schwartzkopff wrote: Hi, I am testing the pingd from the provider pacemaker. As Dominik told me, there is no need to define ping nodes in the ha.cf any more. OK so far. As I see it, pingd tries to reach all ping nodes of the host_list attribute every 10 seconds. Is it possible to pass an attribute to the pingd daemon to have it send out ICMP echo requests every second or every 3 seconds? Thanks. interval ;) Take a look at the metadata for available parameters. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
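A hedged crm shell sketch of what the answer above points at: the ping interval is an instance parameter of the RA, not the interval of the monitor operation. The host list is taken from the thread; as noted elsewhere in this thread, the interval parameter only exists in sufficiently new versions of the RA.
crm configure primitive pingd ocf:pacemaker:pingd \
  params host_list="192.168.188.2 192.168.188.3" interval=3 dampen=3 multiplier=100 \
  op monitor interval=30s timeout=30s
crm configure clone cl-pingd pingd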
Re: [Linux-HA] pingd/pacemaker
Michael Schwartzkopff wrote: Am Dienstag, 31. März 2009 15:27:47 schrieb Dominik Klein: Michael Schwartzkopff wrote: Hi, I am testing the pingd from the provider pacemaker. As Dominik told me, there is no need to define ping nodes in the ha.cf any more. OK so far. As I see pingd tries to reach all pingnodes of the hostlist attribute every 10 seconds. Is it possible to pass an attribute to the pingd deamon to have it sending out ICMP echo request every second of every 3 seconds? Thanks. interval ;) Take a look at the metadata for available parameters. Regards Dominik Perhaps I am blind, but I also thought about that. Doesn't work for me. Please can anybody help me. My pingd clone is: primitive class=ocf id=clone_ping-primitive provider=pacemaker type=pingd meta_attributes id=clone_ping-primitive-meta_attributes/ operations id=clone_ping-primitive-operations op enabled=true id=clone_ping-primitive-operations-op interval=3s name=monitor on-fail=ignore requires=nothing start- delay=1m timeout=20/ That's the interval of your monitor operation. /operations instance_attributes id=clone_ping-primitive-instance_attributes nvpair id=nvpair-48efdbbc-ee3f-4382-b8ac-10200bdc6ca1 name=host_list value=192.168.188.2 192.168.188.3/ nvpair id=nvpair-edb09383-2f1c-4d1c-b671-23598109fbeb name=dampen value=3s/ nvpair id=nvpair-ce8cb00a-9406-447f-8c08-af65fae26ff4 name=multiplier value=100/ nvpair id=pingd-interval name=interval value=3/ /instance_attributes /primitive /clone When I tcpdump on the interface I see an ICMP echo request all 10 seconds on the line. When I said look at the metadata, I meant: export OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/pacemaker/pingd meta-data Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Beginner questions
Juha Heinanen wrote: Juha Heinanen writes: the real problem is that start of mysql server by pacemaker stops altogether after a few manual stops (/etc/init.d/mysql stop). i think i figured this out. when pacemaker needed to start my mysql-server resource three times on node lenny1, it migrated the group to node lenny2. when i then repeated stopping of mysql-server on lenny2, it migrated the group back to lenny1, but didn't start mysql-server, because it remembered that it had already started it there 3 times. if so, my conclusion is to forget the migration-threshold parameter. That sounds about right. You can configure a failure-timeout. That's an amount of time after which the cluster forgets about failures. Read up on failure-timeout and don't miss the section on how to ensure time-based rules take effect in the pdf documentation. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
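A hedged crm shell sketch of the combination discussed above, using the resource name from the thread (the lsb script name and all values are placeholders). Setting a cluster recheck interval is the usual way to make sure an expired failure-timeout is actually noticed, rather than only on the next natural policy engine run:
crm configure primitive mysql-server lsb:mysql \
  op monitor interval=30s timeout=30s \
  meta migration-threshold=3 failure-timeout=120s
crm configure property cluster-recheck-interval=5min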
Re: [Linux-HA] Beginner questions
Les Mikesell wrote: My first HA setup is for a squid proxy where all I need is to move an IP address to a backup server if the primary fails (and the cache can just rebuild on its own). This seems to work, but will only fail over if the machine goes down completely or the primary IP is unreachable. Is that typical or are there monitors for the service itself so failover would happen if the squid process is not running or stops accepting connections? Second question (unrelated): Can heartbeat be set up so one or two spare machines could automatically take over the IP address of any of a much larger pool of machines that might fail? Heartbeat in v1 mode (haresources configuration) cannot do any resource-level monitoring itself. You'd need to do that by some external means. If you're just starting out learning now, I'd suggest going with openais and pacemaker instead of heartbeat right away. Check out the documentation on www.clusterlabs.org/wiki/install and www.clusterlabs.org/wiki/Documentation Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
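For comparison, a hedged sketch of what resource-level monitoring for this setup could look like in pacemaker (crm shell; the IP address, the lsb:squid script name and all values are placeholders):
crm configure primitive proxy-ip ocf:heartbeat:IPaddr2 \
  params ip=192.168.1.100 \
  op monitor interval=10s timeout=20s
crm configure primitive squid lsb:squid \
  op monitor interval=30s timeout=30s
crm configure group proxy-group proxy-ip squid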
Re: [Linux-HA] Beginner questions
Juha Heinanen wrote: Dominik Klein writes: Heartbeat in v1 mode (haresources configuration) cannot do any resource level monitoring itself. You'd need to do that externally by any means. yes, in v2 mode i have managed to make pacemaker monitor resources, for example, like this: primitive test lsb:test \ op monitor interval=30s timeout=5s \ meta target-role=Started but i still have failed to find out how to make pacemaker migrate a resource group to another node if one of the resources in the group fails to start. for example, if test is the last member of group group test-group fs0 mysql-server virtual-ip test and fails to start, the group is not migrated to another node. i have tried to add primitive test lsb:test op monitor interval=30s timeout=5s meta migration-threshold=3 but it just stopped monitoring of test after 3 attempts. any ideas how to achieve migration? I read your email on the pacemaker list and from what you've shared and explained, I cannot spot a configuration issue. It should just work like that (and does work like that for me). Maybe post your entire configuration, preferably a hb_report archive. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
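Two commands that may help narrow this down while preparing the hb_report, sketched here with the resource name from the thread:
crm_mon -1 -f               # one-shot cluster status including per-resource fail counts
crm resource cleanup test   # clear the fail count and operation history for "test" before re-testing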
Re: [Linux-HA] maintenance-mode of pengine
Michael Schwartzkopff wrote: Hi, In the metadata of the pengine I found the attribute maintenance-mode. I did not find any documentation about it. The long description also says: Should the cluster Does anybody know what this option does? Thanks. It disables resource management when set to true. Like is-managed-default did in the old days, plus, iirc, it also disables all ops. But better let Andrew verify the latter. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
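A hedged crm shell sketch for toggling the property discussed above:
crm configure property maintenance-mode=true     # cluster stops managing resources
crm configure property maintenance-mode=false    # back to normal management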
Re: [Linux-HA] expected-quorum-votes
crmd metadata tells me that expected-quorum-votes are used to calculate quorum in openais based clusters. Its default value is 2. Do I have to change this value if I have 3 or more nodes in a OpenAIS based cluster? No. It is automatically adjusted by the cluster. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
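To see the value the cluster currently maintains, a hedged sketch:
cibadmin -Q | grep expected-quorum-votes    # query the live CIB and look at the property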
Re: [Linux-HA] Heartbeat degrades drbd resource
You cannot use drbd in heartbeat the way you configured it. Please refer to http://wiki.linux-ha.org/DRBD/HowTov2 and (if that wasn't made clear enough on the page) make sure the first thing you do is upgrade your cluster software. Read here on how to do that: http://clusterlabs.org/wiki/Install Regards Dominik Heiko Schellhorn wrote: Hi I installed drbd (8.0.14) together with heartbeat (2.0.8) on a Gentoo-system. I have following problem: Standalone the drbd resource works perfectly. I can mount/unmount it alternate on both nodes. Reading/writing works and /proc/drbd looks fine. But when I start heartbeat it degrades the resource step by step until it's marked as unconfigured. An excerpt of the logfile is attached. Heartbeat itself starts up and runs. Two of the three resources configured up to now are also working. Only drbd shows problems. (See the file crm_mon-out) I don't think it's a problem of communication between the nodes because drbd is working standalone and e.g. the IPaddr2 resource is also working within heartbeat. I also tried several heartbeat-configurations. First I defined the resources as single resources and then I combined the resources to a resource group. There was no difference. Has someone seen such an issue before? Any ideas ? I didn't find anything helpful in the list archive. If you need more informations I can provide a complete log and the config. Thanks Heiko ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Heartbeat degrades drbd resource
Dominik Klein wrote: You cannot use drbd in heartbeat the way you configured it. Please refer to http://wiki.linux-ha.org/DRBD/HowTov2 Sorry, copy/paste error. I meant to say http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Beginner questions
Is there some documentation available for openais? I can't even find a good description of what it does or why you would use it. Also, will this help with my 2nd question: having a few spares for a large number of servers? While my objective with the squid cache is to proxy everything through one server to maximize the cache hits, I may switch to memcached on a group of machines and would like to have a standby or 2 that could take over for any failing machine. Well, there are man-pages and the mailing list. The install page even has a configuration example. And I have found this thread to be especially helpful: https://lists.linux-foundation.org/pipermail/openais/2009-March/010894.html openais will be the future platform for pacemaker clusters providing the communication infrastructure and node failure detection. Heartbeat will for example no longer be part of the next suse enterprise linux (sles11) ha solution. It will be based on openais. So for new setups, this should be the way to go - at least imho. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] drbd RA issue in (heartbeat 2.1.4 + drbd-8.3.0)
Dejan Muhamedagic wrote: Hi, On Wed, Mar 18, 2009 at 11:37:27AM -0700, Neil Katin wrote: Dejan Muhamedagic wrote: Hi, On Tue, Mar 17, 2009 at 11:56:04AM +0530, Arun G wrote: Hi, I observed below error message when I upgraded drbd to drbd-8.3.0 in heartbeat 2.1.4 cluster on 2.6.18-8.el5xen. -- snip -- Thanks for the patch. But do all supported drbd versions have the role command? Thanks, Dejan No, only 8.3 has the change. 8.2 supports the old state argument, but prints a warning message out, and this warning message upsets the drbd OCF scripts parting of drbdadm's output. Since versions before 8.3 don't have the role command, I suppose that 8.3 actually prints the warning. drbdadm doesn't support a --version argument, but it does support a status command, which has version info in it. However, I am not sure if drbdadm status is guaranteed to not block or not, so I didn't want to have the OCF script depend on it. drbdadm | grep Version works for 8.2.7 and 8.0.14, so I guess that it is available in other versions too. So, I see three alternatives: add a new script drbdadm8.3. Add an extra parameter saying use role instead of status. Or call drbdadm status to dynamically detect our version. Do you see other choices? Do you have a preference for a particular alternative? I'm willing to code and test the patch if we can decide what we want. Let's see if we can figure out the version. Adding new RA would be a maintenance issue. Adding new parameter would make configuration depend on particular release. We could do something like this: drbdadm | grep Version | awk '{print $2}' | awk -F. ' $1 != 8 { exit 2; } This should also allow version 7. People may still use v7. The drbdadm | grep thing also works. Tested with latest v7 in a vm. It prints # drbdadm | grep Version | awk '{print $2}' 0.7.25 though. Regards Dominik $2 3 { exit 1; } # use status # otherwise use role ' rc=$? if [ $rc -eq 2 ]; then error installed (unsupported version) elif [ $rc -eq 1 ]; then cmd=status else cmd=role fi Could you please try this out. I can't test this right now. Thanks, Dejan ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
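A hedged, cleaned-up sketch of the version check being discussed; the comparison operator was lost in the archive, so "minor version below 3 means no role command" is assumed from the surrounding discussion, and the resource name is a placeholder:
ver=$(drbdadm | grep Version | awk '{print $2}')
case "$ver" in
  8.[3-9]*) cmd=role ;;    # 8.3 and later know "drbdadm role"
  8.*|0.7*) cmd=state ;;   # older 8.x and v7 still use "drbdadm state"
  *) echo "unsupported drbd version: $ver" >&2; exit 1 ;;
esac
drbdadm $cmd r0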
Re: [Linux-HA] pingnodes in openais
Michael Schwartzkopff wrote: Hi, As far as I know pingnodes have to be configured in heartbeat. heartbeat pings the nodes and updates the CIB. Where can I configure pingnodes, when I use OpenAIS as the cluster stack? Create a pingd clone resource in the CIB. It's the preferred way of running pingd anyway. Something like primitive pingd ocf:pacemaker:pingd \ params host_list="1.2.3.4 5.6.7.8" interval=5 dampen=10 \ op monitor interval=30 timeout=30 clone cl-pingd pingd Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
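The usual companion to such a clone is a location rule on the attribute pingd maintains, so that a resource stays off nodes without connectivity. A hedged sketch, assuming the default attribute name pingd and a placeholder resource name:
crm configure location ip-needs-connectivity VIP \
  rule -inf: not_defined pingd or pingd lte 0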
Re: [Linux-HA] Having issues with getting DRBD to work with Pacemaker
Hi Jerome Yanga wrote: Dominik, As usual, you are right on the money. I should have caught that myself. Thank you for catching that for me. What happened was that I used a different server to compile DRBD and I had assumed that Nomen and Rubic (my test nodes) were on the same kernel. Moreover, I had also combined Neil's suggestion to yours as he had mentioned that pacemaker-1.0.1 and drbd-8.2 works. My current issues are as follows: 1) I cannot migrate the resource fs0 from Nomen to Rubric. Running the command crm resource migrate fs0 just puts fs0 to offline state. This sounds like a config change. NOTE: I am planning to add fs0 into a Group that will be able to migrate between the two nodes (Nomen and Rubric). Help. Please provide the crm(live) syntax as I have tried the ones below and crm complains that the syntax is wrong. order ms-drbd0-before-fs0 mandatory: ms-drbd0:promote fs0:start colocation fs0-on-ms-drbd0 inf: fs0 ms-drbd0:Master You need 1.0.2 for that. 1.0.1 packages' crm shell had a bug there. 2) Is there a documentation for what resources, constraints and the like I can add into the cib.xml via crm(live)? Moreover, their syntax to add them via crm(live)? http://clusterlabs.org/wiki/Documentation --snip-- cib.xml: cib admin_epoch=0 validate-with=pacemaker-1.0 crm_feature_set=3.0 have-quorum=1 epoch=153 num_updates=0 cib-last-written=Fri Mar 6 12:52:27 2009 dc-uuid=3a8b681c-a14b-4037-a8e6-2d4af2eff88e configuration crm_config cluster_property_set id=cib-bootstrap-options nvpair id=cib-bootstrap-options-dc-version name=dc-version value=1.0.1-node: 6fc5ce8302abf145a02891ec41e5a492efbe8efe/ nvpair id=cib-bootstrap-options-last-lrm-refresh name=last-lrm-refresh value=1236213117/ /cluster_property_set /crm_config nodes node id=3a8b681c-a14b-4037-a8e6-2d4af2eff88e uname=nomen.esri.com type=normal/ node id=a5e95310-f27d-418e-9cb9-42e50310f702 uname=rubric.esri.com type=normal/ /nodes resources master id=ms-drbd0 meta_attributes id=ms-drbd0-meta_attributes nvpair id=ms-drbd0-meta_attributes-clone-max name=clone-max value=2/ nvpair id=ms-drbd0-meta_attributes-notify name=notify value=true/ nvpair id=ms-drbd0-meta_attributes-globally-unique name=globally-unique value=false/ nvpair id=ms-drbd0-meta_attributes-target-role name=target-role value=Started/ /meta_attributes primitive class=ocf id=drbd0 provider=heartbeat type=drbd instance_attributes id=drbd0-instance_attributes nvpair id=drbd0-instance_attributes-drbd_resource name=drbd_resource value=r0/ /instance_attributes operations id=drbd0-ops op id=drbd0-monitor-59s interval=59s name=monitor role=Master timeout=30s/ op id=drbd0-monitor-60s interval=60s name=monitor role=Slave timeout=30s/ /operations /primitive /master primitive class=ocf id=VIP provider=heartbeat type=IPaddr instance_attributes id=VIP-instance_attributes nvpair id=VIP-instance_attributes-ip name=ip value=10.50.26.250/ /instance_attributes operations id=VIP-ops op id=VIP-monitor-5s interval=5s name=monitor timeout=5s/ /operations /primitive primitive class=ocf id=fs0 provider=heartbeat type=Filesystem instance_attributes id=fs0-instance_attributes nvpair id=fs0-instance_attributes-fstype name=fstype value=ext3/ nvpair id=fs0-instance_attributes-directory name=directory value=/data/ nvpair id=fs0-instance_attributes-device name=device value=/dev/drbd0/ /instance_attributes /primitive /resources constraints/ You don't have any constraints, so migrate fs0 will fail and not take drbd into account. 
/configuration /cib messages: == Mar 6 12:56:07 nomen lrmd: [14509]: info: Resource Agent output: [] Mar 6 12:56:08 nomen crm_shadow: [1551]: info: Invoked: crm_shadow Mar 6 12:56:08 nomen crm_shadow: [1565]: info: Invoked: crm_shadow Mar 6 12:56:08 nomen crm_resource: [1566]: info: Invoked: crm_resource -M -r fs0 Mar 6 12:56:09 nomen cib: [14508]: info: cib_process_request: Operation complete: op cib_delete for section constraints (origin=local/crm_resource/3): ok (rc=0) Mar 6 12:56:09 nomen haclient: on_event:evt:cib_changed Mar 6 12:56:09 nomen crmd: [14603]: info: abort_transition_graph: need_abort:60 - Triggered transition abort (complete=1) : Non-status change Mar 6 12:56:09 nomen crmd: [14603]: info: need_abort: Aborting on change to epoch Mar 6 12:56:09 nomen crmd: [14603]: info: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Mar 6 12:56:09 nomen
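For reference, a hedged crm shell sketch of the constraints that are missing above, with the names used in this thread; as noted, the 1.0.2 shell is needed for this syntax to be accepted:
crm configure group DRBD_Group fs0 VIP
crm configure colocation DRBD_Group-on-ms-drbd0 inf: DRBD_Group ms-drbd0:Master
crm configure order ms-drbd0-before-DRBD_Group inf: ms-drbd0:promote DRBD_Group:start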
Re: [Linux-HA] Having issues with getting DRBD to work with Pacemaker
Hi Jerome Yanga wrote: Hi! I am having issues with getting DRBD to work with Pacemaker. I can get Pacemaker and DRBD run individually but not DRBD managed by Pacemaker. I tried following the instruction in the site below but the resources will not go online. http://clusterlabs.org/wiki/DRBD_HowTo_1.0 Below is my configuration. Installed applications: === kernel-2.6.18-128.el5 copy that drbd-8.3.0-3 heartbeat-2.99.2-6.1 pacemaker-1.0.1-3.1 drbd.conf: == global { usage-count no; } resource r0 { protocol C; handlers { pri-on-incon-degr echo o /proc/sysrq-trigger ; halt -f; pri-lost-after-sb echo o /proc/sysrq-trigger ; halt -f; local-io-error echo o /proc/sysrq-trigger ; halt -f; outdate-peer /usr/lib/heartbeat/drbd-peer-outdater -t 5; pri-lost echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root; out-of-sync /usr/lib/drbd/notify-out-of-sync.sh root; } startup { wfc-timeout 0; } disk { on-io-error pass_on; } net { max-buffers 2048; after-sb-0pri disconnect; after-sb-1pri disconnect; after-sb-2pri disconnect; rr-conflict disconnect; } syncer { rate 100M; al-extents 257; } on nomen.esri.com { device /dev/drbd0; disk /dev/sda5; address192.168.0.1:7789; meta-disk internal; } on rubric.esri.com { device/dev/drbd0; disk /dev/sda5; address 192.168.0.2:7789; meta-disk internal; } } Cib.xml: cib admin_epoch=0 validate-with=pacemaker-1.0 crm_feature_set=3.0 have-quorum=1 dc-uuid=a5 e95310-f27d-418e-9cb9-42e50310f702 epoch=56 num_updates=0 cib-last-written=Wed Mar 4 14:27:59 2009 configuration crm_config cluster_property_set id=cib-bootstrap-options nvpair id=cib-bootstrap-options-dc-version name=dc-version value=1.0.1-node: 6fc5ce830 2abf145a02891ec41e5a492efbe8efe/ /cluster_property_set /crm_config nodes node id=3a8b681c-a14b-4037-a8e6-2d4af2eff88e uname=nomen.esri.com type=normal/ node id=a5e95310-f27d-418e-9cb9-42e50310f702 uname=rubric.esri.com type=normal/ /nodes resources master id=ms-drbd0 meta_attributes id=ms-drbd0-meta_attributes nvpair id=ms-drbd0-meta_attributes-clone-max name=clone-max value=2/ nvpair id=ms-drbd0-meta_attributes-notify name=notify value=true/ nvpair id=ms-drbd0-meta_attributes-globally-unique name=globally-unique value=false / nvpair name=target-role id=ms-drbd0-meta_attributes-target-role value=Started/ /meta_attributes primitive class=ocf id=drbd0 provider=heartbeat type=drbd instance_attributes id=drbd0-instance_attributes nvpair id=drbd0-instance_attributes-drbd_resource name=drbd_resource value=r0/ /instance_attributes operations id=drbd0-ops op id=drbd0-monitor-59s interval=59s name=monitor role=Master timeout=30s/ op id=drbd0-monitor-60s interval=60s name=monitor role=Slave timeout=30s/ /operations /primitive /master /resources constraints/ /configuration /cib /var/log/messages: == Mar 4 14:27:58 nomen crm_resource: [30167]: info: Invoked: crm_resource --meta -r ms-drbd0 -p target-role -v Started Mar 4 14:27:58 nomen cib: [29899]: info: cib_process_xpath: Processing cib_query op for //cib/configuration/resources//*...@id=ms-drbd0]//meta_attributes//nvpa...@name=target-role] (/cib/configuration/resources/master/meta_attributes/nvpair[4]) Mar 4 14:27:59 nomen crmd: [29903]: info: do_lrm_rsc_op: Performing key=5:5:0:d4b86e31-ca4a-4033-8437-6486622eb19f op=drbd0:0_start_0 ) Mar 4 14:27:59 nomen haclient: on_event:evt:cib_changed Mar 4 14:27:59 nomen lrmd: [29900]: info: rsc:drbd0:0: start Mar 4 14:27:59 nomen cib: [30168]: info: write_cib_contents: Wrote version 0.56.0 of the CIB to disk (digest: 2365d9802f1b9c55e0ed87b8ebda5db3) Mar 4 14:27:59 nomen 
cib: [30168]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig) Mar 4 14:27:59 nomen cib: [29899]: info: Managed write_cib_contents process 30168 exited with return code 0. Mar 4 14:27:59 nomen modprobe: FATAL: Module drbd not found. Mar 4 14:27:59 nomen lrmd: [29900]: info: RA output: (drbd0:0:start:stdout) Mar 4 14:27:59 nomen mgmtd: [29904]: info: CIB query: cib Mar 4 14:27:59 nomen lrmd: [29900]: info: RA output: (drbd0:0:start:stdout) Could not stat(/proc/drbd): No such file or directory do you need to load the module? try: modprobe drbd Command 'drbdsetup /dev/drbd0 disk /dev/sda5 /dev/sda5 internal --set-defaults --create-device --on-io-error=pass_on' terminated with exit code 20 drbdadm attach
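The "Module drbd not found" line above is the key symptom; a hedged sketch for checking whether the module was built for the kernel the node actually runs:
uname -r                           # kernel the node is running
modinfo drbd                       # fails if no drbd module was built/installed for that kernel
modprobe drbd && cat /proc/drbd    # once a matching module is installed, this should succeed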
[Linux-HA] showscores.sh for pacemaker 1.0.2
Hi I made the necessary changes to the showscores script to work with pacemaker 1.0.2. Please test and report problems. Has been reported to work by some people and should go into the repository soon. Still, I'd like more people to test and confirm. Important changes: * correctly fetch stickiness and migration-threshold for complex resources (master and clone) * adjust column-width according to the length of resources' and nodes' names Regards Dominik showscores.sh Description: Bourne shell script ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] showscores for pacamaker-1.0
showscores gives me: ~# ./showscores.sh Resource Score Node Stickiness #Fail Fail-Stickiness 50 0 on 50 on 50 on 50 on 50 on 50 on 50 on 50 on 50 not the result I expected originally. @Dominik: Any chance to fix the script? ptest -s That is the kind of program I was looking for. Is there any explanation of how group_color and native_color are defined? Is there any update to the linux-ha.org/ScoreCalculation page for pacemaker >= 1.0? Thanks for enlightening answers. I have a version that deals with most of the new things. I will post it here soon. If you want to test now, send me a private email. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] HA debug message
Tears ! wrote: Dear members! I have first time Install heartbeat on Slackware 12.2. I have enable debugging in ha.cf Here is the some debug message i want to describe here. Feb 14 23:01:15 haServer1 heartbeat: [15131]: WARN: Core dumps could be lost if multiple dumps occur. Feb 14 23:01:15 haServer1 heartbeat: [15131]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability Feb 14 23:01:15 haServer1 heartbeat: [15131]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability What about Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability do you not understand? Now i just want to ask you are these message are realy serious? if yes then what should i do? They're not serious in a meaning you have done anything wrong or something is not working correctly. It's just a suggestion. Set those parameters and the developers might be able to debug your problems easier in case you hit coredumps. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD in a 2 node cluster
Did you look into the returncodes and eventually tell linbit about it? That would be a big issue. Regards Dominik jayfitzpatr...@gmail.com wrote: (Big Cheers and celebrations from this end!!!) Finally figured out what the problem was, it seems that the kernel oops were being caused by the 8.3 version of DRBD, once downgraded to 8.2.7 everything started to work as it should. primary / secondary automatic fail over is in place and resources are now following the DRBD master! Thanks a mill for all the help. Jason On Feb 12, 2009 8:48am, Jason Fitzpatrick jayfitzpatr...@gmail.com wrote: Hi Dominik thanks again for the feedback, I had noticed some kernel opps's since the last kernel update that i and they seem to be pointing to DRBD, i will downgrade the kernel again and see if this improves things, re Stonith I Uninstalled as part of the move from heartbeat v2.1 to 2.9 but must have missed this bit. user land and kernel module all report the same version. I am on my way into the office now and I will apply the changes once there thanks again Jason 2009/2/12 Dominik Klein d...@in-telegence.net Right, this one looks better. I'll refer to nodes as 1001 and 1002. 1002 is your DC. You have stonith enabled, but no stonith devices. Disable stonith or get and configure a stonith device (_please_ dont use ssh). 1002 ha-log lines 926:939, node 1002 wants to shoot 1001, but cannot (l 978). Retries in l 1018 and fails again in l 1035. Then, the cluster tries to start drbd on 1001 in l 1079, followed by a bunch of kernel messages I don't understand (pretty sure _this_ is the first problem you should address!), ending up in the drbd RA not able to see the secondary state (1449) and considering the start failed. The RA code for this is if do_drbdadm up $RESOURCE ; then drbd_get_status if [ $DRBD_STATE_LOCAL != Secondary ]; then ocf_log err $RESOURCE start: not in Secondary mode after start. return $OCF_ERR_GENERIC fi ocf_log debug $RESOURCE start: succeeded. return $OCF_SUCCESS else ocf_log err $RESOURCE: Failed to start up. return $OCF_ERR_GENERIC fi The cluster then successfully stops drbd again (l 1508-1511) and tries to start the other clone instance (l 1523). Log says RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on' terminated with exit code 10 Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in Secondary mode after start. So this is interesting. Although stop (basically drbdadm down) succeeded, the drbd device is still attached! Please try: stop the cluster drbdadm up $resource drbdadm up $resource #again echo $? drbdadm down $resource echo $? cat /proc/drbd Btw: Does your userland match your kernel module version? To bring this to an end: The start of the second clone instance also failed, so both instances are unrunnable on the node and no further start is tried on 1002. Interestingly, then (could not see any attempt before), the cluster wants to start drbd on node 1001, but it also fails and also gives those kernel messages. In l 2001, each instance has a failed start on each node. So: Find out about those kernel messages. Can't help much on that unfortunately, but there were some threads about things like that on drbd-user recently. Maybe you can find answers to that problem there. And also: please verify returncodes of drbdadm in your case. Maybe that's a drbd tools bug? 
(can't say for sure, for me, up on an alreay up resource gives 1, which is ok). Regards Dominik Jason Fitzpatrick wrote: it seems that I had the incorrect version of openais installed (from the fedora repo vs the HA one) I have corrected and the hb_report ran correctly using the following hb_report -u root -f 3pm /tmp/report Please see attached Thanks again Jason ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Is it possible to cleanly take down a resource in a v1 config?
Hi. Heartbeat in v1 mode does not do resource monitoring by itself. So if you did not set up any custom resource monitoring, you can just stop your application in whatever way you normally do that and re-start it whenever you like. v1 clusters will not notice. They only see node state changes. Regards Dominik Malcolm Turnbull wrote: With a two node linux-ha cluster you can add an ip address to haresources and then do a hb_takeover, and it will bring the interface up cleanly. But is there a way of taking down a resource cleanly? (just removing it from haresources and re-booting not being a good answer) or do you need to do a manual ifconfig eth0:x down... and if so how do you evaluate which x it is? (awk I guess?) Thanks. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
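For the "which eth0:x is it" part of the question, a hedged sketch (v1 IPaddr resources show up as interface aliases; the alias number is whatever the RA picked):
ip addr show dev eth0                     # lists all addresses and labels on eth0
ifconfig | awk '/^eth0:/ {print $1}'      # just the alias names, e.g. eth0:0
ifconfig eth0:0 down                      # take a specific alias down by hand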
[Linux-ha-dev] Patch: RA anything
Hi I fixed most of the things Lars mentioned in http://hg.linux-ha.org/dev/rev/15bcf3491f9c and will explain why I did not fix some of them. ocf-tester runs fine with the RA. # FIXME: This should use pidofproc pidofproc is not available everywhere and is not able to get down to command line options, eg could not tell the difference between $process $option_a and $process $option_b which I wanted to support with this agent. Example: dktest3:~/src/linuxha/hg/dev # sleep 200 [1] 5799 dktest3:~/src/linuxha/hg/dev # sleep 300 [2] 5801 dktest3:~/src/linuxha/hg/dev # pidofproc sleep 5801 5799 dktest3:~/src/linuxha/hg/dev # pidofproc sleep 300 # FIXME: use start_daemon start_daemon is not available everywhere either. # FIXME: What about daemons which can manage their own pidfiles? This agent is meant to be used for programs that are not actually daemons by design. It is meant to be able to run sth stupid in the cluster. Even like /bin/sleep 1000 # FIXME: use killproc This is also a problem with $process $option_a and $process $option_b. You can't just killproc $process then. # FIXME: Attributes special meaning to the resource id I tried to, but couldn't understand what you meant here. I also talked to Dejan on IRC and we agreed that anything is a bad name for the RA and the changeset description was propably bad, too. This RA is not for (as the cs stated) arbitrary daemons, it is more for daemonizing programs which were not meant to be daemons. If a proper name comes to anyone's mind - please share. Hopefully, now it is a bit clearer what I wanted to be able to do with this RA. I agree the cmd= lines and pid file creation are very very ugly, but I could not yet find a better way. Not that much of a shell genius I guess :( Please share if you can improve things. Regards Dominik exporting patch: # HG changeset patch # User Dominik Klein d...@in-telegence.net # Date 1234350091 -3600 # Node ID 04533b37813c8be009814f52de7b14ff65bf9862 # Parent 90ff997faa7288248ac57583b0c03df4c8e41bda RA: anything. Implement most of lmbs suggestions. diff -r 90ff997faa72 -r 04533b37813c resources/OCF/anything --- a/resources/OCF/anything Wed Feb 11 11:31:02 2009 +0100 +++ b/resources/OCF/anything Wed Feb 11 12:01:31 2009 +0100 @@ -32,6 +32,7 @@ # OCF_RESKEY_errlogfile # OCF_RESKEY_user # OCF_RESKEY_monitor_hook +# OCF_RESKEY_stop_timeout # # This RA starts $binfile with $cmdline_options as $user and writes a $pidfile from that. # If you want it to, it logs: @@ -47,18 +48,20 @@ # Initialization: . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs -getpid() { # make sure that the file contains a number - # FIXME: pidfiles could contain spaces - grep '^[0-9][0-9]*$' $1 +getpid() { +grep -o '[0-9]*' $1 } anything_status() { - # FIXME: This should use pidofproc - # FIXME: pidfile w/o process means the process died, so should - # be ERR_GENERIC - if test -f $pidfile pid=`getpid $pidfile` kill -0 $pid + if test -f $pidfile then - return $OCF_RUNNING + if pid=`getpid $pidfile` kill -0 $pid + then + return $OCF_RUNNING + else + # pidfile w/o process means the process died + return $OCF_ERR_GENERIC + fi else return $OCF_NOT_RUNNING fi @@ -66,8 +69,6 @@ anything_start() { if ! anything_status - # FIXME: use start_daemon - # FIXME: What about daemons which can manage their own pidfiles? 
then if [ -n $logfile -a -n $errlogfile ] then @@ -101,29 +102,48 @@ } anything_stop() { - # FIXME: use killproc +if [ -n $OCF_RESKEY_stop_timeout ] +then +stop_timeout=$OCF_RESKEY_stop_timeout +elif [ -n $OCF_RESKEY_CRM_meta_timeout ]; then +# Allow 2/3 of the action timeout for the orderly shutdown +# (The origin unit is ms, hence the conversion) +stop_timeout=$((OCF_RESKEY_CRM_meta_timeout/1500)) +else +stop_timeout=10 +fi if anything_status then - pid=`getpid $pidfile` - kill $pid - i=0 - # FIXME: escalate to kill -9 before timeout - while sleep 1 - do - if ! anything_status - then -rm -f $pidfile /dev/null 21 -return $OCF_SUCCESS - fi - let i++ - done +pid=`getpid $pidfile` +kill $pid +rm -f $pidfile +i=0 +while [ $i -lt $stop_timeout ] +do +while sleep 1 +do +if ! anything_status +then +return $OCF_SUCCESS +fi +let i++ +done +done +ocf_log warn Stop with SIGTERM failed/timed out, now sending SIGKILL. +kill -9 $pid +if ! anything_status
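Since the mail archive mangled and truncated parts of the diff, here is a hedged, cleaned-up sketch of the stop escalation the patch implements. It flattens the doubly nested sleep loop of the patch into a single loop and omits the CRM_meta_timeout fallback for brevity; helper functions and variables are the ones defined in the RA, and this is not the committed code:
anything_stop() {
    stop_timeout=${OCF_RESKEY_stop_timeout:-10}
    if anything_status; then
        pid=$(getpid "$pidfile")
        kill "$pid"                      # polite SIGTERM first
        i=0
        while [ "$i" -lt "$stop_timeout" ]; do
            if ! anything_status; then
                rm -f "$pidfile"
                return $OCF_SUCCESS
            fi
            sleep 1
            i=$((i + 1))
        done
        ocf_log warn "Stop with SIGTERM failed/timed out, now sending SIGKILL."
        kill -9 "$pid"
        if ! anything_status; then
            rm -f "$pidfile"
            return $OCF_SUCCESS
        fi
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}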
Re: [Linux-HA] failed dependencies while installing heartbeat 2.99.2-6.1
Gerd König wrote: Hi Dominik, thanks for answering quickly, but there were no dependencies found: # zypper search openipmi * Reading installed packages [100%] No possible dependencies found. Do I need some additional software repositories? I don't think so. The packages should be available in your distribution. Since you're using openSUSE, you can also try YaST software management and see whether you can find the packages. They should come on the DVD I think. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Failovercluster considered one node down but state transition did not happen succesfully
Zemke, Kai wrote: Hi, I'm running a two node failover cluster. Yesterday the cluster tried to manage a state transition. In the log files I found the following entries: heartbeat[6905]: 2009/02/10_21:45:55 WARN: node nagios-drbd2: is dead heartbeat[6905]: 2009/02/10_21:45:55 info: Link nagios-drbd2:eth1 dead. A few minutes later the node that was still alive tried to take over the resources and created the following entries in the log file ( the resource ipaddress is an example, there are a lot more entries for the other resources that were running on the cluster ): pengine[7370]: 2009/02/10_21:45:59 WARN: custom_action: Action resource_nagios_ipaddress_stop_0 on nagios-drbd2 is unrunnable (offline) pengine[7370]: 2009/02/10_21:45:59 WARN: custom_action: Marking node nagios-drbd2 unclean Further more there a several entries telling: stonithd[6916]: 2009/02/10_21:46:30 ERROR: Failed to STONITH the node nagios-drbd2: optype=RESET, op_result=TIMEOUT The stonith is running via ssh on a direct link between the to nodes. Since Node2 was down the shutdown command never reached its destination. Which is why ssh stonith is not meant for production. My Questions are: Why did the alive cluster try to stop resources on a cluster node that is considered as dead? Why did STONITH try to shut down a node that is considered down? ( for safety reasons I think ) It is considered dead, but that does not have to be a fact. By shooting it, the cluster makes the assumption a fact (turn it off or reboot it). Shouldn't the resources just be started on the alive node without any further action? Not until the cluster knows the other node is dead. Who knows what's going on there if it cannot be communicated with. Did I miss something in the default behaviour of heartbeat? Maybe a timeout? Would a hardware STONITH device solve such problems in the future? Yes. Regards Dominik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD in a 2 node cluster
Hi Jason Jason Fitzpatrick wrote: I have disabled the services and run drbdadm secondary all drbdadm detach all drbdadm down all service drbd stop before testing as far as I can see (cat /proc/drbd on both nodes) drbd is shutdown cat: /proc/drbd: No such file or directory Good. I have taken the command that heartbeat is running (drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on') The RA actually runs drbdadm up, which translates into this. and run it against the nodes when heartbeat is not in control and this command will bring the resources online, but re-running this command will generate the error, so I am kind of leaning twords the command being run twice? Never seen the cluster do that. Please post your configuration and logs. hb_report should gather everything needed and put it into a nice .bz2 archive :) Regards Dominik Thanks Jason 2009/2/11 Dominik Klein d...@in-telegence.net Hi Jason any chance you started drbd at boot or the drbd device was active at the time you started the cluster resource? If so, read the introduction of the howto again and correct your setup. Jason Fitzpatrick wrote: Hi Dominik I have upgraded to HB 2.9xx and have been following the instructions that you provided (thanks for those) and have added a resource as follows crm configure primitive Storage1 ocf:heartbeat:drbd \ params drbd_resource=Storage1 \ op monitor role=Master interval=59s timeout=30s \ op monitor role=Slave interval=60s timeout=30s ms DRBD_Storage Storage1 \ meta clone-max=2 notify=true globally-unique=false target-role=stopped commit exit no errors are reported and the resource is visable from within the hb_gui when I try to bring the resource online with crm resource start DRBD_Storage I see the resource attempt to come online and then fail, it seems to be starting the services, changing the status of the devices to attached (from detached) but not setting any device to master the following is from the ha-log crmd[8020]: 2009/02/10_17:22:32 info: do_lrm_rsc_op: Performing key=7:166:0:b57f7f7c-4e2d-4134-9c14-b1a2b7db11a7 op=Storage1:1_start_0 ) lrmd[8016]: 2009/02/10_17:22:32 info: rsc:Storage1:1: start lrmd[8016]: 2009/02/10_17:22:32 info: RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on' terminated with exit code 10 This looks like drbdadm up is failing because the device is already attached to the lower level storage device. Regards Dominik drbd[22270]:2009/02/10_17:22:32 ERROR: Storage1 start: not in Secondary mode after start. crmd[8020]: 2009/02/10_17:22:32 info: process_lrm_event: LRM operation Storage1:1_start_0 (call=189, rc=1, cib-update=380, confirmed=true) complete unknown e rror . I have checked the DRBD device Storage1 and it is in secondary mode after the start, and should I choose I can make it primary on either node Thanks Jason 2009/2/10 Jason Fitzpatrick jayfitzpatr...@gmail.com Thanks, This was the latest version in the Fedora Repos, I will upgrade and see what happens Jason 2009/2/10 Dominik Klein d...@in-telegence.net Jason Fitzpatrick wrote: Hi All I am having a hell of a time trying to get heartbeat to fail over my DRBD harddisk and am hoping for some help. 
I have a 2 node cluster, heartbeat is working as I am able to fail over IP Addresses and services successfully, but when I try to fail over my DRBD resource from secondary to primary I am hitting a brick wall, I can fail over the DRBD resource manually so I know that it does work at some level DRBD version 8.3 Heartbeat version heartbeat-2.1.3-1.fc9.i386 Please upgrade. Thats too old for reliable master/slave behaviour. Preferrably upgrade to pacemaker and ais or heartbeat 2.99. Read http://www.clusterlabs.org/wiki/Install for install notes. and using heartbeat-gui to configure Don't use the gui to configure complex (ie clone or master/slave) resources. Once you upgraded to the latest pacemaker, please refer to http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 for drbd's cluster configuration. Regards Dominik DRBD Resource is called Storage1, the 2 nodes are connected via 2 x-over cables (1 heartbeat 1 Replication) I have stripped down my config to the bare bones and tried every option that I can think off but know that I am missing something simple, I have attached my cib.xml but have removed domain names from the systems for privacy reasons Thanks in advance Jason cib admin_epoch=0 have_quorum=true ignore_dtd=false cib_feature_revision=2.0 num_peers=2 generated=true ccm_transition=22 dc_uuid=9d8abc28-4fa3-408a-a695-fb36b0d67a48 epoch=733 num_updates=1 cib-last-written=Mon Feb 9 18:31:19 2009 configuration crm_config
Re: [Linux-HA] DRBD in a 2 node cluster
The archive only contains info for one node and the logfile is empty. Did you use an appropriate -f time, and does ssh work between the nodes? So far, nothing obvious to me except for the order between your FS and DRBD lacking the role definition, but that's not what your problem is about (yet *g*). Regards Dominik Jason Fitzpatrick wrote: Hi Dominik Thanks for the follow-up, please find the file attached Jason ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
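For when the DRBD problem itself is sorted out, a hedged crm shell sketch of an order/colocation pair that includes the role, which the constraint mentioned above was lacking (the filesystem resource name is a placeholder; DRBD_Storage is the ms resource from this thread):
crm configure colocation fs-on-drbd-master inf: MyFS DRBD_Storage:Master
crm configure order drbd-before-fs inf: DRBD_Storage:promote MyFS:start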
Re: [Linux-HA] DRBD in a 2 node cluster
Right, this one looks better. I'll refer to nodes as 1001 and 1002. 1002 is your DC. You have stonith enabled, but no stonith devices. Disable stonith or get and configure a stonith device (_please_ dont use ssh). 1002 ha-log lines 926:939, node 1002 wants to shoot 1001, but cannot (l 978). Retries in l 1018 and fails again in l 1035. Then, the cluster tries to start drbd on 1001 in l 1079, followed by a bunch of kernel messages I don't understand (pretty sure _this_ is the first problem you should address!), ending up in the drbd RA not able to see the secondary state (1449) and considering the start failed. The RA code for this is if do_drbdadm up $RESOURCE ; then drbd_get_status if [ $DRBD_STATE_LOCAL != Secondary ]; then ocf_log err $RESOURCE start: not in Secondary mode after start. return $OCF_ERR_GENERIC fi ocf_log debug $RESOURCE start: succeeded. return $OCF_SUCCESS else ocf_log err $RESOURCE: Failed to start up. return $OCF_ERR_GENERIC fi The cluster then successfully stops drbd again (l 1508-1511) and tries to start the other clone instance (l 1523). Log says RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on' terminated with exit code 10 Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in Secondary mode after start. So this is interesting. Although stop (basically drbdadm down) succeeded, the drbd device is still attached! Please try: stop the cluster drbdadm up $resource drbdadm up $resource #again echo $? drbdadm down $resource echo $? cat /proc/drbd Btw: Does your userland match your kernel module version? To bring this to an end: The start of the second clone instance also failed, so both instances are unrunnable on the node and no further start is tried on 1002. Interestingly, then (could not see any attempt before), the cluster wants to start drbd on node 1001, but it also fails and also gives those kernel messages. In l 2001, each instance has a failed start on each node. So: Find out about those kernel messages. Can't help much on that unfortunately, but there were some threads about things like that on drbd-user recently. Maybe you can find answers to that problem there. And also: please verify returncodes of drbdadm in your case. Maybe that's a drbd tools bug? (can't say for sure, for me, up on an alreay up resource gives 1, which is ok). Regards Dominik Jason Fitzpatrick wrote: it seems that I had the incorrect version of openais installed (from the fedora repo vs the HA one) I have corrected and the hb_report ran correctly using the following hb_report -u root -f 3pm /tmp/report Please see attached Thanks again Jason ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
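The suggested test sequence, laid out command by command (a hedged sketch; Storage1 is the resource name from this thread and the heartbeat init script name may differ on your distribution):
service heartbeat stop    # stop the cluster first so it does not interfere
drbdadm up Storage1
drbdadm up Storage1       # again, on an already-up resource
echo $?                   # should be non-zero (1 is what is reported above as ok)
drbdadm down Storage1
echo $?
cat /proc/drbd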
Re: [Linux-HA] DRBD in a 2 node cluster
Dominik Klein wrote: Right, this one looks better. I'll refer to nodes as 1001 and 1002. 1002 is your DC. You have stonith enabled, but no stonith devices. Disable stonith or get and configure a stonith device (_please_ dont use ssh). 1002 ha-log lines 926:939, node 1002 wants to shoot 1001, but cannot (l 978). Retries in l 1018 and fails again in l 1035. Then, the cluster tries to start drbd on 1001 in l 1079, s/1001/1002 sorry ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD in a 2 node cluster
Jason Fitzpatrick wrote: Hi All I am having a hell of a time trying to get heartbeat to fail over my DRBD harddisk and am hoping for some help. I have a 2 node cluster, heartbeat is working as I am able to fail over IP Addresses and services successfully, but when I try to fail over my DRBD resource from secondary to primary I am hitting a brick wall, I can fail over the DRBD resource manually so I know that it does work at some level DRBD version 8.3 Heartbeat version heartbeat-2.1.3-1.fc9.i386 Please upgrade. Thats too old for reliable master/slave behaviour. Preferrably upgrade to pacemaker and ais or heartbeat 2.99. Read http://www.clusterlabs.org/wiki/Install for install notes. and using heartbeat-gui to configure Don't use the gui to configure complex (ie clone or master/slave) resources. Once you upgraded to the latest pacemaker, please refer to http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 for drbd's cluster configuration. Regards Dominik DRBD Resource is called Storage1, the 2 nodes are connected via 2 x-over cables (1 heartbeat 1 Replication) I have stripped down my config to the bare bones and tried every option that I can think off but know that I am missing something simple, I have attached my cib.xml but have removed domain names from the systems for privacy reasons Thanks in advance Jason cib admin_epoch=0 have_quorum=true ignore_dtd=false cib_feature_revision=2.0 num_peers=2 generated=true ccm_transition=22 dc_uuid=9d8abc28-4fa3-408a-a695-fb36b0d67a48 epoch=733 num_updates=1 cib-last-written=Mon Feb 9 18:31:19 2009 configuration crm_config cluster_property_set id=cib-bootstrap-options attributes nvpair id=cib-bootstrap-options-dc-version name=dc-version value=2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3/ nvpair name=last-lrm-refresh id=cib-bootstrap-options-last-lrm-refresh value=1234204278/ /attributes /cluster_property_set /crm_config nodes node id=df707752-d5fb-405a-8ca7-049e25a227b7 uname=lpissan1001 type=normal instance_attributes id=nodes-df707752-d5fb-405a-8ca7-049e25a227b7 attributes nvpair id=standby-df707752-d5fb-405a-8ca7-049e25a227b7 name=standby value=off/ /attributes /instance_attributes /node node id=9d8abc28-4fa3-408a-a695-fb36b0d67a48 uname=lpissan1002 type=normal instance_attributes id=nodes-9d8abc28-4fa3-408a-a695-fb36b0d67a48 attributes nvpair id=standby-9d8abc28-4fa3-408a-a695-fb36b0d67a48 name=standby value=off/ /attributes /instance_attributes /node /nodes resources master_slave id=Storage1 meta_attributes id=Storage1_meta_attrs attributes nvpair id=Storage1_metaattr_target_role name=target_role value=started/ nvpair id=Storage1_metaattr_clone_max name=clone_max value=2/ nvpair id=Storage1_metaattr_clone_node_max name=clone_node_max value=1/ nvpair id=Storage1_metaattr_master_max name=master_max value=1/ nvpair id=Storage1_metaattr_master_node_max name=master_node_max value=1/ nvpair id=Storage1_metaattr_notify name=notify value=true/ nvpair id=Storage1_metaattr_globally_unique name=globally_unique value=false/ /attributes /meta_attributes primitive id=Storage1 class=ocf type=drbd provider=heartbeat instance_attributes id=Storage1_instance_attrs attributes nvpair id=273a1bb2-4867-42dd-a9e5-7cebbf48ef3b name=drbd_resource value=Storage1/ /attributes /instance_attributes operations op id=9ddc0ce9-4090-4546-a7d5-787fe47de872 name=monitor description=master interval=29 timeout=10 start_delay=1m role=Master/ op id=56a7508f-fa42-46f8-9924-3b284cdb97f0 name=monitor description=slave interval=29 timeout=10 start_delay=1m role=Slave/ /operations 
/primitive /master_slave /resources constraints/ /configuration /cib ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD in a 2 node cluster
Hi Jason any chance you started drbd at boot or the drbd device was active at the time you started the cluster resource? If so, read the introduction of the howto again and correct your setup. Jason Fitzpatrick wrote: Hi Dominik I have upgraded to HB 2.9xx and have been following the instructions that you provided (thanks for those) and have added a resource as follows crm configure primitive Storage1 ocf:heartbeat:drbd \ params drbd_resource=Storage1 \ op monitor role=Master interval=59s timeout=30s \ op monitor role=Slave interval=60s timeout=30s ms DRBD_Storage Storage1 \ meta clone-max=2 notify=true globally-unique=false target-role=stopped commit exit no errors are reported and the resource is visable from within the hb_gui when I try to bring the resource online with crm resource start DRBD_Storage I see the resource attempt to come online and then fail, it seems to be starting the services, changing the status of the devices to attached (from detached) but not setting any device to master the following is from the ha-log crmd[8020]: 2009/02/10_17:22:32 info: do_lrm_rsc_op: Performing key=7:166:0:b57f7f7c-4e2d-4134-9c14-b1a2b7db11a7 op=Storage1:1_start_0 ) lrmd[8016]: 2009/02/10_17:22:32 info: rsc:Storage1:1: start lrmd[8016]: 2009/02/10_17:22:32 info: RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults --create-device --on-io-error=pass_on' terminated with exit code 10 This looks like drbdadm up is failing because the device is already attached to the lower level storage device. Regards Dominik drbd[22270]:2009/02/10_17:22:32 ERROR: Storage1 start: not in Secondary mode after start. crmd[8020]: 2009/02/10_17:22:32 info: process_lrm_event: LRM operation Storage1:1_start_0 (call=189, rc=1, cib-update=380, confirmed=true) complete unknown e rror . I have checked the DRBD device Storage1 and it is in secondary mode after the start, and should I choose I can make it primary on either node Thanks Jason 2009/2/10 Jason Fitzpatrick jayfitzpatr...@gmail.com Thanks, This was the latest version in the Fedora Repos, I will upgrade and see what happens Jason 2009/2/10 Dominik Klein d...@in-telegence.net Jason Fitzpatrick wrote: Hi All I am having a hell of a time trying to get heartbeat to fail over my DRBD harddisk and am hoping for some help. I have a 2 node cluster, heartbeat is working as I am able to fail over IP Addresses and services successfully, but when I try to fail over my DRBD resource from secondary to primary I am hitting a brick wall, I can fail over the DRBD resource manually so I know that it does work at some level DRBD version 8.3 Heartbeat version heartbeat-2.1.3-1.fc9.i386 Please upgrade. Thats too old for reliable master/slave behaviour. Preferrably upgrade to pacemaker and ais or heartbeat 2.99. Read http://www.clusterlabs.org/wiki/Install for install notes. and using heartbeat-gui to configure Don't use the gui to configure complex (ie clone or master/slave) resources. Once you upgraded to the latest pacemaker, please refer to http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 for drbd's cluster configuration. 
Regards Dominik DRBD Resource is called Storage1, the 2 nodes are connected via 2 x-over cables (1 heartbeat 1 Replication) I have stripped down my config to the bare bones and tried every option that I can think off but know that I am missing something simple, I have attached my cib.xml but have removed domain names from the systems for privacy reasons Thanks in advance Jason cib admin_epoch=0 have_quorum=true ignore_dtd=false cib_feature_revision=2.0 num_peers=2 generated=true ccm_transition=22 dc_uuid=9d8abc28-4fa3-408a-a695-fb36b0d67a48 epoch=733 num_updates=1 cib-last-written=Mon Feb 9 18:31:19 2009 configuration crm_config cluster_property_set id=cib-bootstrap-options attributes nvpair id=cib-bootstrap-options-dc-version name=dc-version value=2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3/ nvpair name=last-lrm-refresh id=cib-bootstrap-options-last-lrm-refresh value=1234204278/ /attributes /cluster_property_set /crm_config nodes node id=df707752-d5fb-405a-8ca7-049e25a227b7 uname=lpissan1001 type=normal instance_attributes id=nodes-df707752-d5fb-405a-8ca7-049e25a227b7 attributes nvpair id=standby-df707752-d5fb-405a-8ca7-049e25a227b7 name=standby value=off/ /attributes /instance_attributes /node node id=9d8abc28-4fa3-408a-a695-fb36b0d67a48 uname=lpissan1002 type=normal instance_attributes id=nodes-9d8abc28-4fa3-408a-a695-fb36b0d67a48 attributes nvpair id=standby-9d8abc28-4fa3-408a-a695-fb36b0d67a48
Re: [Linux-HA] failed dependencies while installing heartbeat 2.99.2-6.1
Gerd König wrote: Hello list, I wanted to start with heartbeat using the latest sources for OpenSuse10.3 64bit. I've downloaded these rpm's: heartbeat-2.99.2-6.1.x86_64.rpm heartbeat-common-2.99.2-6.1.x86_64.rpm heartbeat-debuginfo-2.99.2-6.1.x86_64.rpm heartbeat-resources-2.99.2-6.1.x86_64.rpm libheartbeat2-2.99.2-6.1.x86_64.rpm libopenais2-0.80.3-12.2.x86_64.rpm libpacemaker3-1.0.1-3.1.x86_64.rpm libpacemaker-devel-1.0.1-3.1.x86_64.rpm openais-0.80.3-12.2.x86_64.rpm pacemaker-1.0.1-3.1.x86_64.rpm pacemaker-debuginfo-1.0.1-3.1.x86_64.rpm pacemaker-pygui-1.4-11.9.x86_64.rpm pacemaker-pygui-debuginfo-1.4-11.9.x86_64.rpm pacemaker-pygui-devel-1.4-11.9.x86_64.rpm and started to install them, but I'm stuck in installing heartbeat-common package. The command rpm -Uvh heartbeat-common-2.99.2-6.1.x86_64.rpm produces this error message: rpm -Uvh heartbeat-common-2.99.2-6.1.x86_64.rpm warning: heartbeat-common-2.99.2-6.1.x86_64.rpm: Header V3 DSA signature: NOKEY, key ID 1d362aeb error: Failed dependencies: libOpenIPMI.so.0()(64bit) is needed by heartbeat-common-2.99.2-6.1.x86_64 libOpenIPMIposix.so.0()(64bit) is needed by heartbeat-common-2.99.2-6.1.x86_64 libOpenIPMIutils.so.0()(64bit) is needed by heartbeat-common-2.99.2-6.1.x86_64 Well, looks like you need some openipmi library packages. Try zypper search openipmi and install the appropriate packages. Regards Dominik What I've installed so far: rpm -qa | egrep -i heart|openai|pace libheartbeat2-2.99.2-6.1 heartbeat-resources-2.99.2-6.1 openais-0.80.3-12.2 libopenais2-0.80.3-12.2 What's going wrong here ? any help appreciated.GERD. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] OCF_ERROR_GENERIC
It is OCF_ERR_GENERIC, not OCF_ERROR_GENERIC. Read /usr/lib/ocf/resource.d/heartbeat/.ocf-returncodes You can also use ocf-tester to test your ocf script. Regards Dominik lakshmipadmaja maddali wrote: Hi all, I have a strange issue, that ocf_error_generic is being ignored at times. For example, suppose in the start function of the ocf script if I explicitly return ocf_error_generic, the services should shift to the secondary node as far as I know, but that's not happening. Instead, heartbeat is calling the monitor function again and again. But if I explicitly mention exit(1) in the start function, then heartbeat shifts its services to the secondary node. Same is the case with any function in the ocf script. So why is this happening? Please help me out. Regards, padmaja ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
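A hedged minimal sketch of a start function that uses the correct variable name; the daemon command and resource name are placeholders, and the OCF_* variables come from the sourced shell functions (which pull in .ocf-returncodes):
#!/bin/sh
. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

myapp_start() {
    if ! /usr/local/bin/myapp --daemon; then   # placeholder start command
        ocf_log err "myapp failed to start"
        return $OCF_ERR_GENERIC                # note: return the defined constant, not a misspelled one
    fi
    return $OCF_SUCCESS
}
And to exercise the agent outside the cluster, as suggested above:
ocf-tester -n myapp /usr/lib/ocf/resource.d/heartbeat/myapp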