Re: [Linux-ha-dev] patch: RA conntrackd: Request state info on startup

2012-07-24 Thread Dominik Klein
 currently doing another conntrackd project and therefore using the
 code once again (yippee :)). Found a minor issue:

 When the active host is fenced and returns to the cluster, it does not
 request the current connection tracking states. Therefore state
 information might be lost. This patch fixes that. Any comments?

 I'm not sure what you mean by "active host". A node which is
 running conntrackd, or a node which is running the conntrackd master
 instance?

Erm, yeah. Sorry for not being precise. I mean the node running the
master instance.

 Successfully tested with debian squeeze version 0.9.14.

 Looks OK to me. I'll push it to the repository.

Thanks.

Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


[Linux-ha-dev] patch: RA conntrackd: Request state info on startup

2012-07-18 Thread Dominik Klein
Hi people

currently doing another conntrackd project and therefore using the
code once again (yippee :)). Found a minor issue:

When the active host is fenced and returns to the cluster, it does not
request the current connection tracking states. Therefore state
information might be lost. This patch fixes that. Any comments?

Successfully tested with debian squeeze version 0.9.14.
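
Since the patch itself is only attached as binary data in this archive, here is
a minimal sketch of the kind of change described above, not the patch itself.
It assumes the parameter names used elsewhere in this RA (OCF_RESKEY_binary,
OCF_RESKEY_config); conntrackd's -n option requests a resync from the other node:

# hypothetical illustration -- something like this would be called at the end
# of conntrackd_start(), so a node returning after a fence asks the peer for
# the current state table instead of starting with an empty external cache
conntrackd_request_resync() {
	if ! $OCF_RESKEY_binary -C $OCF_RESKEY_config -n; then
		ocf_log warn "Could not request resync from peer; state information may be incomplete"
	fi
}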

Regards
Dominik


conntrackd.patch
Description: Binary data
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] Antw: Re: Forkbomb not initiating failover

2011-08-29 Thread Dominik Klein
Node level failure is detected on the communications layer, i.e. heartbeat
or corosync. That software is run with realtime priority, so it keeps
working just fine (use tcpdump on the healthy node to verify). So
pacemaker on the healthy node does now know that the other node has a
problem and therefore does not initiate failover.
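
To see this for yourself (a sketch; the interface name and port are assumptions
about your setup), check the scheduling class of the membership layer and watch
its traffic still arrive on the healthy node while the other node is under load:

# corosync/heartbeat run in a realtime scheduling class (RR/FF), so they keep
# sending cluster traffic even while a fork bomb starves normal processes
ps -eo pid,cls,rtprio,comm | egrep 'corosync|heartbeat'

# on the healthy node, the peer's packets should still show up
# (5405 is corosync's default mcastport; adjust interface/port to your setup)
tcpdump -ni eth0 udp port 5405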

We had this discussion back in 2010, maybe you also want to refer to 
that: 
http://oss.clusterlabs.org/pipermail/pacemaker/2010-February/004739.html

Regards
Dominik

On 07/08/2011 03:23 PM, Warnke, Eric E wrote:

 If the fork bomb is preventing the system from spawning a health check, it
 would seem like the most intelligent course of action would be to presume
 that it failed and act accordingly.

 -Eric


 On 7/8/11 8:38 AM, Lars Marowsky-Bree <l...@suse.de> wrote:

 On 2011-07-08T14:10:09, Gianluca Cecchi <gianluca.cec...@gmail.com> wrote:

 So that each node has to write to its dedicated part of it and read
 from the other ones.
 If one node doesn't update its portion it is then detected by the
 others and it is fenced after a configurable number of misses...
 Does pacemaker provide some sort of this configuration?

 external/sbd as a fencing mechanism provides this, but that is not the
 same as a load and system health check at all.

 Though tying into that would make sense, yes.


 Regards,
 Lars

 --
 Architect Storage/HA, OPS Engineering, Novell, Inc.
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
 Imendörffer, HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar Wilde

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Forkbomb not initiating failover

2011-08-29 Thread Dominik Klein
On 08/29/2011 09:51 AM, Dominik Klein wrote:
 Node level failure is detected on the communications layer, i.e. heartbeat
 or corosync. That software is run with realtime priority, so it keeps
 working just fine (use tcpdump on the healthy node to verify). So
 pacemaker on the healthy node does now know

woops, this was supposed to say "not know"

 that the other node has a
 problem and therefore does not initiate failover.

 We had this discussion back in 2010, maybe you also want to refer to
 that:
 http://oss.clusterlabs.org/pipermail/pacemaker/2010-February/004739.html

 Regards
 Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] Patch: VirtualDomain - fix probe if config is not on shared storage

2011-06-28 Thread Dominik Klein
 There did not have to be a negative location constraint up to now,
 because the cluster took care of that.
 
 Only because it didn't work correctly.

Okay.

 Actually, this is a wanted setup. It happened that VMs configs were
 changed in ways that led to a VM not being startable any more. For that
 case, they wanted to be able to start the old config on the other node.

Please, notice _they_ vs. _me_ here :)

 Wow! So, they can have different configurations at different
 nodes.

Agreed, wow!

 The only issue you may have with this cluster is if the
 administrator erroneously removes a config on some node, right?
 And that then some time afterwards the cluster does a probe on
 that node. And then again the cluster wants to fail over this VM
 to that node. And that at this point in time no other node can
 run this VM and that it is going to repeatedly try to start and
 fail. And that "failed start is fatal" isn't configured. No doubt
 that this could happen, but what's the probability? And, finally,
 that doesn't look like a well maintained cluster.

I guess this is something _they_ have to live with then.

At first glance, I honestly thought this was a change in the agent that
introduced a regression which more than just this configuration would hit,
but you made me realize that it does not, and that it actually improves the
agent for sane setups.

My vote goes for your patch, i.e. stop + no config => return SUCCESS

Thanks
Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Patch: VirtualDomain - fix probe if config is not on shared storage

2011-06-27 Thread Dominik Klein
On 06/27/2011 11:09 AM, Dejan Muhamedagic wrote:
 Hi Dominik,
 
 On Fri, Jun 24, 2011 at 03:50:40PM +0200, Dominik Klein wrote:
 Hi Dejan,

 this way, the cluster never learns that it can't start a resource on 
 that node.
 
 This resource depends on shared storage. So, the cluster won't
 try to start it unless the shared storage resource is already
 running. This is something that needs to be specified using
 either a negative preference location constraint or asymmetrical
 cluster. There's no need for yet another mechanism (the extra
 parameter) built into the resource agent. It's really an
 overkill.

As requested on IRC, I describe my setup and explain why I think this is
a regression.

2 node cluster with a bunch of drbd devices.

Each /dev/drbdXX is used as a block device of a VM. The VMs'
configuration files are not on shared storage but have to be copied
manually.

So it happened that during configuration of a VM, the admin forgot to
copy the configuration file to node2. The machine's DRBD was configured
though. So the cluster decided to promote the VM's DRBD on node2 and then
start the VM, which is colocated with and ordered after the DRBD master.

With the agent before the mentioned patch, during probe of a newly
configured resource, the cluster would have learned that the VM is not
available on one of the nodes (ERR_INSTALLED), so it would never start
the resource there.

Now it sees NOT_RUNNING on all nodes during probe and may decide to
start the VM on a node where it cannot run. That, with the current
version of the agent, leads to a failed start, a failed stop during
recovery, and therefore an unnecessary stonith operation.

With Dejan's patch, it would still see NOT_RUNNING during probe, but at
least the stop would succeed. So the difference from the old version would
be that we get an unnecessary failed start on the node that does not
have the VM's config, but it would not harm the node, and I'd be fine with
applying that patch.

There's a case, though, that might keep the VM from running for some amount
of time, and that is if start-failure-is-fatal is false. Then we would
have $migration-threshold failed-start/successful-stop iterations while
the VM's service would not be running.

Of course I do realize that the initial fault is a human one, but the
cluster used to protect against this and does not any more, and that's why I
think this is a regression.

I think the correct way to fix this is to still return ERR_INSTALLED
during probe unless the cluster admin configures that the VMs config is
on shared storage. Finding out about resource states on different nodes
is what the probe was designed to do, was it not? And we work around
that in this resource agent just to support certain setups.

Regards
Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Patch: VirtualDomain - fix probe if config is not on shared storage

2011-06-27 Thread Dominik Klein
 With the agent before the mentioned patch, during probe of a newly
 configured resource, the cluster would have learned that the VM is not
 available on one of the nodes (ERR_INSTALLED), so it would never start
 the resource there.
 
 This is exactly the problem with shared storage setups, where
 such an exit code can prevent resource from ever being started on
 a node which is otherwise perfectly capable of running that
 resource.

I see and understand that that, too, is a valid setup and concern.

 But really, if a resource can _never_ run on a node, then there
 should be a negative location constraint or the cluster should be
 setup as asymmetrical. 

There did not have to be a negative location constraint up to now,
because the cluster took care of that.

 Now, I understand that in your case, it is
 actually due to the administrator's fault.

Yes, that's how I noticed the problem with the agent.

 This particular setup is a special case of shared storage. The
 images are on shared storage, but the configurations are local. I
 think that you really need to make sure that the configurations
 are present where they need to be. Best would be that the
 configuration is kept on the storage along with the corresponding
 VM image. Since you're using a raw device as image, that's
 obviously not possible. Otherwise, use csync2 or similar to keep
 files in sync.

Actually, this is a wanted setup. It happened that VMs configs were
changed in ways that led to a VM not being startable any more. For that
case, they wanted to be able to start the old config on the other node.

I agree that the cases that led me to find this change in the agent
are cases that could have been solved with better configuration and that
your suggestions make sense. Still, I feel that the change introduces a
new way of doing things that might affect running and working setups in
unintended ways. I refuse to believe that I am the only one doing HA VMs
like this (although of course I might be wrong on that, too ...).

Regards
Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Patch: VirtualDomain - fix probe if config is not on shared storage

2011-06-26 Thread Dominik Klein
I'm not sure my fix is correct.

According to

https://github.com/ClusterLabs/resource-agents/commit/96ff8e9ad3d4beca7e063beef156f3b838a798e1#heartbeat/VirtualDomain

this is a regression which was introduced in April '11.

So the fix should be the other way around: introduce a parameter that
lets the user configure that the config file _is_ on shared storage and, if
this is false or unset, return to the old behaviour of returning
ERR_INSTALLED.
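
For illustration, with a parameter like the config_on_shared_storage one from
the patch posted in this thread, a VM whose config lives locally on each node
would be configured roughly like this (a sketch; resource name and path are
made up):

primitive vm-example ocf:heartbeat:VirtualDomain \
        params config="/etc/libvirt/qemu/vm-example.xml" \
               config_on_shared_storage="0" \
        op monitor interval="30s" timeout="30s"

With the value 0, a missing config file during probe would again be reported
as ERR_INSTALLED instead of NOT_RUNNING.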

Regards
Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


[Linux-ha-dev] Patch: VirtualDomain - fix probe if config is not on shared storage

2011-06-24 Thread Dominik Klein
This fixes the issue described yesterday.

Comments?

Regards
Dominik
exporting patch:
# HG changeset patch
# User Dominik Klein dominik.kl...@gmail.com
# Date 1308909599 -7200
# Node ID 2b1615aaca2c90f2f4ab93eb443e5902906fb28a
# Parent  7a11934b142d1daf42a04fbaa0391a3ac47cee4c
RA VirtualDomain: Fix probe if config is not on shared storage

diff -r 7a11934b142d -r 2b1615aaca2c heartbeat/VirtualDomain
--- a/heartbeat/VirtualDomain	Fri Feb 25 12:23:17 2011 +0100
+++ b/heartbeat/VirtualDomain	Fri Jun 24 11:59:59 2011 +0200
@@ -19,9 +19,11 @@
 # Defaults
 OCF_RESKEY_force_stop_default=0
 OCF_RESKEY_hypervisor_default=$(virsh --quiet uri)
+OCF_RESKEY_config_on_shared_storage_default=1
 
 : ${OCF_RESKEY_force_stop=${OCF_RESKEY_force_stop_default}}
 : ${OCF_RESKEY_hypervisor=${OCF_RESKEY_hypervisor_default}}
+: ${OCF_RESKEY_config_on_shared_storage=${OCF_RESKEY_config_on_shared_storage_default}}
 ###
 
 ## I'd very much suggest to make this RA use bash,
@@ -421,8 +423,8 @@
 # check if we can read the config file (otherwise we're unable to
 # deduce $DOMAIN_NAME from it, see below)
 if [ ! -r $OCF_RESKEY_config ]; then
-	if ocf_is_probe; then
-	ocf_log info "Configuration file $OCF_RESKEY_config not readable during probe."
+	if ocf_is_probe && ocf_is_true $OCF_RESKEY_config_on_shared_storage; then
+	ocf_log info "Configuration file $OCF_RESKEY_config not readable during probe. Assuming it is on shared storage and therefore reporting VM is not running."
 	else
 	ocf_log error "Configuration file $OCF_RESKEY_config does not exist or is not readable."
 	return $OCF_ERR_INSTALLED
exporting patch:
# HG changeset patch
# User Dominik Klein dominik.kl...@gmail.com
# Date 1308911272 -7200
# Node ID 312adf2449eb59dcc41686626b1726428d13227b
# Parent  2b1615aaca2c90f2f4ab93eb443e5902906fb28a
RA VirtualDomain: Add metadata for the new parameter

diff -r 2b1615aaca2c -r 312adf2449eb heartbeat/VirtualDomain
--- a/heartbeat/VirtualDomain   Fri Jun 24 11:59:59 2011 +0200
+++ b/heartbeat/VirtualDomain   Fri Jun 24 12:27:52 2011 +0200
@@ -119,6 +119,16 @@
 <content type="string" default="" />
 </parameter>
 
+<parameter name="config_on_shared_storage" unique="0" required="0">
+<longdesc lang="en">
+If your VMs configuration file is _not_ on shared storage, so that the config
+file not being in place during a probe means that the VM is not installed/runnable
+on that node, set this to 0.
+</longdesc>
+<shortdesc lang="en">Set to 0 if your VMs config file is not on shared storage</shortdesc>
+<content type="boolean" default="1" />
+</parameter>
+
 </parameters>
 
 <actions>
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Patch: VirtualDomain - fix probe if config is not on shared storage

2011-06-24 Thread Dominik Klein
Hi Dejan,

this way, the cluster never learns that it can't start a resource on 
that node.

I don't consider this a solution.

Regards
Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


[Linux-ha-dev] VirtualDomain issue

2011-06-22 Thread Dominik Klein
Hi

code snippet from
http://hg.linux-ha.org/agents/raw-file/7a11934b142d/heartbeat/VirtualDomain
(which I believe is the current version)

VirtualDomain_Validate_All() {
<snip>
    if [ ! -r $OCF_RESKEY_config ]; then
	if ocf_is_probe; then
	    ocf_log info "Configuration file $OCF_RESKEY_config not readable during probe."
	else
	    ocf_log error "Configuration file $OCF_RESKEY_config does not exist or is not readable."
	    return $OCF_ERR_INSTALLED
	fi
    fi
}
<snip>
VirtualDomain_Validate_All || exit $?
<snip>
if ocf_is_probe && [ ! -r $OCF_RESKEY_config ]; then
    exit $OCF_NOT_RUNNING
fi

So, say one node does not have the config, but the cluster decides to
run the VM on that node. The probe returns NOT_RUNNING, so the cluster
tries to start the VM; that start returns ERR_INSTALLED, so the cluster has
to try to recover from the start failure and stops it, but that stop op
returns ERR_INSTALLED as well, so we need to be stonith'd.

I think this is wrong behaviour. I read the comments about
configurations being on shared storage which might not be available at
certain points in time and I see the point. But the way this is
implemented clearly does not work for everybody. I vote for making this
configurable. Unfortunately, for several reasons, I am not able to
contribute this patch myself at the moment.

Regards
Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] How to start Pacemaker in unmanaged mode ?

2011-05-11 Thread Dominik Klein
Correct me if I'm wrong, but I strongly think the following would work:

Since you want to _start_ pacemaker in unmanaged mode, I expect all your
nodes to be offline. Then delete cib.xml on all nodes but one.

On the one remaining node, edit cib.xml and put your configuration with
is_managed="false" there.

Then start all nodes. I don't see why this shouldn't work.
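
For illustration, the edited cib.xml on the remaining node would carry a meta
attribute roughly like this (a sketch; the ids are made up, and note that
pacemaker spells the meta attribute is-managed):

<primitive id="my-resource" class="ocf" provider="heartbeat" type="IPaddr2">
  <meta_attributes id="my-resource-meta">
    <nvpair id="my-resource-meta-is-managed" name="is-managed" value="false"/>
  </meta_attributes>
</primitive>

Alternatively, a single maintenance-mode="true" cluster property in the
crm_config section unmanages everything at once.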

Regards
Dominik

On 05/10/2011 05:08 PM, Alain.Moulle wrote:
 I don't think it is authorized at all; we must never write directly
 in cib.xml. I tried that for other needs and it systematically disturbs
 Pacemaker a lot!
 We always have to go through some crm commands or similar
 (crm_attributes, etc.),
 but they are not taken into account before the 60s have ended.
 
 Alain
 
 Dominik Klein a écrit :
 Just write it to the xml on all nodes?

 On 05/10/2011 01:23 PM, Alain.Moulle wrote:
   
 Sorry I meant directly with is_managed=false of course !
 Alain
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems


   
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 

-- 
IN-telegence GmbH
Oskar-Jäger-Str. 125
50825 Köln

Registergericht AG Köln - HRB 34038
USt-ID DE210882245
Geschäftsführende Gesellschafter: Christian Plätke und Holger Jansen
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] How to start Pacemaker in unmanaged mode ?

2011-05-11 Thread Dominik Klein


On 05/11/2011 10:24 AM, Alain.Moulle wrote:
 Hi Dominik,
 I just have tried again:
 service corosync stop on both nodes node1 and node2
 remove the cib.xml on node2
 vi cib.xml on node1
 set the property maintenance-mode=true
 (<nvpair id="cib-bootstrap-options-maintenance-mode"
 name="maintenance-mode" value="true"/>)
 wq!
 start Pacemaker on node1 and on node2
 On node1: both nodes remain UNCLEAN (offline) ad vitam aeternam
 On node2: crm_mon: "Attempting connection to the cluster." ad vitam aeternam
 Moreover, the added line for maintenance-mode=true has been removed by
 Pacemaker/corosync in the cib.xml
 
 I tried a long time ago to vi and change some values in the cib.xml,
 and each time
 I did that, Pacemaker never started correctly again and I had to
 reconfigure all the
 things from scratch so that it would start again...

I have to admit I only did this back in heartbeat times and so it seems
something changed regarding this.

So ... Just had a look at a cluster of mine and there are backup copies
and .sig files for each cib.xml version.

I don't know what pacemaker will do if you just

a) removed the history and .sig file, so only cib.xml is in place
or
b) replaced the (apparently md5) checksum in .sig

Worth a shot I think.
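
A sketch of option a), assuming the usual on-disk layout under
/var/lib/heartbeat/crm (paths and file names may differ on your build), and
with a backup taken first:

# cluster stopped on all nodes
cd /var/lib/heartbeat/crm
cp cib.xml /root/cib.xml.backup
# a) remove the history and signature files so only the edited cib.xml remains
rm -f cib-*.raw cib-*.raw.sig cib.xml.sig cib.xml.last cib.xml.sig.last
# (option b would mean regenerating the md5 checksum in cib.xml.sig instead,
#  but its exact format is version-dependent, so a) is the simpler test)
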
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] How to start Pacemaker in unmanaged mode ?

2011-05-10 Thread Dominik Klein
Just write it to the xml on all nodes?

On 05/10/2011 01:23 PM, Alain.Moulle wrote:
 Sorry I meant directly with is_managed=false of course !
 Alain
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ocf:pacemaker:ping: dampen

2011-04-29 Thread Dominik Klein
It waits $dampen before changes are pushed to the CIB, so that
occasional ICMP hiccups do not produce an unintended failover.

At least that's my understanding.
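
For example (a sketch; addresses and resource names are made up, parameters are
the usual ping ones): with a 30-second dampen, a flapping ping target has to
stay unreachable long enough for the changed pingd attribute to actually reach
the CIB before it can trigger a failover.

primitive p_ping ocf:pacemaker:ping \
        params host_list="192.168.1.1" multiplier="1000" dampen="30s" \
        op monitor interval="10s"
clone cl_ping p_ping
location l_run_on_connected my_resource \
        rule -inf: not_defined pingd or pingd lte 0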

Regards
Dominik

On 04/29/2011 09:22 AM, Ulrich Windl wrote:
 Hi,
 
 I think the description for dampen in OCF:pacemaker:ping 
 (pacemaker-1.1.5-5.5.5 of SLES11 SP1) is too terse:
 
 <parameter name="dampen" unique="0">
 <longdesc lang="en">
 The time to wait (dampening) further changes occur
 </longdesc>
 <shortdesc lang="en">Dampening interval</shortdesc>
 <content type="integer" default="5s"/>
 </parameter>
 
 What does that do?
 
 Regards,
 Ulrich
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ocf:pacemaker:ping: dampen

2011-04-29 Thread Dominik Klein
 correcto

wow.
again! :)
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] New OCF RA: symlink

2011-04-21 Thread Dominik Klein
 Am I too paranoid?

I don't think you are. A non-root user practically being able to remove any
file is certainly a valid concern.

Thing is: I needed an RA that configured a cronjob. Florian suggested
writing the symlink RA instead, which could manage symlinks. Apparently
there was an IRC discussion a couple of weeks ago that I was not a part of.

So while the symlink RA could also do what I needed, I tried to write
that instead of the cronjob RA (which will still come, since it will cover
some more functions than this one, but that's another story).

So anyway, maybe those involved in the first discussion can comment on
this, too, and share thoughts on how to solve things. Maybe they have
already addressed these situations.

Regards
Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] libglue2 dependency missing in cluster-glue

2011-03-17 Thread Dominik Klein
Mornin Dejan,

 The reason was that libglue2 and cluster-glue were not installed from
 the clusterlabs repository, as the rest of the packages were, but
 instead they were pulled from the original opensuse repository in an
 older version.

 This is what I found in pacemaker.spec.in in the repository:

 Requires(pre):  cluster-glue >= 1.0.6

 Which version of glue was that older version?

 0.9.1
 
 Whoa. Can't recall ever seeing that thing.

rpm -qRp cluster-glue-0.9-2.1.x86_64.rpm
/usr/sbin/groupadd
/usr/bin/getent
/usr/sbin/useradd
/bin/sh
rpmlib(PayloadFilesHavePrefix) = 4.0-1
rpmlib(CompressedFileNames) = 3.0.4-1
/bin/bash
/bin/sh
/usr/bin/perl
/usr/bin/python
libc.so.6()(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libc.so.6(GLIBC_2.4)(64bit)
libcurl.so.4()(64bit)
libglib-2.0.so.0()(64bit)
liblrm.so.2()(64bit)
libnetsnmp.so.15()(64bit)
libpils.so.2()(64bit)
libplumb.so.2()(64bit)
libplumbgpl.so.2()(64bit)
libstonith.so.1()(64bit)
libxml2.so.2()(64bit)
rpmlib(PayloadIsLzma) = 4.4.6-1

That's the old package, from opensuse.

Here's the new one (106 from clusterlabs' opensuse 11.2 repository):

/usr/sbin/groupadd
/usr/bin/getent
/usr/sbin/useradd
/bin/sh
/bin/sh
/bin/sh
/bin/sh
rpmlib(PayloadFilesHavePrefix) = 4.0-1
rpmlib(CompressedFileNames) = 3.0.4-1
/bin/bash
/bin/sh
/usr/bin/env
/usr/bin/perl
/usr/bin/python
libOpenIPMI.so.0()(64bit)
libOpenIPMIposix.so.0()(64bit)
libOpenIPMIutils.so.0()(64bit)
libbz2.so.1()(64bit)
libc.so.6()(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libc.so.6(GLIBC_2.4)(64bit)
libcrypto.so.0.9.8()(64bit)
libcurl.so.4()(64bit)
libdl.so.2()(64bit)
libglib-2.0.so.0()(64bit)
liblrm.so.2()(64bit)
libltdl.so.7()(64bit)
libm.so.6()(64bit)
libnetsnmp.so.15()(64bit)
libopenhpi.so.2()(64bit)
libpils.so.2()(64bit)
libplumb.so.2()(64bit)
libplumbgpl.so.2()(64bit)
librt.so.1()(64bit)
libstonith.so.1()(64bit)
libuuid.so.1()(64bit)
libxml2.so.2()(64bit)
libz.so.1()(64bit)
rpmlib(PayloadIsLzma) = 4.4.6-1

I don't see libglue there.

Regards
Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] libglue2 dependency missing in cluster-glue

2011-03-17 Thread Dominik Klein
 This is what I found in pacemaker.spec.in in the repository:
 
 Requires(pre):  cluster-glue >= 1.0.6

The 1.0.10 rpm from clusterlabs for opensuse 11.2 just says
cluster-glue afaict:

rpm -qR pacemaker
cluster-glue
resource-agents
python = 2.4
libpacemaker3 = 1.0.10-1.4
libesmtp
net-snmp
rpmlib(PayloadFilesHavePrefix) = 4.0-1
rpmlib(CompressedFileNames) = 3.0.4-1
/bin/bash
/bin/sh
/usr/bin/env
/usr/bin/python
libbz2.so.1()(64bit)
libc.so.6()(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libc.so.6(GLIBC_2.3)(64bit)
libc.so.6(GLIBC_2.4)(64bit)
libccmclient.so.1()(64bit)
libcib.so.1()(64bit)
libcoroipcc.so.4()(64bit)
libcrmcluster.so.1()(64bit)
libcrmcommon.so.2()(64bit)
libcrypt.so.1()(64bit)
libcrypto.so.0.9.8()(64bit)
libdl.so.2()(64bit)
libesmtp.so.5()(64bit)
libgcrypt.so.11()(64bit)
libglib-2.0.so.0()(64bit)
libgnutls.so.26()(64bit)
libgnutls.so.26(GNUTLS_1_4)(64bit)
libgpg-error.so.0()(64bit)
libhbclient.so.1()(64bit)
liblrm.so.2()(64bit)
libltdl.so.7()(64bit)
libm.so.6()(64bit)
libncurses.so.5()(64bit)
libnetsnmp.so.15()(64bit)
libnetsnmpagent.so.15()(64bit)
libnetsnmphelpers.so.15()(64bit)
libnetsnmpmibs.so.15()(64bit)
libpam.so.0()(64bit)
libpam.so.0(LIBPAM_1.0)(64bit)
libpe_rules.so.2()(64bit)
libpe_status.so.2()(64bit)
libpengine.so.3()(64bit)
libperl.so()(64bit)
libpils.so.2()(64bit)
libplumb.so.2()(64bit)
libpopt.so.0()(64bit)
libpthread.so.0()(64bit)
libpthread.so.0(GLIBC_2.2.5)(64bit)
librpm.so.0()(64bit)
librpmio.so.0()(64bit)
librt.so.1()(64bit)
libsensors.so.3()(64bit)
libstonith.so.1()(64bit)
libstonithd.so.0()(64bit)
libtransitioner.so.1()(64bit)
libwrap.so.0()(64bit)
libxml2.so.2()(64bit)
libxslt.so.1()(64bit)
libz.so.1()(64bit)
rpmlib(PayloadIsLzma) = 4.4.6-1

Regards
Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


[Linux-ha-dev] libglue2 dependency missing in cluster-glue

2011-03-16 Thread Dominik Klein
Hi

as some of you might have seen on the pacemaker list, I tried to install
a 3-node cluster and there were IPC issues reported by the cib, and
therefore the cluster could not start correctly.

The reason was that libglue2 and cluster-glue were not installed from
the clusterlabs repository, as the rest of the packages were, but
instead they were pulled from the original opensuse repository in an
older version.

So I went and updated cluster-glue with the version from the clusterlabs
repository. Nothing changed though.

rpm -qa|grep glue
revealed that libglue2 was still the old version while cluster-glue was
updated.

Looking at the package dependencies, I think the problem is that
cluster-glue does not depend on the libglue2 package (while the
dependency does exist the other way around).

So one error, which I could fix myself, was that the installation
instructions on the clusterlabs site did not mention libglue2 and
cluster-glue. They do now, which should prevent this for others who
follow those instructions.

The dependency thing is up for grabs ;)
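
For reference, the fix would presumably be a one-line dependency in the
cluster-glue package section of the glue spec file, something like this (a
sketch; exact macro usage depends on how the spec is laid out):

# cluster-glue.spec -- make the tools package pull in the matching library
Requires:       libglue2 = %{version}-%{release}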

Regards
Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] libglue2 dependency missing in cluster-glue

2011-03-16 Thread Dominik Klein
Hi Dejan

 The reason was that libglue2 and cluster-glue were not installed from
 the clusterlabs repository, as the rest of the packages were, but
 instead they were pulled from the original opensuse repository in an
 older version.
 
 This is what I found in pacemaker.spec.in in the repository:
 
 Requires(pre):  cluster-glue >= 1.0.6
 
 Which version of glue was that older version?

0.9.1

So you're saying pacemaker depends on cluster-glue 1.0.6. Well, that was
not installed when I installed pacemaker. And I did not use --nodeps or
such thing.

Instead, that old version was installed from the original opensuse
repositories.

 So I went and updated cluster-glue with the version from the clusterlabs
 repository. Nothing changed though.

 rpm -qa|grep glue
 revealed that libglue2 was still the old version while cluster-glue was
 updated.

 Looking at the package dependencies, I think the problem is that
 cluster-glue does not depend on package libglue2 (while they do the
 other way around).
 
 Yes, I guess that that should be fixed.

Regards
Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] stonith + APC Masterswitch (AP9225 + AP9616)

2011-02-25 Thread Dominik Klein
You could also try apcmastersnmp.

Got that to work with APC devices which did not work with the telnet-based plugin.

As long as they didn't change the MIBs (which I don't know whether they
have), it might be worth a shot.
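
A quick way to check whether the SNMP plugin can talk to the device (a sketch;
the parameter order and the community string are assumptions from memory):

# show which parameters the plugin expects
stonith -t apcmastersnmp -n
# list/status of the configured outlets via SNMP (ipaddr, SNMP port, community)
stonith -d 1 -t apcmastersnmp -p "192.168.1.50 161 private" -lS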

Regards
Dominik

On 02/25/2011 02:24 AM, Avestan wrote:
 
 Hello Dejan,
 
 As I am trying to use the patch that you pointed, I have got the following: 
 
 [root@server1 ~]# patch < /usr/src/apcmaster.c.patch
 can't find file to patch at input line 3
 Perhaps you should have used the -p or --strip option?
 The text leading up to this was:
 --
 |--- apcmaster_orig.c   2007-12-21 09:32:27.0 -0600
 |+++ apcmaster.c2008-02-19 18:24:48.0 -0600
 --
 File to patch:
 [root@server1~]# 
 
 Could you tell me which file I am trying to patch?  The closest file that I
 can find which may need patching is:
 
 /usr/lib/stonith/plugins/stonith2/apcmaster.so
 
 Thank you in advance for your help, Dejan.
 
 Avestan
 
 
 
 
 Dejan Muhamedagic wrote:

 Hi,

 On Wed, Feb 23, 2011 at 06:44:02PM -0800, Avestan wrote:

 Hello everyone,

 I have been trying to get STONITH to work with the MasterSwitch Plus
 (AP9225 + AP9617) with very little success.

 I have to mention that I am able to access the MasterSwitch Plus via
 either
 serial port or Ethernet port with no issue.  But when I try the stonith
 command, it appears that there is some issue with the login.

 [root@server1 ~]# stonith -d 1 -t apcmaster -p 192.168.1.50 apc apc -l
 ** (process:4930): DEBUG: NewPILPluginUniv(0xa03b008)
 ** (process:4930): DEBUG: PILS: Plugin path =
 /usr/lib/stonith/plugins:/usr/lib/pils/plugins
 ** (process:4930): DEBUG: NewPILInterfaceUniv(0xa03c910)
 ** (process:4930): DEBUG: NewPILPlugintype(0xa03ca68)
 ** (process:4930): DEBUG: NewPILPlugin(0xa03cdc8)
 ** (process:4930): DEBUG: NewPILInterface(0xa03cb00)
 ** (process:4930): DEBUG:
 NewPILInterface(0xa03cb00:InterfaceMgr/InterfaceMgr)*** user_data: 0x0
 ***
 ** (process:4930): DEBUG:
 InterfaceManager_plugin_init(0xa03cb00/InterfaceMgr)
 ** (process:4930): DEBUG: Registering Implementation manager for
 Interface
 type 'InterfaceMgr'
 ** (process:4930): DEBUG: PILS: Looking for InterfaceMgr/generic =
 [/usr/lib/stonith/plugins/InterfaceMgr/generic.so]
 ** (process:4930): DEBUG: Plugin file
 /usr/lib/stonith/plugins/InterfaceMgr/generic.so does not exist
 ** (process:4930): DEBUG: PILS: Looking for InterfaceMgr/generic =
 [/usr/lib/pils/plugins/InterfaceMgr/generic.so]
 ** (process:4930): DEBUG: Plugin path for InterfaceMgr/generic =
 [/usr/lib/pils/plugins/InterfaceMgr/generic.so]
 ** (process:4930): DEBUG: PluginType InterfaceMgr already present
 ** (process:4930): DEBUG: Plugin InterfaceMgr/generic  init function:
 InterfaceMgr_LTX_generic_pil_plugin_init
 ** (process:4930): DEBUG: NewPILPlugin(0xa03d690)
 ** (process:4930): DEBUG: Plugin InterfaceMgr/generic loaded and
 constructed.
 ** (process:4930): DEBUG: Calling init function in plugin
 InterfaceMgr/generic.
 ** (process:4930): DEBUG: NewPILInterface(0xa03d630)
 ** (process:4930): DEBUG:
 NewPILInterface(0xa03d630:InterfaceMgr/stonith2)*** user_data: 0xa03ce80
 ***
 ** (process:4930): DEBUG: Registering Implementation manager for
 Interface
 type 'stonith2'
 ** (process:4930): DEBUG: IfIncrRefCount(1 + 1 )
 ** (process:4930): DEBUG: PluginIncrRefCount(0 + 1 )
 ** (process:4930): DEBUG: IfIncrRefCount(1 + 100 )
 ** (process:4930): DEBUG: PILS: Looking for stonith2/apcmaster =
 [/usr/lib/stonith/plugins/stonith2/apcmaster.so]
 ** (process:4930): DEBUG: Plugin path for stonith2/apcmaster =
 [/usr/lib/stonith/plugins/stonith2/apcmaster.so]
 ** (process:4930): DEBUG: Creating PluginType for stonith2
 ** (process:4930): DEBUG: NewPILPlugintype(0xa03db70)
 ** (process:4930): DEBUG: Plugin stonith2/apcmaster  init function:
 stonith2_LTX_apcmaster_pil_plugin_init
 ** (process:4930): DEBUG: NewPILPlugin(0xa03dcf0)
 ** (process:4930): DEBUG: Plugin stonith2/apcmaster loaded and
 constructed.
 ** (process:4930): DEBUG: Calling init function in plugin
 stonith2/apcmaster.
 ** (process:4930): DEBUG: NewPILInterface(0xa03dc38)
 ** (process:4930): DEBUG:
 NewPILInterface(0xa03dc38:stonith2/apcmaster)***
 user_data: 0x26da2c ***
 ** (process:4930): DEBUG: IfIncrRefCount(101 + 1 )
 ** (process:4930): DEBUG: PluginIncrRefCount(0 + 1 )
 ** (process:4930): DEBUG: Got '\xff'
 ** (process:4930): DEBUG: Got '\xfb'
 ** (process:4930): DEBUG: Got '\u0001'
 ** (process:4930): DEBUG: Got '
 '
 ** (process:4930): DEBUG: Got '\u000d'
 ** (process:4930): DEBUG: Got 'U'
 ** (process:4930): DEBUG: Got 's'
 ** (process:4930): DEBUG: Got 'e'
 ** (process:4930): DEBUG: Got 'r'
 ** (process:4930): DEBUG: Got ' '
 ** (process:4930): DEBUG: Got 'N'
 ** (process:4930): DEBUG: Got 'a'
 ** (process:4930): DEBUG: Got 'm'
 ** (process:4930): DEBUG: Got 'e'
 ** (process:4930): DEBUG: Got ' '
 ** (process:4930): DEBUG: Got ':'
 ** (process:4930): DEBUG: Got ' '

 ** (process:4930): CRITICAL **: 

Re: [Linux-ha-dev] Feedback on conntrackd RA by Dominik Klein

2011-02-14 Thread Dominik Klein
Thanks for inclusion.

While looking through the pushed changes, I spotted two meta-data typos.
See trivial patch.

Regards
Dominik

 Applied and pushed with two minor edits. Thanks a lot!
 
 Cheers,
 Florian
--- conntrackd.orig	2011-02-14 11:43:22.0 +0100
+++ conntrackd	2011-02-14 11:43:42.0 +0100
@@ -57,7 +57,7 @@
 <longdesc lang="en">Name of the conntrackd executable.
 If conntrackd is installed and available in the default PATH, it is sufficient to configure the name of the binary
 For example my-conntrackd-binary-version-0.9.14
-If conntrackd is installed somehwere else, you may also give a full path
+If conntrackd is installed somewhere else, you may also give a full path
 For example /packages/conntrackd-0.9.14/sbin/conntrackd
 </longdesc>
 <shortdesc lang="en">Name of the conntrackd executable</shortdesc>
@@ -66,7 +66,7 @@
 
 <parameter name="config">
 <longdesc lang="en">Full path to the conntrackd.conf file.
-For example /packages/conntrackd-0.9.4/etc/conntrackd/conntrackd.conf</longdesc>
+For example /packages/conntrackd-0.9.14/etc/conntrackd/conntrackd.conf</longdesc>
 <shortdesc lang="en">Path to conntrackd.conf</shortdesc>
 <content type="string" default="$OCF_RESKEY_config_default"/>
 </parameter>
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Feedback on conntrackd RA by Dominik Klein

2011-02-11 Thread Dominik Klein
Hi Florian

 it appears that the RA is good to be merged with just a few changes left
 to be done.

Great!

 * Please fix the initialization to honor $OCF_FUNCTIONS_DIR and ditch
 the redundant locale initialization.

done

 * Please rename the parameters to follow the precendents set by other
 RAs (binary instead of conntrackd, config instead of
 conntrackdconf).

done

 * Please don't require people to set a full path to the conntrackd
 binary, honoring $PATH is expected.

I don't see where I do that. At least code-wise I never did that. Did
you mean the meta-data?

 * Please set defaults the way the other RAs do, rather than with your
 if [ -z OCF_RESKEY_whatever ] logic.

done

 * Please define the default path to your statefile in relative to
 ${HA_RSCTMP}. Also, put ${OCF_RESOURCE_INSTANCE} in the filename.

done

 * Actually, rather than managing your statefile manually, you might be
 able to just use ha_pseudo_resource().

done
nice function btw :)

 * Please revise your timeouts. Is a 240-second minimum timeout on start
 not a bit excessive?

Sure is. Copy and paste leftover. Changed to 30.

 * Please revise your metadata, specifically your longdescs. The more
 useful information you provide to users, the better. Recall that that
 information is readily available to users via the man pages and crm ra
 info.

done

Regards
Dominik
--- conntrackd	2011-02-10 12:23:37.054678924 +0100
+++ conntrackd.fghaas	2011-02-11 09:45:39.721300359 +0100
@@ -4,7 +4,7 @@
 #   An OCF RA for conntrackd
 #	http://conntrack-tools.netfilter.org/
 #
-# Copyright (c) 2010 Dominik Klein
+# Copyright (c) 2011 Dominik Klein
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of version 2 of the GNU General Public License as
@@ -25,11 +25,19 @@
 # along with this program; if not, write the Free Software Foundation,
 # Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
 #
+
 ###
 # Initialization:
 
-. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs
-export LANG=C LANGUAGE=C LC_ALL=C
+: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
+. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs
+
+###
+
+OCF_RESKEY_binary_default=/usr/sbin/conntrackd
+OCF_RESKEY_config_default=/etc/conntrackd/conntrackd.conf
+: ${OCF_RESKEY_binary=${OCF_RESKEY_binary_default}}
+: ${OCF_RESKEY_config=${OCF_RESKEY_config_default}}
 
 meta_data() {
 	cat <<END
@@ -46,30 +54,30 @@
 
 <parameters>
 <parameter name="conntrackd">
-<longdesc lang="en">Full path to conntrackd executable</longdesc>
-<shortdesc lang="en">Full path to conntrackd executable</shortdesc>
-<content type="string" default="/usr/sbin/conntrackd"/>
+<longdesc lang="en">Name of the conntrackd executable.
+If conntrackd is installed and available in the default PATH, it is sufficient to configure the name of the binary
+For example my-conntrackd-binary-version-0.9.14
+If conntrackd is installed somehwere else, you may also give a full path
+For example /packages/conntrackd-0.9.14/sbin/conntrackd
+</longdesc>
+<shortdesc lang="en">Name of the conntrackd executable</shortdesc>
+<content type="string" default="$OCF_RESKEY_binary_default"/>
 </parameter>
 
-<parameter name="conntrackdconf">
-<longdesc lang="en">Full path to the conntrackd.conf file.</longdesc>
+<parameter name="config">
+<longdesc lang="en">Full path to the conntrackd.conf file.
+For example /packages/conntrackd-0.9.4/etc/conntrackd/conntrackd.conf</longdesc>
 <shortdesc lang="en">Path to conntrackd.conf</shortdesc>
-<content type="string" default="/etc/conntrackd/conntrackd.conf"/>
-</parameter>
-
-<parameter name="statefile">
-<longdesc lang="en">Full path to the state file you wish to use.</longdesc>
-<shortdesc lang="en">Full path to the state file you wish to use.</shortdesc>
-<content type="string" default="/var/run/conntrackd.master"/>
+<content type="string" default="$OCF_RESKEY_config_default"/>
 </parameter>
 </parameters>
 
 <actions>
-<action name="start"   timeout="240" />
-<action name="promote"	 timeout="90" />
-<action name="demote"	timeout="90" />
-<action name="notify"	timeout="90" />
-<action name="stop"    timeout="100" />
+<action name="start"   timeout="30" />
+<action name="promote"	 timeout="30" />
+<action name="demote"	timeout="30" />
+<action name="notify"	timeout="30" />
+<action name="stop"    timeout="30" />
 <action name="monitor" depth="0"  timeout="20" interval="20" role="Slave" />
 <action name="monitor" depth="0"  timeout="20" interval="10" role="Master" />
 <action name="meta-data"  timeout="5" />
@@ -94,11 +102,7 @@
 conntrackd_is_master() {
 	# You can't query conntrackd whether it is master or slave. It can be both at the same time. 
 	# This RA creates a statefile during promote and enforces master-max=1 and clone-node-max=1
-	if [ -e $STATEFILE ]; then
-		return $OCF_SUCCESS
-	else
-		return $OCF_ERR_GENERIC
-	fi
+	ha_pseudo_resource $statefile monitor
 }
 
 conntrackd_set_master_score() {
@@ -108,11 +112,11 @@
 conntrackd_monitor() {
 	rc=$OCF_NOT_RUNNING
 	# It does not write a PID file, so check

Re: [Linux-ha-dev] Feedback on conntrackd RA by Dominik Klein

2011-02-11 Thread Dominik Klein
Maybe you applied the s/100/$slavescore patch someone sent a couple of
weeks ago. I used the last version from the thread "New stateful RA:
conntrackd", dated October 27th, 3:29 pm.

Anyway, here's my version.

Regards
Dominik

On 02/11/2011 01:36 PM, Florian Haas wrote:
 On 2011-02-11 09:48, Dominik Klein wrote:
 Hi Florian

 it appears that the RA is good to be merged with just a few changes left
 to be done.

 Great!

 [lots of exemplary role-model patch modifications]

 Regards
 Dominik
 
 Thanks! For some reason the patch does not apply in my checkout. Can you
 just send me your version? I'll figure it out then.
 
 Cheers,
 Florian
#!/bin/bash
#
#
#   An OCF RA for conntrackd
#   http://conntrack-tools.netfilter.org/
#
# Copyright (c) 2011 Dominik Klein
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like.  Any license provided herein, whether implied or
# otherwise, applies only to this software file.  Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#

###
# Initialization:

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs

###

OCF_RESKEY_binary_default=/usr/sbin/conntrackd
OCF_RESKEY_config_default=/etc/conntrackd/conntrackd.conf
: ${OCF_RESKEY_binary=${OCF_RESKEY_binary_default}}
: ${OCF_RESKEY_config=${OCF_RESKEY_config_default}}

meta_data() {
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="conntrackd">
<version>1.1</version>

<longdesc lang="en">
Master/Slave OCF Resource Agent for conntrackd
</longdesc>

<shortdesc lang="en">This resource agent manages conntrackd</shortdesc>

<parameters>
<parameter name="conntrackd">
<longdesc lang="en">Name of the conntrackd executable.
If conntrackd is installed and available in the default PATH, it is sufficient to configure the name of the binary
For example my-conntrackd-binary-version-0.9.14
If conntrackd is installed somehwere else, you may also give a full path
For example /packages/conntrackd-0.9.14/sbin/conntrackd
</longdesc>
<shortdesc lang="en">Name of the conntrackd executable</shortdesc>
<content type="string" default="$OCF_RESKEY_binary_default"/>
</parameter>

<parameter name="config">
<longdesc lang="en">Full path to the conntrackd.conf file.
For example /packages/conntrackd-0.9.4/etc/conntrackd/conntrackd.conf</longdesc>
<shortdesc lang="en">Path to conntrackd.conf</shortdesc>
<content type="string" default="$OCF_RESKEY_config_default"/>
</parameter>
</parameters>

<actions>
<action name="start"   timeout="30" />
<action name="promote"   timeout="30" />
<action name="demote"   timeout="30" />
<action name="notify"   timeout="30" />
<action name="stop"    timeout="30" />
<action name="monitor" depth="0"  timeout="20" interval="20" role="Slave" />
<action name="monitor" depth="0"  timeout="20" interval="10" role="Master" />
<action name="meta-data"  timeout="5" />
<action name="validate-all"  timeout="30" />
</actions>
</resource-agent>
END
}

meta_expect()
{
	local what=$1 whatvar=OCF_RESKEY_CRM_meta_${1//-/_} op=$2 expect=$3
	local val=${!whatvar}
	if [[ -n $val ]]; then
		# [, not [[, or it won't work ;)
		[ $val $op $expect ] && return
	fi
	ocf_log err "meta parameter misconfigured, expected $what $op $expect, but found ${val:-unset}."
	exit $OCF_ERR_CONFIGURED
}

conntrackd_is_master() {
	# You can't query conntrackd whether it is master or slave. It can be both at the same time.
	# This RA creates a statefile during promote and enforces master-max=1 and clone-node-max=1
	ha_pseudo_resource $statefile monitor
}

conntrackd_set_master_score() {
	${HA_SBIN_DIR}/crm_master -Q -l reboot -v $1
}

conntrackd_monitor() {
	rc=$OCF_NOT_RUNNING
	# It does not write a PID file, so check with pgrep
	pgrep -f $OCF_RESKEY_binary && rc=$OCF_SUCCESS
	if [ $rc -eq $OCF_SUCCESS ]; then
		# conntrackd is running
		# now see if it acceppts queries
		if ! $OCF_RESKEY_binary -C $OCF_RESKEY_config -s > /dev/null 2>&1; then
			rc=$OCF_ERR_GENERIC
			ocf_log err "conntrackd is running but not responding to queries"

Re: [Linux-ha-dev] Feedback on conntrackd RA by Dominik Klein

2011-02-08 Thread Dominik Klein
Not yet. That's why I wrote soon_-ish_ ;)

Any release coming up you want to include this in?

 any news on this?
 
 Cheers,
 Florian
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Report on conntrackd RA

2011-01-31 Thread Dominik Klein
Hi

thanks for testing and feedback.

On 01/27/2011 01:37 PM, Marjan, BlatnikČŠŽ wrote:
 The conntrackd RA from Dominik Klein works. We can now successfully
 migrate/fail over from one node to another one.
 
 At the beginning, we had problems with failover. After a reboot/failure, the
 slave was not synced with the master. After some debugging I found that
 conntrackd must not be started at boot time, but only by pacemaker.

Like any other program managed by the cluster.
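
In other words, disable the distribution's boot script for conntrackd and let
pacemaker own the daemon, for example (standard init tooling, pick whichever
matches your distribution):

# Debian/Ubuntu (sysv-rc)
update-rc.d -f conntrackd remove
# RHEL/CentOS/SLES
chkconfig conntrackd off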

Regards
Dominik

 My
 mistake. After disabling the conntrackd boot script, failover works perfectly.
 
 If conntrackd on slave is started by init script, then master does not 
 issue
   conntrackd notify with
   OCF_RESKEY_CRM_meta_notify_type=post and
   OCF_RESKEY_CRM_meta_notify_operation=start
 and does not send a bulk update to the slave.
 Master does issue conntrackd notify with 
 OCF_RESKEY_CRM_meta_notify_type set  to pre, but since conntrackd on 
 slave is running, there is no post phase, which would send a bulk update to
 the slave.
 
 OCF_RESKEY_CRM_meta_notify_type may be ignored and send bulk update two 
 times, but it's better to control conntrackd only by pacemaker.
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Feedback on conntrackd RA by Dominik Klein

2011-01-31 Thread Dominik Klein
Just now found this thread. I will include the suggested changes and
post the new RA soon-ish.

Dominik

On 01/21/2011 08:26 AM, Florian Haas wrote:
 On 01/18/2011 04:21 PM, Florian Haas wrote:
 Our site will shortly be deploying a new HA firewall based on Linux,
 iptables, pacemaker and conntrackd.
 conntrackd[1] is used to maintain connection state of active
 connections
 across the two firewalls allowing us to failover from one firewall to
 the other without dropping any connections.

 In order to achieve this with pacemaker we needed to find a resource
 agent for conntrackd. Looking at the mailing list we found a couple of
 options although we only fully evaluated the RA produced by Dominik
 Klein as it appears to be more feature complete than the alternative.
 For a full description of his RA please see his original thread[2].

 So far throughout testing we have been very pleased with it. We can
 successfully fail between our nodes and the RA correctly handles the
 synchronisation steps required in the background.
 
 Dominik,
 
 it appears that the RA is good to be merged with just a few changes left
 to be done.
 
 * Please fix the initialization to honor $OCF_FUNCTIONS_DIR and ditch
 the redundant locale initialization.
 
 * Please rename the parameters to follow the precendents set by other
 RAs (binary instead of conntrackd, config instead of
 conntrackdconf).
 
 * Please don't require people to set a full path to the conntrackd
 binary, honoring $PATH is expected.
 
 * Please set defaults the way the other RAs do, rather than with your
 if [ -z OCF_RESKEY_whatever ] logic.
 
 * Please define the default path to your statefile in relative to
 ${HA_RSCTMP}. Also, put ${OCF_RESOURCE_INSTANCE} in the filename.
 
 * Actually, rather than managing your statefile manually, you might be
 able to just use ha_pseudo_resource().
 
 * Please revise your timeouts. Is a 240-second minimum timeout on start
 not a bit excessive?
 
 * Please revise your metadata, specifically your longdescs. The more
 useful information you provide to users, the better. Recall that that
 information is readily available to users via the man pages and crm ra
 info.
 
 Thanks!
 Cheers,
 Florian
 
 


-- 
IN-telegence GmbH
Oskar-Jäger-Str. 125
50825 Köln

Registergericht AG Köln - HRB 34038
USt-ID DE210882245
Geschäftsführende Gesellschafter: Christian Plätke und Holger Jansen
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Feedback on conntrackd RA by Dominik Klein

2011-01-31 Thread Dominik Klein
 Or, put differently: is us tracking the supposed state really necessary,
 or can we inquire it from the service somehow?
 
 From the submitted RA:
 
 # You can't query conntrackd whether it is master or slave. It can 
 be both at the same time. 
 # This RA creates a statefile during promote and enforces 
 master-max=1 and clone-node-max=1
 
 Knowing Dominik I think it's safe to assume he's done his homework on
 this, and hasn't put in this comment without careful consideration.

If I knew a way to query the state, believe me, I would use it. I
totally understand this seems ugly the way it is and I agree 100%.

However, having a master/slave RA is what the cluster needs imho to
fully support conntrackd. Encouraging people to start conntrackd from init
and then have the RA just execute commands for state shipping seemed, and
still seems, odd to me (that's what the first RA did).

 But
 I'm sure he won't mind if you manage to convince him otherwise.

Sure I won't. Maybe a newer version (if exists) includes this. I'll have
another look.

Regards
Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] New stateful RA: conntrackd

2010-10-27 Thread Dominik Klein
Hi everybody

So I updated my RA according to Florian's comments on Jonathan
Petersson's conntrackd RA. I also contacted him in order to merge our
RAs, no reply there yet. Once we talked, you will get an update by one
of us.

Regards
Dominik
#!/bin/bash
#
#
#   An OCF RA for conntrackd
#   http://conntrack-tools.netfilter.org/
#
# Copyright (c) 2010 Dominik Klein
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like.  Any license provided herein, whether implied or
# otherwise, applies only to this software file.  Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
###
# Initialization:

. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs
export LANG=C LANGUAGE=C LC_ALL=C

meta_data() {
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="conntrackd">
<version>1.1</version>

<longdesc lang="en">
Master/Slave OCF Resource Agent for conntrackd
</longdesc>

<shortdesc lang="en">This resource agent manages conntrackd</shortdesc>

<parameters>
<parameter name="conntrackd">
<longdesc lang="en">Full path to conntrackd executable</longdesc>
<shortdesc lang="en">Full path to conntrackd executable</shortdesc>
<content type="string" default="/usr/sbin/conntrackd"/>
</parameter>

<parameter name="conntrackdconf">
<longdesc lang="en">Full path to the conntrackd.conf file.</longdesc>
<shortdesc lang="en">Path to conntrackd.conf</shortdesc>
<content type="string" default="/etc/conntrackd/conntrackd.conf"/>
</parameter>

<parameter name="statefile">
<longdesc lang="en">Full path to the state file you wish to use.</longdesc>
<shortdesc lang="en">Full path to the state file you wish to use.</shortdesc>
<content type="string" default="/var/run/conntrackd.master"/>
</parameter>
</parameters>

<actions>
<action name="start"   timeout="240" />
<action name="promote"   timeout="90" />
<action name="demote"   timeout="90" />
<action name="notify"   timeout="90" />
<action name="stop"    timeout="100" />
<action name="monitor" depth="0"  timeout="20" interval="20" role="Slave" />
<action name="monitor" depth="0"  timeout="20" interval="10" role="Master" />
<action name="meta-data"  timeout="5" />
<action name="validate-all"  timeout="30" />
</actions>
</resource-agent>
END
}

meta_expect()
{
	local what=$1 whatvar=OCF_RESKEY_CRM_meta_${1//-/_} op=$2 expect=$3
	local val=${!whatvar}
	if [[ -n $val ]]; then
		# [, not [[, or it won't work ;)
		[ $val $op $expect ] && return
	fi
	ocf_log err "meta parameter misconfigured, expected $what $op $expect, but found ${val:-unset}."
	exit $OCF_ERR_CONFIGURED
}

conntrackd_is_master() {
	# You can't query conntrackd whether it is master or slave. It can be both at the same time.
	# This RA creates a statefile during promote and enforces master-max=1 and clone-node-max=1
	if [ -e $STATEFILE ]; then
		return $OCF_SUCCESS
	else
		return $OCF_ERR_GENERIC
	fi
}

conntrackd_set_master_score() {
	${HA_SBIN_DIR}/crm_master -Q -l reboot -v $1
}

conntrackd_monitor() {
	rc=$OCF_NOT_RUNNING
	# It does not write a PID file, so check with pgrep
	pgrep -f $CONNTRACKD && rc=$OCF_SUCCESS
	if [ $rc = $OCF_SUCCESS ]; then
		# conntrackd is running
		# now see if it acceppts queries
		if ! ($CONNTRACKD -C $CONNTRACKD_CONF -s > /dev/null 2>&1); then
			rc=$OCF_ERR_GENERIC
			ocf_log err "conntrackd is running but not responding to queries"
		fi
		if conntrackd_is_master; then
			rc=$OCF_RUNNING_MASTER
			# Restore master setting on probes
			if [ $OCF_RESKEY_CRM_meta_interval -eq 0 ]; then
				conntrackd_set_master_score $master_score
			fi
		else
			# Restore master setting on probes
			if [ $OCF_RESKEY_CRM_meta_interval -eq 0 ]; then
				conntrackd_set_master_score $slave_score
			fi
		fi
	fi
	return $rc
}

conntrackd_start() {
	rc=$OCF_ERR_GENERIC

	# Keep

[Linux-ha-dev] New stateful RA: conntrackd

2010-10-15 Thread Dominik Klein
Hi everybody,

I wrote a master/slave RA to manage conntrackd, the connection tracking
daemon from the netfilter project. Conntrackd is used to replicate
connection state between highly available stateful firewalls.

Conntrackd replicates data using multicast. Basically it sends state
information about connections written to its kernel's connection tracking
table. Replication slaves write these updates to an external cache.

When a firewall is to take over the master role, it commits the external
cache to the kernel and thus knows the connections that were previously
running through the old master system, so clients can continue working
without having to open a new connection.

While there has been an RA for conntrackd (at least I found something
that looked like it in a pastebin using google), that one was not able
to deal with failback, which is something I needed, and it was not yet
included in the repository. I hope this one will be included.

The main challenge in this RA was the failback part. Say one system goes
down completely. Then it loses the kernel connection tracking table and
the external cache. Once it comes back, it will receive updates for new
connections that are initiated through the master, but it will neither
be sent the complete tracking table of the current master, nor can it
request this (that's how I understand conntrackd to work, and how it behaved
in my tests; please correct me if I'm wrong :)).

This may be acceptable for short-lived connections and configurations
where there is no preferred master system, but it does become a problem
if you have either of those.

So my approach is to send a so called bulk update in two situations:

a) in the notify pre promote call, if the local machine is not the
machine to be promoted
This part is responsible for sending the update to a preferred master
that had previously failed (failback).
b) in the notify post start call, if the local machine is the master
This part is responsible for sending the update to a previously failed
machine that re-joins the cluster but is not to be promoted right away.
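
In shell terms, the idea is roughly this (a sketch only, not the RA's actual
notify code; it assumes the $CONNTRACKD/$CONNTRACKD_CONF variables used
elsewhere in the script and the usual OCF notify environment variables, and
my reading is that "conntrackd -B" is what forces such a bulk update):

conntrackd_notify_sketch() {
    local type_op
    type_op="$OCF_RESKEY_CRM_meta_notify_type-$OCF_RESKEY_CRM_meta_notify_operation"
    case "$type_op" in
    pre-promote)
        # a) another node is about to be promoted: if it is not us,
        # push our state to it so a failback does not lose connections
        if [ "$OCF_RESKEY_CRM_meta_notify_promote_uname" != "$(uname -n)" ]; then
            $CONNTRACKD -C $CONNTRACKD_CONF -B
        fi
        ;;
    post-start)
        # b) a node just (re)joined: if we are the master, send a bulk
        # update so it gets the full table, not only new connections
        if conntrackd_is_master; then
            $CONNTRACKD -C $CONNTRACKD_CONF -B
        fi
        ;;
    esac
    return $OCF_SUCCESS
}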

For now I limited the RA to deal with only 2 clones and 1 master since
this is the only testbed I have and I am not 100% sure what happens to
the new master in situation a) if there are multiple slaves.

Configuration could look like this, notify=true is important:

primitive conntrackd ocf:intelegence:conntrackd \
op monitor interval=10 timeout=10 \
op monitor interval=11 role=Master timeout=10
primitive ip-extern ocf:heartbeat:IPaddr2 \
params ip=10.2.50.237 cidr_netmask=24 \
op monitor interval=10 timeout=10
primitive ip-intern ocf:heartbeat:IPaddr2 \
params ip=10.2.52.3 cidr_netmask=24 \
op monitor interval=10 timeout=10
ms ms-conntrackd conntrackd \
meta target-role=Started globally-unique=false notify=true
colocation ip-intern-extern inf: ip-extern ip-intern
colocation ips-on-conntrackd-master inf: ip-intern ms-conntrackd:Master
order ips-after-conntrackd inf: ms-conntrackd:promote ip-intern:start

Please review and test the RA, post comments and questions. Maybe it can
be included in the repository.

Regards
Dominik

ps. yes, some parts are from linbit's drbd RA and some parts may also be
from Andrew's Stateful RA. Hope that's okay.

-- 
IN-telegence GmbH
Oskar-Jäger-Str. 125
50825 Köln

Registergericht AG Köln - HRB 34038
USt-ID DE210882245
Geschäftsführende Gesellschafter: Christian Plätke und Holger Jansen
#!/bin/bash
#
#
#   An OCF RA for conntrackd
#   http://conntrack-tools.netfilter.org/
#
# Copyright (c) 2010 Dominik Klein
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like.  Any license provided herein, whether implied or
# otherwise, applies only to this software file.  Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
###
# Initialization:

. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs
export LANG=C LANGUAGE=C LC_ALL=C

meta_data() {
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="conntrackd">
<version>1.1</version>

<longdesc lang="en">
Master/Slave OCF Resource Agent for conntrackd
</longdesc>

shortdesc lang

Re: [Linux-HA] Need suggestion on STONITH device

2010-04-07 Thread Dominik Klein
On 04/01/2010 08:54 PM, Tony Gan wrote:
 Thank you for your reply, Dominik.
 I think UPS or PDU in this case is a better solution than a lights-out
 device, since they have separate power supply.

They do?

They _are_ the power supply for the node. So if the PDU supply is off,
the node is off. I have not seen a PDU with multiple inputs yet (but
there may be such a device, I am no expert on that).

Regards
Dominik

 And I don't think we need to manage UPS or PDU's failure by our self, the
 manufacturer should take responsibility of this. Am I correct?
 
 
 But yes, probably need additional budgets for this.
 
 Anyway, again, thanks for your advice. I'm going to do some research on
 them.
 
 
 
 On Thu, Apr 1, 2010 at 6:38 AM, Dominik Klein d...@in-telegence.net wrote:
 
 Tony Gan wrote:
 Hi,
 For a two-node cluster, what are the best STONITH devices?

 Currently I am using Dell's iDrac for STONITH device. It works pretty
 well.
 However the biggest problem for iDrac or any other lights-out devices is
 that they share power supply with hosts machines.

 Once an active machine lost its power completely, you want to fail-over
 to
 the backup-node in your cluster.
 But with iDrac as your STONITH device you can not, because the STONITH
 resource on backup node will run into error (fail to connect to STONITH
 device, it's out of power too) , and refuse to start any resources.


 I was wondering what kind of STONITH devices everybody is using to solve
 this problem. And how much are they?

 Actually Pacemaker's page have a link talking about this:
 http://www.clusterlabs.org/doc/crm_fencing.html

 It suggests UPS (Uninterruptible Power Supply) as well as PDU (Power
 Distribution Unit).
 Anybody used them before? How well are they integrated with Heartbeat?
 What
 are the pros and cons?

 Hi

 I am using APC PDUs for my clusters.

 The setup is like:

 power supply circuit 1 - pdu 1 - node 1
 power supply circuit 2 - pdu 2 - node 2

 If a node fails, the corresponding pdu usually is accessible and
 manageable.

 However, if a pdu fails (and they probably can fail in ways we cannot
 really imagine (to quote Dejan)) that renders the same problem as yours.
 The node is down, the stonith device is down, so no resource takeover.

 But imho, this is not resolvable. At least I do not know of a way how
 to. If a PDU or UPS fails (node down and power device down), then the
 resources for the failed node will not be recovered since the failed
 node cannot be shot.

 Regards
 Dominik
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Need suggestion on STONITH device

2010-04-01 Thread Dominik Klein
Tony Gan wrote:
 Hi,
 For a two-node cluster, what are the best STONITH devices?
 
 Currently I am using Dell's iDrac for STONITH device. It works pretty well.
 However the biggest problem for iDrac or any other lights-out devices is
 that they share power supply with hosts machines.
 
 Once an active machine lost its power completely, you want to fail-over to
 the backup-node in your cluster.
 But with iDrac as your STONITH device you can not, because the STONITH
 resource on backup node will run into error (fail to connect to STONITH
 device, it's out of power too) , and refuse to start any resources.
 
 
 I was wondering what kind of STONITH devices everybody is using to solve
 this problem. And how much are they?
 
 Actually Pacemaker's page have a link talking about this:
 http://www.clusterlabs.org/doc/crm_fencing.html
 
 It suggests UPS (Uninterruptible Power Supply) as well as PDU (Power
 Distribution Unit).
 Anybody used them before? How well are they integrated with Heartbeat? What
 are the pros and cons?

Hi

I am using APC PDUs for my clusters.

The setup is like:

power supply circuit 1 - pdu 1 - node 1
power supply circuit 2 - pdu 2 - node 2

If a node fails, the corresponding pdu usually is accessible and manageable.

However, if a pdu fails (and they probably can fail in ways we cannot
really imagine (to quote Dejan)) that renders the same problem as yours.
The node is down, the stonith device is down, so no resource takeover.

But imho, this is not resolvable. At least I do not know of a way how
to. If a PDU or UPS fails (node down and power device down), then the
resources for the failed node will not be recovered since the failed
node cannot be shot.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] messages from existing hearbeat on the same lan

2010-01-19 Thread Dominik Klein
Aclhk Aclhk wrote:
 On the same lan, there are already two heartbeat node 136pri and 137sec.
 
 I setup another 2 nodes with heartbeat. they keep receiving the following 
 messages:
 
 heartbeat[9931]: 2010/01/19_10:53:01 WARN: string2msg_ll: node [136pri] 
 failed authentication
 heartbeat[9931]: 2010/01/19_10:53:02 WARN: Invalid authentication type [1] in 
 message!
 heartbeat[9931]: 2010/01/19_10:53:02 WARN: string2msg_ll: node [137sec] 
 failed authentication
 heartbeat[9931]: 2010/01/19_10:53:02 WARN: Invalid authentication type [1] in 
 message!
 
 ha.cf
 debugfile /var/log/ha-debug
 logfile /var/log/ha-log
 logfacility local0
 bcast eth0
 keepalive 5
 warntime 10
 deadtime 120
 initdead 120
 auto_failback off
 node 140openfiler1
 node 141openfiler2
 
 bcast for all nodes are same, that is eth0
 
 pls advise how to avoid the messages.

Use mcast or ucast instead of bcast?
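
For example, in ha.cf (interface and addresses are placeholders):

# unicast: one line per peer node
ucast eth0 192.168.1.2
# or multicast: device, multicast group, port, ttl, loop
mcast eth0 239.0.0.42 694 1 0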
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] ulimit in ocf scripts

2010-01-13 Thread Dominik Klein
Andrew Beekhof wrote:
 On Tue, Jan 12, 2010 at 10:43 AM, Raoul Bhatia [IPAX] r.bha...@ipax.at 
 wrote:
 On 01/12/2010 10:39 AM, Florian Haas wrote:
 Why not simply set that for root at boot? (it rhymes too :)
 because i do not like the idea that each and every process gets
 elevated limits by default.

 i think that there *should* be a generic way to configure ulimits an a
 per resource basis.
 I'm confident Dejan would be happy to accept a patch in which you add
 such a parameter to each resource agent where it makes sense.
 of course this would be possible. but i *think* it is more helpful to
 add this to e.g. the cib/lrmd/you name it.

 so before i/we implement the ulimit stuff *inside* lots of different
 RAs, i'd like to hear beekhof's or lars' comments.
 
 If you want a configurable per-resource limit - thats a resource parameter.
 Why would we want to implement another mechanism?

Of course this would be a resource parameter.

I think what he meant to say was that he does not want to have the
change inside every RA executing the ulimit command but to have some
cluster component (probably lrmd) do that.
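
For comparison, the per-RA parameter approach would boil down to something
like this in an agent's start action (the parameter name here is invented,
just for illustration):

# hypothetical parameter, exposed to the agent as OCF_RESKEY_max_open_files
if [ -n "$OCF_RESKEY_max_open_files" ]; then
    ulimit -n "$OCF_RESKEY_max_open_files" || exit $OCF_ERR_GENERIC
fi
# ... then start the daemon as usual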

Regards
Dominik
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] [PATCH]Support of stop escalation for mysql-RA.

2009-12-01 Thread Dominik Klein
I'd suggest an approach like Florian's from the Virtualdomain RA. Here's
a quote, guess you get the idea.

shutdown_timeout=$((($OCF_RESKEY_CRM_meta_timeout/1000)-5))
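
The pattern behind that line, roughly (a sketch only; the helper names and
the pid variable are made up, not the actual RA code):

# the stop op's own timeout arrives in milliseconds; keep a few seconds
# of headroom for the forced kill and cleanup
shutdown_timeout=$((($OCF_RESKEY_CRM_meta_timeout/1000)-5))
mysql_graceful_stop
count=0
while mysql_is_running && [ $count -lt $shutdown_timeout ]; do
    sleep 1
    count=$((count+1))
done
if mysql_is_running; then
    ocf_log warn "mysqld did not stop in time, escalating to SIGKILL"
    kill -9 $mysql_pid
fi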

Regards
Dominik

Dejan Muhamedagic wrote:
 Hi Hideo-san,
 
 On Mon, Nov 30, 2009 at 11:00:05AM +0900, renayama19661...@ybb.ne.jp wrote:
 Hi,

 We discovered a problem in an test of mysql.

 It is the problem that mysql cannot stop.
 This problem seems to occur at the time of diskfull and high CPU load.

 We included an escalation stop like pgsql.

 The problem is broken off by this revision, and a stop succeeds.

 Please commit this patch in a development version.
 
 Many thanks for the patch. I'm just not sure about the default
 escalate time. You set it to 30 seconds, perhaps it should be set
 to something longer. Otherwise some cluster configurations where
 the stop operation takes longer may have problems. I have no idea
 which value should we use, but I would tend to make it longer rather
 than shorter.
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] heartbeat - execute a script on a running node when the other node is back?

2009-11-16 Thread Dominik Klein
Tomasz Chmielewski wrote:
 Dejan Muhamedagic wrote:
 Hi,

 On Sun, Nov 15, 2009 at 09:09:53PM +0100, Tomasz Chmielewski wrote:
 I have two nodes, node_1 and node_2.

 node_2 was down, but is now up.


 How can I execute a custom script on node_1 when it detects that node_2 
 is back?
 That's not possible. What would you want to with that script?
 
 I have two PostgreSQL servers running; pgpool-ii is started by Heartbeat 
 to distribute the load (reads) among two servers and to send writes to 
 both servers.
 
 When one PostgreSQL server fails, the setup will still work fine. When 
 the failed PostgreSQL instance is back, the data should be first 
 synchronized from the running PostgreSQL server to a server which was 
 failed a while ago.
 
 It is best if such a script could be started by Heartbeat running on the 
 active node, as soon as it detects that the other node is back.

If you need such a thing, I'd personally be most comfortable with not
starting the cluster at boot time. Then you can do whatever you need to
do and then - when you _know_ everything is right, the script is done
etc. - start the cluster software.

Just my personal preference.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Restrict resources to specific nodes only

2009-09-29 Thread Dominik Klein
Kenneth Simbron wrote:
 Hi,
 
 Is there a way to restrict some resources to work only on specific nodes and
 other resources on another nodes?

http://clusterlabs.org/mediawiki/images/f/fb/Configuration_Explained.pdf

Read up on location constraints.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] Monitor operation for the Filesystem RA

2009-09-16 Thread Dominik Klein
Dejan Muhamedagic wrote:
 Hi Florian,
 
 On Wed, Sep 16, 2009 at 08:25:30AM +0200, Florian Haas wrote:
 Lars, Dejan,

 as discussed on #linux-ha yesterday, I've pushed a small changeset to
 the Filesystem RA that implements a monitor operation which checks
 whether I/O on the mounted filesystem is in fact possible. Any
 suggestions for improvement would be most welcome.
 
 IMO, the monitor operation is now difficult to understand.  I
 don't mean the code, I didn't take a look at the code yet, but
 the usage. Also, as soon as you set the statusfile_prefix
 parameter, the 0 depth monitor changes behaviour. I don't find
 that good. The basic monitor operation should remain the same and
 just test if the filesystem is mounted as it always used to.

I agree.

 The new parameter should influence only the monitor operations of
 higher (deeper :) depth. So, I'd propose to have two depths, say
 10 and 20, of which the first would be just the read test and the
 second read-write.

Why not 1 and 2?

Then we'd have
0 = old behaviour
1 = read
2 = read/write

 Finally, the statusfile_prefix should be optional for deeper
 monitor operations and default to .${OCF_RESOURCE_INSTANCE}. If
 OCF_RESOURCE_INSTANCE doesn't contain the clone instance, then we
 should append the clone instance number (I suppose that it's
 available somewhere).

As fgh said, when you want to monitor a readonly fs, you'd have to know
the clone instance number for creating the file to read from. Not a good
idea imho. Or you'll have several files around which would be even more
ugly when you think about a larger cluster.

Why do we have to make the name configurable at all? Why not just give
it a generic name and only let the user configure OCF_CHECK_LEVEL for
each monitor? That said, I have not dealt with cluster filesystems yet.
Was the hostname-idea to avoid having multiple monitor instances trying
to write to one file and maybe run into locking/timeout issues?

Regards
Dominik

 I hope that this way the usage would be more straightforward. At
 least it looks so to me.
 
 Do I win the prize for the longest changeset description or what? ;)
 
 We need good documentation. I think it's great to write such
 descriptions :)
 
 Cheers,
 
 Dejan
 
 Cheers,
 Florian
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] how to get group members

2009-08-19 Thread Dominik Klein
Ivan Gromov wrote:
 Hi, all
 
 How to get group members?
 I use crm_resource -x -t group -r group_Name. Can I get members without 
 xml part?

What about

crm configure show group-name ?

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Constraints works for one resource but not for another

2009-08-17 Thread Dominik Klein
Tobias Appel wrote:
 Hi,
 
 I have a very weird error with heartbeat version 2.14.
 
 I have two IPMI resources for my two nodes. The configuration is posted 
 here: http://pastebin.com/m52c1809c
 
 node1 is named nagios1
 node2 is named nagios2
 
 now I have ipmi_nagios1 (which should run on nagios2 to shutdown nagios1)
 and ipmi_nagios2 (which should run on nagios1 to shutdown nagios2).
 
 It's confusing I know.
 
 Now I set up to constraints which force with score infinity a resource 
 to only run on their designated node.
 
 For the resource ipmi_nagios2 it works without a problem. It only runs 
 on nagios1 and is never started on nagios2. But the other resource which 
 is identically configured (just the hostname differs) does not work - 
 heartbeat always wants to start it on nagios1 and very seldom starts it 
 on nagios2. Just now it failed to start on nagios1 and I hit clean up 
 resource, waited a bit, failed again and after 3 times the cluster went 
 havoc and turned off one of the nodes!
 
 I even tried to set a constraint via the UI - it's then labeled 
 cli-constraint-name but even with this as well heartbeat still tried to 
 start it on the wrong node!
 
 Now I'm really at a loss, maybe my configuration is wrong, or maybe it 
 really is a bug in heartbeat.
 
 Here is the link to the configuration again: http://pastebin.com/m52c1809c
 
 I honestly don't know what to do anymore. I have to stop the ipmi 
 service at the moment because otherwise it might randomly turn off one 
 of the nodes, but without it we don't have any fencing so it's a quite 
 delicate situation at the moment.
 
 Any input is greatly appreciated.
 
 Regards,
 Tobias

The constraints look okay, but without logs, we cannot say why it does
not do what you want.

Also: look at the stonith device configuration: Is it okay for both
primitives to have the same ip configured? I'd guess that will not
successfully start the resource!? Maybe that's it already.

I'd guess there was some failure before which brought up this situation
(probably stonith start fail and stop fail)?

It shouldn't turn off nodes at random. There's usually a pretty good
reason when the cluster does this.

Btw: the better way to make sure a particular resource is only started
on one node but never on the other is usually to configure -INFINITY for
the other node instead of INFINITY for the node you want it to run on.
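
In the XML syntax of 2.1.4, such a rule would look roughly like this (ids
are made up; the mirror image for ipmi_nagios2/nagios2 works the same way):

<rsc_location id="loc-ipmi-nagios1" rsc="ipmi_nagios1">
  <rule id="loc-ipmi-nagios1-rule" score="-INFINITY">
    <expression id="loc-ipmi-nagios1-expr" attribute="#uname" operation="eq" value="nagios1"/>
  </rule>
</rsc_location>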

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pacemaker 1.4 HBv2 1.99 // About quorum choice (contd.)

2009-08-07 Thread Dominik Klein
Alain.Moulle wrote:
 Hello Andrew,
 Could you explain why this functionnality is no more available 
 (configuration
 lines remain in ha.cf) ?

ipfail was replaced by pingd in v2. That was in the very first version
of v2 afaik.

 And how should we proceed to avoid split-brain cases in a two-nodes 
 cluster  in case
 of problems on heartbeat network ?

Make that heartbeat network several networks (plural) to reduce the chance
of getting into a split-brain situation, and get and configure stonith
devices to protect your data in case it happens anyway.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Command to see if a resource is started or not

2009-08-05 Thread Dominik Klein
Tobias Appel wrote:
 Hi,
 
 I need a command to see if a resource is started or not. Somehow my IPMI 
 resource does not always start, especially on one node (for example if I 
 reboot the node, or have a failover). There is no error and nothing, it 
 just does nothing at all.
 Usually I have to clean up the resource and then it comes back by itself.
 This is not really a problem since this only occurs after a failover or 
 reboot and when that happens, somebody usually takes a look at the 
 cluster anyway. But some people forget to start it again, and when we do 
 maintenance we have to turn it off on purpose since it would go wreck 
 havoc and turn off one of the nodes.
 
 So all I need is a command line tool to check wether a resource is 
 currently started or not. I tried to check the resources with the 
 failcount command, but it's always 0. And the crm_resource command is 
 used to configure a resource but does not seem to give me the status of 
 a resource.
 
 I know I can use crm_mon but I would rather have a small command since I 
 could include this in our monitoring tool (nagios).

crm resource status resid

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Command to see if a resource is started or not

2009-08-05 Thread Dominik Klein
Tobias Appel wrote:
 On 08/05/2009 10:30 AM, Dominik Klein wrote:
 Tobias Appel wrote:
 So all I need is a command line tool to check wether a resource is
 currently started or not. I tried to check the resources with the
 failcount command, but it's always 0. And the crm_resource command is
 used to configure a resource but does not seem to give me the status of
 a resource.

 I know I can use crm_mon but I would rather have a small command since I
 could include this in our monitoring tool (nagios).
 crm resource statusresid

 Regards
 Dominik
 
 Thanks for the fast reply Dominik,
 
 I forgot to mention that I still run Heartbeat version 2.1.4.
 It seems crm_resource does not respond to the status flag. Or am I too 
 stupid?

It is not crm_resource, I meant crm resource (notice the blank).

But the crm command is not in 2.1.4

Try crm_resource -W -r resid

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pacemaker 1.4 HBv2 1.99 // About quorum choice (contd.)

2009-08-05 Thread Dominik Klein
Alain.Moulle wrote:
 Thanks Andrew,
 
 1. So my understanding is that in a more than 2 nodes cluster , if
 two nodes are failed, the have_quorum is set to 0 by the cluster soft
 and the behavior is choosen by the administrator with the no-quorum-policy
 parameter. So the question is now : what is the best choice for 
 no-quorum-policy
 value ? My feeling is that ignore would be the best choice if all services
 can run without problems on the remaining healthy nodes.

That's not the only case in which this can happen. If you run into split-brain,
each node may be healthy but the network connections may be broken. With
ignore, you will end up with resources running multiple times. That's
a problem sometimes ;)

Don't use ignore in 2 node clusters.

 suicide or stop : my understanding is that it will kill the 
 remaining healthy nodes or
 stop the services running on them, so it does not sound good for me ... 
 freeze : don't see the difference between freeze and ignore ... ?
 
 Am I right ?
 
 2. and what about the quorum policy in a two-nodes cluster ?

You need working stonith and no-quorum-policy=ignore, as a single node never
has more than 50% of the votes on its own. When the connection is lost, one node will shoot the other. The
cluster software should not be started at boot time, otherwise you will
end up in a stonith death match. There was quite a nice explanation on
the pacemaker list some time ago. Look for STONITH Deathmatch Explained
in the archives.
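
In crm shell terms, the relevant bits are roughly (plus an actual stonith
resource such as external/ipmi or apcmastersnmp, which is not shown here):

crm configure property no-quorum-policy=ignore
crm configure property stonith-enabled=true
# and disable the cluster init script at boot
# (chkconfig/update-rc.d, depending on the distribution)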

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] updating cib without status attributes

2009-07-28 Thread Dominik Klein
 You can try to compose the output of cibadmin -Q -o
 crm_config|resources|constraints to something usable for you.

 
 looks like I have to run the command once for each type and then
 concatenate the results.

That's sort of what I meant to say. Sorry for being unclear.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Adding to a group without downtime

2009-07-28 Thread Dominik Klein
Gavin Hamill wrote:
 Hi :)
 
 I'm using the Lenny packages http://people.debian.org/~madkiss/ha/ and
 have been enjoying success with pacemaker + heartbeat (I've used a
 heartbeat v1 config for years without problems).
 
 I have a few IPaddr2 primitives in groups, but I'd like to understand
 how I can add a new primitive into an existing group without
 stopping/starting. At the moment I have to:
 
 # primitive failover-ip-186 ocf:heartbeat:IPaddr2 params ip=10.8.2.186
 op monitor interval=10s 
 # show www-frontends 
 group www-frontends failover-ip-184 failover-ip-185
 # delete www-frontends 
 # group www-frontends failover-ip-184 failover-ip-185 failover-ip-186
 # commit 
 
 But that takes failover-ip-184 and failover-ip-185 down for a couple of
 seconds. Is there a way to add a new primitive to a group with zero
 downtime?
 
 I tried using 'edit www-frontends' to stick failover-ip-186 on the end
 of the line but it complains loudly Call cib_modify failed (-47):
 Update does not conform to the configured schema/DTD

Which version? Can you post the actual in- and output?

Afaik, that's supposed to work.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] updating cib without status attributes

2009-07-27 Thread Dominik Klein
 Is there a query/config dump setting that will dump the running config to the 
 command line without the status attributes?

cibadmin -Q -o configuration
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] updating cib without status attributes

2009-07-27 Thread Dominik Klein
Dave Augustus wrote:
 On Mon, 2009-07-27 at 15:09 +0200, Dominik Klein wrote:
 Is there a query/config dump setting that will dump the running config to 
 the command line without the status attributes?
 cibadmin -Q -o configuration
 
 What a quick reply!
 
 However I get: 
 
 Call cib_query failed (-48): Invalid CIB section specified
 

It might be that querying the configuration section was only implemented
later. Honestly, no idea.

You can try to compose the output of cibadmin -Q -o
crm_config|resources|constraints to something usable for you.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Stonith with APC Smart UPS 1000 +Network ManagementCard

2009-07-10 Thread Dominik Klein
Ehlers, Kolja wrote:
 Yeah it supports SSH but if I log in using SSH there is just a menu to 
 configure the card. Since I can enter only 2 digits at that
 prompt
 
  1- Control
  2- Diagnostics
  3- Configuration
  4- Detailed Status
  5- About UPS
 
  ESC- Back, ENTER- Refresh, CTRL-L- Event Log
 
 I think it will not accept commands to shut down a node. But maybe someone 
 knows better? 

That looks about like the config menu of my AP7920. I've never seen a
command line interface.

Try activating snmp and manage outlets via snmp.
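
For reference, outlet control on APC units of that class usually goes through
the PowerNet MIB's sPDUOutletCtl object, roughly like this (IP, community and
outlet number are placeholders; please verify the OID against your unit's MIB):

# set outlet 3 to "reboot" (1 = on, 2 = off, 3 = reboot)
snmpset -v1 -c private 192.168.0.10 .1.3.6.1.4.1.318.1.1.4.4.2.1.3.3 i 3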

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Stonith with APC Smart UPS1000 +Network ManagementCard

2009-07-10 Thread Dominik Klein
The stonith plugins are part of the heartbeat package iirc. I think
there were reports in the past that those were built without snmp
support and thus the apcmastersnmp plugin was not included. Maybe you
have to build it yourself.

I have no idea whether it will work for your device. But if it is just a
different mib, it should not be too hard to modify.

Regards
Dominik

Ehlers, Kolja wrote:
 Hello,
 
 do you think that the apcmastersnmp plugin will work? Dejan noted that the 
 MIB probably will not fit. I have installed the
 pacemaker-mgmt-1.99.1-2.2.i586.rpm package and also am monitoring my cluster 
 through snmp but the apcmastersnmp plugin is not
 installed. Is it in one of those packages:
 
 pacemaker-mgmt-client-1.99.2-1.1.i586.rpm 
 pacemaker-mgmt-devel-1.99.2-1.1.i586.rpm  
 
 -Ursprüngliche Nachricht-
 Von: linux-ha-boun...@lists.linux-ha.org 
 [mailto:linux-ha-boun...@lists.linux-ha.org] Im Auftrag von Dominik Klein
 Gesendet: Freitag, 10. Juli 2009 08:27
 An: General Linux-HA mailing list
 Betreff: Re: [Linux-HA] Stonith with APC Smart UPS1000 +Network ManagementCard
 
 Ehlers, Kolja wrote:
 Yeah it supports SSH but if I log in using SSH there is just a menu to 
 configure the card. Since I can enter only 2 digits at that prompt

  1- Control
  2- Diagnostics
  3- Configuration
  4- Detailed Status
  5- About UPS

  ESC- Back, ENTER- Refresh, CTRL-L- Event Log

 I think it will not accept commands to shut down a node. But maybe someone 
 knows better? 
 
 That looks about like what the cfg menu of my AP7920 looks like. I've never 
 seen a cmdline interface.
 
 Try activating snmp and manage outlets via snmp.
 
 Regards
 Dominik
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
 Geschäftsführung: Dr. Michael Fischer, Reinhard Eisebitt
 Amtsgericht Köln HRB 32356
 Steuer-Nr.: 217/5717/0536
 Ust.Id.-Nr.: DE 204051920
 --
 This email transmission and any documents, files or previous email
 messages attached to it may contain information that is confidential or
 legally privileged. If you are not the intended recipient or a person
 responsible for delivering this transmission to the intended recipient,
 you are hereby notified that any disclosure, copying, printing,
 distribution or use of this transmission is strictly prohibited. If you
 have received this transmission in error, please immediately notify the
 sender by telephone or return email and delete the original transmission
 and its attachments without reading or saving in any manner.
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Resource set question

2009-07-10 Thread Dominik Klein
Steinhauer Juergen wrote:
 Hi guys!
 
 In my cluster setup, I have 6 IP addresses which should be started in
 parallel for speed purpose, and two apps, depending on the six addresses.
 
 What would be the best way to configure this?
 Putting all IPs in a group will start them one after another. Bad.

Set ordered=false for the ip group. That will start them in parallel. I
think. Then specify a resource order constraint to start your app group
after the ip group and a colocation constraint to have the apps on the
same node as the ips.
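
A sketch in crm shell syntax (resource names are placeholders):

group grp-ips ip1 ip2 ip3 ip4 ip5 ip6 \
        meta ordered=false
group grp-apps app1 app2
colocation apps-with-ips inf: grp-apps grp-ips
order apps-after-ips inf: grp-ips grp-apps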

Regards
Dominik

 A colocation with a set including the IPs (sequential=false) and the
 two apps will not migrate the apps (and the remaining IPs), if one IP
 should fail...
 
 I'm happy about every proposal.
 
 Regards
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Master-slave, stopping a slave.

2009-07-08 Thread Dominik Klein
c smith wrote:
 Hi-
 
 I currently implement DRBD with Pacemaker.  The DRBD resource is configured
 as a multi-state Master-slave resource in which node1 is the default master
 and node2 is the default slave.  I am putting together a backup system that
 will run  some automated scheduled tasks on node2 (assuming both nodes are
 online).  Such tasks include disconnecting the DRBD resource temporarily and
 promoting it to primary/standalone.  I'm wondering, is there a way to
 manually stop the child clone on that node before starting the backup, then
 restarting it after?  Something along the lines of `crm resource ms-drbd:1
 stop`.
 
 I have yet to test the system as it is but, from what I understand, if I
 manually disconnect the resource, Pacemaker's monitor functions will not be
 happy and likely throw monitor errors and/or try to restart and reconnect
 the resource.  The Pacemaker's user guide does not go into detail regarding
 managing multi-state child resources.  Is it possible to do without stopping
 the entire m-s resource?

It has been stated that it is not intended to mess with the internals of
child-management. Never to use child-instance-ids in constraints and so
on. They will just cause you unwanted trouble.

I'd probably just unmanage the drbd resource, then the cluster will not
monitor it. Then do whatever you need to do and afterwards, let the
cluster manage drbd again.

Look into the is-managed meta attribute.
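
With the crm shell that would be something like (the resource name is a
placeholder for your m/s resource id):

crm resource unmanage ms-drbd
# disconnect drbd, run the backup, reconnect and resync ...
crm resource manage ms-drbd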

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Master-slave, stopping a slave.

2009-07-08 Thread Dominik Klein
c smith wrote:
 Dominik-
 
 Thanks for the reply.. I'm aware that the documents advise against it, but
 surely there must be a way.  I was just looking at the new DRBD 8.3.2.  It
 includes a fencing handler script that, upon failure of a DRBD master, adds
 a -INFINITY location constraint into the CIB that prevents the MS resource
 from being 'Master' on the failed node until resync/uptodate (at which point
 its deleted from the CIB)
 
 Pasted from ./drbd-8.3.2/scripts/crm-fence-peer.sh:
 
 new_constraint="\
 <rsc_location rsc=\"$master_id\" id=\"$id_prefix-$master_id\">
   <rule role=\"Master\" score=\"-INFINITY\" id=\"$id_prefix-rule-$master_id\">
     <expression attribute=\"$fencing_attribute\" operation=\"ne\" value=\"$fencing_value\" id=\"$id_prefix-expr-$master_id\"/>
   </rule>
 </rsc_location>"
 
 This constraint, when in place, succesfully stops the child from running on
 that node.  Wondering if it is possible to do the same with the slave
 resource.
 
 I will also try out unmanage'ing the resource and see if that will
 accomplish similar. 

It will not stop the slave. It will unmanage the entire m/s drbd
resource, meaning the cluster will no longer monitor its status. So if
you disconnect drbd and do your backup business of whatever kind, the
cluster will not notice. Then you can re-connect drbd and let pacemaker
manage the resource again.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Add resource to a group

2009-06-26 Thread Dominik Klein
David Hoskinson wrote:
 I tested the lsb for amavis and it passed all the tests.  I was seeing
 timeout errors in the logs saying over 20s.  By specifying only timeout
 instead of both interval and timeout in my start op is that what you mean...
 Such as
 
 Primitive amavisd lsb:amavisd op monitor timeout=45s

You're supposed to add a _start_ op, not a _monitor_ op.

Like

primitive ... op start timeout=45s

Regards
Dominik

 Thanks for the help
 
 
 On 6/25/09 8:44 AM, David Hoskinson dhoskin...@eng.uiowa.edu wrote:
 
 Hmmm I understand I will try this in a bit, thanks for the tip


 On 6/25/09 1:20 AM, Dominik Klein d...@in-telegence.net wrote:

 David Hoskinson wrote:
 Thanks Got it going again.  However my amavisd service fails with a
 unknown exec error.  Its the only one that won't work, and isn't related to
 the group question.  I have it setup the same as postfix, dovecot, etc.

 Primitive amavisd lsb:amavisd op monitor interval=30s timeout=30s
 Is that amavisd script LSB compliant? See
 http://www.linux-ha.org/LSBResourceAgent for how to check on that.

 Just wondering if its taking too long to start or what...  If you have any
 ideas that would be great.
 If start takes longer than the default operation timeout (20s by
 default), then you should see that in the logs.

 You can work around that by specifying a start op with timeout=whatever
 amount of time you need. Make sure you do NOT set an interval for the
 start op. Seems to be a common mistake.

 Regards
 Dominik
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Add resource to a group

2009-06-25 Thread Dominik Klein
David Hoskinson wrote:
 Thanks Got it going again.  However my amavisd service fails with a
 unknown exec error.  Its the only one that won't work, and isn't related to
 the group question.  I have it setup the same as postfix, dovecot, etc.
 
 Primitive amavisd lsb:amavisd op monitor interval=30s timeout=30s

Is that amavisd script LSB compliant? See
http://www.linux-ha.org/LSBResourceAgent for how to check on that.

 Just wondering if its taking too long to start or what...  If you have any
 ideas that would be great.

If start takes longer than the default operation timeout (20s by
default), then you should see that in the logs.

You can work around that by specifying a start op with timeout=whatever
amount of time you need. Make sure you do NOT set an interval for the
start op. Seems to be a common mistake.
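
For example, in crm shell syntax (the timeout value is only an illustration,
use whatever your service really needs):

primitive amavisd lsb:amavisd \
        op start timeout=120s \
        op monitor interval=30s timeout=30s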

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Failover problem

2009-06-25 Thread Dominik Klein
The default value for stonith-enabled is true. If, however, you do not
have a stonith device configured, that gives you an endless loop of
unsuccessful attempts to shoot the other node before anything else is done
with the resources the dead node was running.

try

crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore

Be warned though! Your cluster should not go into production like this!

David Hoskinson wrote:
 Im sorry this is maybe where my knowledge is lacking.  I don't have the
 hardware for a third node, but I understand your reasoning
 
 Don't understand how to add stonith and haven't found a good document for
 that... I also get No STONITH resources have been defined when I do a
 crm_verify -LV
 
 Don't know how to set quorom policy to ignore.
 
 Which of the last 2 would you suggest, and where to look for info on how to
 do it.
 
 thanks
 
 
 On 6/24/09 3:26 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote:
 
 On Wed, Jun 24, 2009 at 02:05:46PM -0500, David Hoskinson wrote:
 System running 2.99 heartbeat and pacemaker 1.04.  Running fine in master
 slave mode.  However if I shut down the slave server, all the services stop
 on the master until the slave comes back up, does the election and once
 again starts the services on the master.  This doesn't seem to be the way it
 should be.  Same thing if I shut the master down.  Services go off line
 until master is back up.
 Two node cluster, one vote down,
 50% is NOT majority - single node has no quorum.
 Quorum policy probably says: no quorum - stop.
 You need to
  - add more nodes (just to have a real quorum), and/or
  - add stonith, and/or
  - set quorum policy to ignore.
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Monitoring resources

2009-05-26 Thread Dominik Klein
Koen Verwimp wrote:
 Hi! 
 
   
 
 I have defined a resource called rg_alfresco_ip. This resource consists of
 an OCF script (AlfrescoIP). This script is a copy of IPaddr but with a
 customized status/monitoring procedure.
 
   
 
 <group id="rg_alfresco_ip">
   <primitive class="ocf" type="AlfrescoIP" provider="heartbeat" id="ip_AS">
     <meta_attributes>
       <attributes>
         <nvpair id="Alfresco_ip_meta2" name="resource_failure_stickiness" value="-INFINITY"/>
       </attributes>
     </meta_attributes>
     <instance_attributes id="ip_alfresco_attributes">
       <attributes>
         <nvpair id="IPaddr2_AS_pair1" name="ip" value="192.168.103.52"/>
         <nvpair id="IPaddr2_AS_pair2" name="nic" value="eth0"/>
         <nvpair id="IPaddr2_AS_pair3" name="iflabel" value="VIP_AS"/>
         <nvpair id="IPaddr2_AS_pair4" name="broadcast" value="192.68.103.255"/>
         <nvpair id="IPaddr2_AS_pair5" name="cird_netmask" value="255.255.255.0"/>
       </attributes>
     </instance_attributes>
     <operations>
       <op id="alfresco_ip_op1" name="start" timeout="60s" prereq="fencing" on_fail="restart"/>
       <op id="alfresco_ip_op2" name="stop" timeout="60s" on_fail="fence"/>
       <op id="alfresco_ip_op3" start_delay="30s" name="monitor" interval="30s" timeout="10s" on_fail="restart"/>
     </operations>
   </primitive>
 </group>
 
   
 
   
 
 Because I have configured a constraint (see below), rg_alfresco_ip will be
 started on its default node (DocuCluster03).
 
   
 
 <rule id="prefered_location_alfresco" score="100">
   <expression attribute="#uname" id="prefered_location_alfresco_expr" operation="eq" value="DocuCluster03"/>
 </rule>
 
   
 
 You can see the monitor operation defined in the resource group above. If the
 monitor operation returns a failure code, the on_fail action is set to
 "restart". Because the resource_failure_stickiness is equal to -INFINITY, the
 resource will be restarted/failed over to my second server (DocuCluster04).
 
   
 
 Problem: 
 
 If rg_alfresco_ip also fails on the second node (DocuCluster04), it doesn't try
 to fail over to node1 (DocuCluster03) again.
 
   
 
 Does anyone have an idea why Heartbeat doesn't try to fail back to node1
 (DocuCluster03)?

The score for both nodes is negative, ie the resource cannot run there.
End of story.

 Do I have to reset some error counter on node1? 

Yes. Try crm_failcount --help
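
Something along these lines (heartbeat 2.x option syntax from memory, please
verify with --help on your version):

# show the failcount for ip_AS on DocuCluster03
crm_failcount -G -U DocuCluster03 -r ip_AS
# and reset it
crm_failcount -D -U DocuCluster03 -r ip_AS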

And - since you're speaking of resource failure stickiness - please
consider upgrading to pacemaker.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Problems With SLES11 + DRBD

2009-05-04 Thread Dominik Klein
darren.mans...@opengi.co.uk wrote:
 Hello everyone. Long post, sorry.
 
  
 
 I've been trying to get SLES11 with Pacemaker 1.0 / OpenAIS working for
 most of this week without success so far. I thought I may as well bundle
 my problems into one mail to see if anyone can offer any advice.
 
  
 
 Goal: I'm trying to get a 2 node Active/Passive cluster working with
 DRBD replication, an ext3 FS on top of DRBD and a virtual IP. I want the
 active node to have a mounted FS that I can serve requests from using
 ProFTPD or another FTP daemon. If the active node fails I want the
 cluster to migrate all 4 resources (DRBD, FS, ProFTPD, Virtual IP)
 across to the other node. I don't have any STONITH devices at the
 moment.
 
  
 
 Approach: We are going with SLES11 with Pacemaker 1.0.3 and OpenAIS
 0.80.3, after already using SLES10SP2 with Heartbeat 2.1.4 and
 ldirectord in a live running 2-node Active/Active cluster. We are using
 LVM under DRBD for future disk expansion.
 
  
 
 Problem1 - Using DRBD OCF RA: I wanted to use the latest and greatest
 for the approaches, so tried the DRBD OCF RA following this howto:
 http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 . The configuration works
 and I can manually migrate resources but if I just reboot the node that
 has the drbd resource on it I see the resource gets migrated to the
 other node for about 2 seconds then is stopped:
 
  
 
 Normal operation:
 
 
 
 Last updated: Fri May  1 16:33:00 2009
 
 Current DC: gihub2 - partition with quorum

And this is your reason. The no-quorum-policy default is stop (you
even configured it, see below), which means do not run any resources if
you do not have quorum. The node is alone, so it does not have quorum.

If you want it to run things anyway, set no-quorum-policy to ignore.
That would be the old heartbeat behaviour.

 Version: 1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a
 
 2 Nodes configured, 2 expected votes
 
 1 Resources configured.
 
 
 
  
 
 Online: [ gihub1 gihub2 ]
 
  
 
 drbd0   (ocf::heartbeat:drbd):  Started gihub1
 
  
 
  
 
 Reboot gihub1:
 
 
 
 Last updated: Fri May  1 16:35:34 2009
 
 Current DC: gihub2 - partition with quorum
 
 Version: 1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a
 
 2 Nodes configured, 2 expected votes
 
 1 Resources configured.
 
 
 
  
 
 Online: [ gihub2 ]
 
 OFFLINE: [ gihub1 ]
 
  
 
 drbd0   (ocf::heartbeat:drbd):  Started gihub2
 
  
 
  
 
 Then after a couple of seconds:
 
 
 
 Last updated: Fri May  1 16:37:11 2009
 
 Current DC: gihub2 - partition WITHOUT quorum
 
 Version: 1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a
 
 2 Nodes configured, 2 expected votes
 
 1 Resources configured.
 
 
 
  
 
 Online: [ gihub2 ]
 
 OFFLINE: [ gihub1 ]
 
  
 
  
 
 /var/log/messages says:
 
 May  1 16:46:33 gihub2 openais[5362]: [TOTEM] The token was lost in the
 OPERATIONAL state.
 
 May  1 16:46:33 gihub2 openais[5362]: [TOTEM] Receive multicast socket
 recv buffer size (262142 bytes).
 
 May  1 16:46:33 gihub2 openais[5362]: [TOTEM] Transmit multicast socket
 send buffer size (262142 bytes).
 
 May  1 16:46:33 gihub2 openais[5362]: [TOTEM] entering GATHER state from
 2.
 
 May  1 16:46:36 gihub2 kernel: drbd0: conn( WFConnection -
 Disconnecting ) 
 
 May  1 16:46:36 gihub2 kernel: drbd0: Discarding network configuration.
 
 May  1 16:46:36 gihub2 kernel: drbd0: Connection closed
 
 May  1 16:46:36 gihub2 kernel: drbd0: conn( Disconnecting - StandAlone
 ) 
 
 May  1 16:46:36 gihub2 kernel: drbd0: receiver terminated
 
 May  1 16:46:36 gihub2 kernel: drbd0: Terminating receiver thread
 
 May  1 16:46:36 gihub2 kernel: drbd0: disk( UpToDate - Diskless ) 
 
 May  1 16:46:36 gihub2 kernel: drbd0: drbd_bm_resize called with
 capacity == 0
 
 May  1 16:46:36 gihub2 kernel: drbd0: worker terminated
 
 May  1 16:46:36 gihub2 kernel: drbd0: Terminating worker thread
 
 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] entering GATHER state from
 0.
 
 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] Creating commit token
 because I am the rep.
 
 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] Saving state aru 6b high
 seq received 6b
 
 May  1 16:46:36 gihub2 lrmd: [5370]: info: rsc:drbd0: stop
 
 May  1 16:46:36 gihub2 cib: [5369]: notice: ais_dispatch: Membership
 400: quorum lost
 
 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] Storing new sequence id
 for ring 190
 
 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] entering COMMIT state.
 
 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] entering RECOVERY state.
 
 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] position [0] member
 2.21.4.41:
 
 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] previous ring seq 396 rep
 2.21.4.40
 
 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] aru 6b high delivered 6b
 received flag 1
 
 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] Did not need to originate
 any messages in recovery.
 
 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] Sending 

Re: [Linux-HA] Problems With SLES11 + DRBD

2009-05-04 Thread Dominik Klein
Dominik Klein wrote:
 darren.mans...@opengi.co.uk wrote:
 Hello everyone. Long post, sorry.

  

 I've been trying to get SLES11 with Pacemaker 1.0 / OpenAIS working for
 most of this week without success so far. I thought I may as well bundle
 my problems into one mail to see if anyone can offer any advice.

  

 Goal: I'm trying to get a 2 node Active/Passive cluster working with
 DRBD replication, an ext3 FS on top of DRBD and a virtual IP. I want the
 active node to have a mounted FS that I can serve requests from using
 ProFTPD or another FTP daemon. If the active node fails I want the
 cluster to migrate all 4 resources (DRBD, FS, ProFTPD, Virtual IP)
 across to the other node. I don't have any STONITH devices at the
 moment.

  

 Approach: We are going with SLES11 with Pacemaker 1.0.3 and OpenAIS
 0.80.3, after already using SLES10SP2 with Heartbeat 2.1.4 and
 ldirectord in a live running 2-node Active/Active cluster. We are using
 LVM under DRBD for future disk expansion.

  

 Problem1 - Using DRBD OCF RA: I wanted to use the latest and greatest
 for the approaches, so tried the DRBD OCF RA following this howto:
 http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 . The configuration works
 and I can manually migrate resources but if I just reboot the node that
 has the drbd resource on it I see the resource gets migrated to the
 other node for about 2 seconds then is stopped:

  

 Normal operation:

 

 Last updated: Fri May  1 16:33:00 2009

 Current DC: gihub2 - partition with quorum
 
 And this is your reason. 

Bla. (read below)

 The no-quorum-policy default is stop (you
 even configured it, see below), which means do not run any resources if
 you do not have qorum. The node is alone, so it does not have quorum.
 
 If you want it to run things anyway, set no-quorum-policy to ignore.
 That would be the old heartbeat behaviour.
 
 Version: 1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a

 2 Nodes configured, 2 expected votes

 1 Resources configured.

 

  

 Online: [ gihub1 gihub2 ]

  

 drbd0   (ocf::heartbeat:drbd):  Started gihub1

  

  

 Reboot gihub1:

 

 Last updated: Fri May  1 16:35:34 2009

 Current DC: gihub2 - partition with quorum

 Version: 1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a

 2 Nodes configured, 2 expected votes

 1 Resources configured.

 

  

 Online: [ gihub2 ]

 OFFLINE: [ gihub1 ]

  

 drbd0   (ocf::heartbeat:drbd):  Started gihub2

  

  

 Then after a couple of seconds:

 

 Last updated: Fri May  1 16:37:11 2009

 Current DC: gihub2 - partition WITHOUT quorum

Here you are without quorum.
Sorry.

Regards
Dominik

 Version: 1.0.3-0080ec086ae9c20ad5c4c3562000c0ad68374f0a

 2 Nodes configured, 2 expected votes

 1 Resources configured.

 

  

 Online: [ gihub2 ]

 OFFLINE: [ gihub1 ]

  

  

 /var/log/messages says:

 May  1 16:46:33 gihub2 openais[5362]: [TOTEM] The token was lost in the
 OPERATIONAL state.

 May  1 16:46:33 gihub2 openais[5362]: [TOTEM] Receive multicast socket
 recv buffer size (262142 bytes).

 May  1 16:46:33 gihub2 openais[5362]: [TOTEM] Transmit multicast socket
 send buffer size (262142 bytes).

 May  1 16:46:33 gihub2 openais[5362]: [TOTEM] entering GATHER state from
 2.

 May  1 16:46:36 gihub2 kernel: drbd0: conn( WFConnection -
 Disconnecting ) 

 May  1 16:46:36 gihub2 kernel: drbd0: Discarding network configuration.

 May  1 16:46:36 gihub2 kernel: drbd0: Connection closed

 May  1 16:46:36 gihub2 kernel: drbd0: conn( Disconnecting - StandAlone
 ) 

 May  1 16:46:36 gihub2 kernel: drbd0: receiver terminated

 May  1 16:46:36 gihub2 kernel: drbd0: Terminating receiver thread

 May  1 16:46:36 gihub2 kernel: drbd0: disk( UpToDate - Diskless ) 

 May  1 16:46:36 gihub2 kernel: drbd0: drbd_bm_resize called with
 capacity == 0

 May  1 16:46:36 gihub2 kernel: drbd0: worker terminated

 May  1 16:46:36 gihub2 kernel: drbd0: Terminating worker thread

 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] entering GATHER state from
 0.

 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] Creating commit token
 because I am the rep.

 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] Saving state aru 6b high
 seq received 6b

 May  1 16:46:36 gihub2 lrmd: [5370]: info: rsc:drbd0: stop

 May  1 16:46:36 gihub2 cib: [5369]: notice: ais_dispatch: Membership
 400: quorum lost

 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] Storing new sequence id
 for ring 190

 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] entering COMMIT state.

 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] entering RECOVERY state.

 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] position [0] member
 2.21.4.41:

 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] previous ring seq 396 rep
 2.21.4.40

 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] aru 6b high delivered 6b
 received flag 1

 May  1 16:46:36 gihub2 openais[5362]: [TOTEM] Did not need to originate
 any messages in recovery.

 May  1 16:46:36 gihub2 openais

[Linux-ha-dev] Patch: RA mysql

2009-04-24 Thread Dominik Klein
Trivial. See attached patch.

Regards
Dominik
exporting patch:
# HG changeset patch
# User Dominik Klein d...@in-telegence.net
# Date 1240578752 -7200
# Node ID 2d97904c385cc9b4779286001611bd748f48589d
# Parent  60cc2d6eee88ff6c2dedf7b539b9ee018efda6da
Low: RA mysql: Correctly remove eventually remaining socket

diff -r 60cc2d6eee88 -r 2d97904c385c resources/OCF/mysql
--- a/resources/OCF/mysql	Fri Apr 24 08:38:48 2009 +0200
+++ b/resources/OCF/mysql	Fri Apr 24 15:12:32 2009 +0200
@@ -419,7 +419,7 @@
 
 ocf_log info MySQL stopped;
 rm -f /var/lock/subsys/mysqld
-rm -f $OCF_RESKEY_datadir/mysql.sock
+rm -f $OCF_RESKEY_socket
 return $OCF_SUCCESS
 }
 
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] Assymetric Clustering

2009-04-22 Thread Dominik Klein
fsalas wrote:
 Hi, I'm quite new to clustering and HeartBeat, but as far as I know, a very
 nice packages.
 
 Well, here is my problem, I'm willing to setup a cluster for an small
 enterprise that will have several services located in virtual machines, to
 make it simpler, let's say we have four nodes in the cluster, in two nodes
 will run service A and in the other two nodes will run service B. I would
 like to make it in one cluster and not in two miniclusters because of later
 migration posibilities, easier administration,etc.
 
 I'm working with Ubuntu server 8.10 , with heartbeat-2 distrib packages.
 
 Now, Ive setup the first two nodes, with service A with no problem , service
 A is lsb, in my test is simply an apt-proxy with a virtual IP and a drbd for
 shared storage. After learning how to do it, it works flawesly, crm_verify
 didn't complain , all just fine.
 
 I've setup location rules to only let service A to run in these two nodes
 
 Then I decided to add the other two nodes to the cluster, to continue with
 service B, and here is the problem, even if these new nodes aren't allowed
 to run service A, it seems that the CRM tells the LRM to monitor service A
 on these nodes, as apt-proxy and drbd are not even installed there, it
 complains with errors with drbd, and failed on apt-proxy. AM I missing
 something here, or those test shouldn't be there, as location forbids those
 nodes for running this services.:,(
 
 I would really appreciate any light you can bring on this, as Im struggling
 with it for the last days.
 
 thanx in advance, and my apologies if my english is not as good as it
 should!
 
 :-D

We can only guess if you don't share configuration files and logs, but I
guess what you see is the probe operations returning "not installed". A
probe is run on every node to find out what state the resource is in on
that node before doing anything to the resource.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Restart a service without the dependent services restarting?

2009-04-20 Thread Dominik Klein
Noah Miller wrote:
 Hi -
 
 Is it possible to restart a clustered service (v2 cluster) without its
 dependent services also stopping and starting?

When the constraint score is advisory (0), dependencies should not be
restarted, but then they are not really dependencies in the sense of
the word.
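
For illustration, in pacemaker's crm shell syntax (resource names A and
B are made up, not taken from your setup):

# mandatory: restarting A also restarts B
order o-mandatory inf: A B
# advisory: only influences the startup order, restarting A alone
# does not force a restart of B
order o-advisory 0: A B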

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Re: Stopping the Heartbeat daemon does not stop the DRBD Daemon

2009-04-03 Thread Dominik Klein
Joe Bill wrote:
 
  Stopping the Heartbeat daemon (service heartbeat stop)
 does not stop the DRBD daemon even if it is one of
 the resources. 
 
 - Heartbeat and DRBD are 2 different products/packages
 
 - Like most services, DRBD doesn't need Heartbeat to run. You can set up and 
 run DRBD volumes without Heartbeat installed, or any cluster supervisor.
 
 - The DRBD daemons provide the communication interface for each network 
 volume and are therefor an integral part of the volume management. Without 
 the DRBD daemons, you (manually) and Heartbeat (automagically) could not 
 handle the DRBD volumes.

Just to avoid confusion: There is no such thing as a DRBD daemon. DRBD
is a kernel module.

 - If you look carefully at your startup, DRBD daemons start whether or not 
 Heartbeat is started.

That depends on your setup. Maybe in yours it does and it should. In
others it does not and it should not.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Re: Re: Re: Stopping the Heartbeat daemon does not stop the DRBD Daemon

2009-04-03 Thread Dominik Klein
 - The DRBD daemons provide the communication interface
 for each network volume and are therefor an integral
 part of the volume management. Without the DRBD daemons,
 you (manually) and Heartbeat (automagically) could not
 handle the DRBD volumes.
 Just to avoid confusion: There is no such thing as a DRBD daemon. DRBD
 is a kernel module. 
 
 Now I'm the one confused.
 What are these processes that show up when I ps -ef ?
 
 root..25621..0..2008..?00:00:00 [drbd7_worker]
 root.175581..0..2008..?00:00:00 [drbd7_receiver]
 root.246471..0.Jan02..?00:00:27 [drbd7_asender]
 
 Doesn't the '1'---^ here, mean 'root' detached ?

Those are the kernel threads (indicated by the enclosing [] in the ps output).

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] HA Books

2009-04-02 Thread Dominik Klein
darren.mans...@opengi.co.uk wrote:
 Hi. Can anyone recommend any good books about HA with regards to the
 latest incarnations such as Pacemaker etc? I understand enough about the
 CRM and heartbeat 2 to get by but lots of the stuff on this list still
 goes over my head.
 
 Thanks.
 
 Darren Mansell

There was a discussion about books in January. Have a look in the archives.

Bottom line: There is one (1) book. It is in German and it is (no
offense Michael) somewhat outdated, at least in some parts. IIRC, it was
written against version 2.1.3.

I'd personally just go through the PDF docs from the clusterlabs site.
They cover a lot.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Stopping the Heartbeat daemon does not stop the DRBD Daemon

2009-04-02 Thread Dominik Klein
Jerome Yanga wrote:
 Stopping the Heartbeat daemon (service heartbeat stop) does not stop the DRBD 
 daemon even if it is one of the resources.
 
 # service heartbeat stop
 Stopping High-Availability services:
[  OK  ]
 # service drbd status
 drbd driver loaded OK; device status:
 version: 8.2.7 (api:88/proto:86-88)
 GIT-hash: 61b7f4c2fc34fe3d2acf7be6bcc1fc2684708a7d build by 
 r...@nomen.esri.com, 2009-03-24 08:29:57
 m:res  csst  ds  p  mounted  fstype
 0:r0   Unconfigured

It stops your drbd resource (device). It just does not unload the
module. That is the expected behaviour.
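
To see the difference yourself (plain drbd commands, nothing specific
to your setup):

cat /proc/drbd   # module still loaded, the resource shows up as Unconfigured
rmmod drbd       # only if you really want the module unloaded as well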

Regards
Dominik

 Running the command below stops the DRBD daemon.
 
 Service drbd stop
 
 
 Applications Installed:
 ===
 drbd-8.2.7-3
 heartbeat-2.99.2-6.1
 pacemaker-1.0.2-11.1
 
 
 CIB.xml:
 
 # crm configure show
 primitive fs0 ocf:heartbeat:Filesystem \
 params fstype=ext3 directory=/data device=/dev/drbd0
 primitive VIP ocf:heartbeat:IPaddr \
 params ip=10.50.26.250 \
 op monitor interval=5s timeout=5s
 primitive drbd0 ocf:heartbeat:drbd \
 params drbd_resource=r0 \
 op monitor interval=59s role=Master timeout=30s \
 op monitor interval=60s role=Slave timeout=30s
 group DRBD_Group fs0 VIP \
 meta collocated=true ordered=true migration-threshold=1 
 failure-timeout=10s resource-stickiness=10
 ms ms-drbd0 drbd0 \
 meta clone-max=2 notify=true globally-unique=false 
 target-role=Started
 colocation DRBD_Group-on-ms-drbd0 inf: DRBD_Group ms-drbd0:Master
 order ms-drbd0-before-DRBD_Group inf: ms-drbd0:promote DRBD_Group:start
 
 Help.
 
 Regards,
 jerome
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] pingd/pacemaker

2009-04-01 Thread Dominik Klein
 I know. But this attrbiut does not exist in my setup. pacemaker verison 
 1.0.1-1. Is this a feature of 1.0.2?

1.0.1 is 4 months old. The RA was updated with those features 3 months
ago. So basically, yes. You could still update just the single RA from
the Mercurial repository, though.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] showscores.sh for pacemaker 1.0.2

2009-04-01 Thread Dominik Klein
So here's an update. Michael Schwartzkopff pointed out a bug regarding
groups. That has been fixed now and the appropriate values should be
shown. Thanks!

There has not been a lot of feedback. Is that because nobody uses the
script, or does it just work for you?

Regards
Dominik

Dominik Klein wrote:
 Hi
 
 I made the necessary changes to the showscores script to work with
 pacemaker 1.0.2.
 
 Please test and report problems. Has been reported to work by some
 people and should go into the repository soon. Still, I'd like more
 people to test and confirm.
 
 Important changes:
 * correctly fetch stickiness and migration-threshold for complex
 resources (master and clone)
 * adjust column-width according to the length of resources' and nodes' names
 
 Regards
 Dominik


showscores.sh
Description: Bourne shell script
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Heartbeat v2 stickiness, score and more

2009-04-01 Thread Dominik Klein
florian.engelm...@bt.com wrote:
 Hello,
 I spent the whole afternoon to search for a good heartbeat v2
 documentation, but it looks like this is somehow difficult. Maybe
 someone in here can help me?
 
 Anyway I have a short question about stickiness. I only know about sun
 cluster but I have to build up knowledge about heartbeat cluster since
 we are running two debian heartbeat clusters now.
 Those two are failover clusters providing web services, nagios and
 vserver virtual hosts. Let's say resource_group_a is running on node1
 and resource_group_b on node2. If I reboot node2 resource_group_b will
 switch to node1. But if node2 is up again resource_group_b will switch
 back to node2. That is what I don't want the cluster to do. No
 switchback... How can I do that?

crm_attribute --type crm_config --attr-name default-resource-stickiness
--attr-value INFINITY

Old versions might need _ instead of - in the attr-name. If yours
does, take that as a hint that you should upgrade your cluster software.

 And which command is used to switch one resource group to another node
 (not marking any node as standby)?

crm_resource
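
For example (group and node names are placeholders, adjust them to your
setup):

crm_resource -M -r resource_group_b -H node1   # move the group to node1
crm_resource -U -r resource_group_b            # later: remove the constraint again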

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] pingd/pacemaker

2009-03-31 Thread Dominik Klein
Michael Schwartzkopff wrote:
 Hi,
 
 I am testing the pingd from the provider pacemaker. As Dominik told me, there 
 is no need to define ping nodes in the ha.cf any more. OK so far.
 
 As I see pingd tries to reach all pingnodes of the hostlist attribute every 
 10 
 seconds. Is it possible to pass an attribute to the pingd deamon to have it 
 sending out ICMP echo request every second of every 3 seconds?
 
 Thanks.
 

interval ;)

Take a look at the metadata for available parameters.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] pingd/pacemaker

2009-03-31 Thread Dominik Klein
Michael Schwartzkopff wrote:
 Am Dienstag, 31. März 2009 15:27:47 schrieb Dominik Klein:
 Michael Schwartzkopff wrote:
 Hi,

 I am testing the pingd from the provider pacemaker. As Dominik told me,
 there is no need to define ping nodes in the ha.cf any more. OK so far.

 As I see pingd tries to reach all pingnodes of the hostlist attribute
 every 10 seconds. Is it possible to pass an attribute to the pingd deamon
 to have it sending out ICMP echo request every second of every 3 seconds?

 Thanks.
 interval ;)

 Take a look at the metadata for available parameters.

 Regards
 Dominik
 
 Perhaps I am blind, but I also thought about that. Doesn't work for me. 
 Please 
 can anybody help me. My pingd clone is:
 
  primitive class=ocf id=clone_ping-primitive provider=pacemaker 
 type=pingd
   meta_attributes id=clone_ping-primitive-meta_attributes/
   operations id=clone_ping-primitive-operations
 op enabled=true id=clone_ping-primitive-operations-op 
 interval=3s name=monitor on-fail=ignore requires=nothing start-
 delay=1m timeout=20/

That's the interval of your monitor operation.

   /operations
   instance_attributes id=clone_ping-primitive-instance_attributes
 nvpair id=nvpair-48efdbbc-ee3f-4382-b8ac-10200bdc6ca1 
 name=host_list value=192.168.188.2 192.168.188.3/
 nvpair id=nvpair-edb09383-2f1c-4d1c-b671-23598109fbeb 
 name=dampen 
 value=3s/
 nvpair id=nvpair-ce8cb00a-9406-447f-8c08-af65fae26ff4 
 name=multiplier value=100/

nvpair id=pingd-interval name=interval value=3/

   /instance_attributes
 /primitive
   /clone
 
 When I tcpdump on the interface I see an ICMP echo request all 10 seconds on 
 the line.

When I said look at the metadata, I meant:

export OCF_ROOT=/usr/lib/ocf
/usr/lib/ocf/resource.d/pacemaker/pingd meta-data

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Beginner questions

2009-03-24 Thread Dominik Klein
Juha Heinanen wrote:
 Juha Heinanen writes:
 
   the real problem is that start of mysql server by pacemaker stops
   altogether after a few manual stops (/etc/init.d/mysql stop).
 
 i think i figured this out.  when pacemaker needed to start my
 mysql-server resource three times on node lenny1, it migrated the group
 to node lenny2.  when i then repeated stoping of mysql-server on lenny2,
 it migrated the group back to lenny1, but didn't start mysql-server,
 because it remembered that it had already started it there 3 times.
 
 if so, my conclusion is to forget migration-threshold parameter.

That sounds about right.

You can configure a failure-timeout. That's an amount of time after
which the cluster forgets about failures.

Read up on failure-timeout and don't miss the section on how to ensure
time-based rules take effect in the PDF documentation.
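
A minimal sketch in crm shell syntax (the resource mirrors your own
lsb:test example, the values are just examples):

primitive test lsb:test \
  op monitor interval="30s" timeout="5s" \
  meta migration-threshold="3" failure-timeout="120s"
# time-based rules (like failure-timeout expiry) are only re-evaluated
# periodically, so also consider:
property cluster-recheck-interval="5min"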

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Beginner questions

2009-03-23 Thread Dominik Klein
Les Mikesell wrote:
 My first HA setup is for a squid proxy where all I need is to move an IP
 address to a backup server if the primary fails (and the cache can just
 rebuild on its own).  This seems to work, but will only fail if the
 machine goes down completely or the primary IP is unreachable.  Is that
 typical or are there monitors for the service itself so failover would
 happen if the squid process is not running or stops accepting connections?
 
 Second question (unrelated):  Can heartbeat be set up so one or two
 spare machines could automatically take over the IP address of any of a
 much larger pool of machines that might fail?
 

Heartbeat in v1 mode (haresources configuration) cannot do any
resource-level monitoring itself. You'd need to do that externally by
other means.

If you're just starting out learning now, I'd suggest going with openais
and pacemaker instead of heartbeat right away. Check out the
documentation on www.clusterlabs.org/wiki/install and
www.clusterlabs.org/wiki/Documentation
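
Once on pacemaker, resource-level monitoring is just another operation
on the primitive. A sketch (assuming an lsb init script called squid;
the IP is a placeholder):

primitive squid-ip ocf:heartbeat:IPaddr2 \
  params ip="192.168.1.100" \
  op monitor interval="10s"
primitive squid lsb:squid \
  op monitor interval="30s" timeout="20s"
group squid-group squid-ip squid
# with a monitor op, a dead squid process triggers recovery/failover,
# not just a dead node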

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Beginner questions

2009-03-23 Thread Dominik Klein
Juha Heinanen wrote:
 Dominik Klein writes:
 
   Heartbeat in v1 mode (haresources configuration) cannot do any resource
   level monitoring itself. You'd need to do that externally by any
   means.
 
 yes, in v2 mode i have managed to make pacemaker to monitor resources,
 for example, like this:
 
 primitive test lsb:test \
   op monitor interval=30s timeout=5s \
   meta target-role=Started
 
 but i still have failed to find out how to make pacemaker to migrate
 a resource group to another node if one of the resources in the group
 fails to start.
 
 for example, if test is the last member of group
 
 group test-group fs0 mysql-server virtual-ip test
 
 and fails to start, the group is not migrated to another node.
 
 i have tried to add 
 
 primitive test lsb:test op monitor interval=30s timeout=5s meta 
 migration-threshold=3
 
 but it just stopped monitoring of test after 3 attempts.
 
 any ideas how to achieve migration?

I read your email on the pacemaker list and, from what you've shared and
explained, I cannot spot a configuration issue. It should just work
like that (and does work like that for me).

Maybe post your entire configuration, preferably as an hb_report archive.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] maintenance-mode of pengine

2009-03-23 Thread Dominik Klein
Michael Schwartzkopff wrote:
 Hi,
 
 In the metadata of the pengine I found the attribute maintenance-mode. I did 
 not find any documentation about it. The long description also says: Should 
 the cluster  Anybody knows what this options does?
 
 Thanks.

It disables resource management when set to true. Like
is-managed-default did in the old days, plus, IIRC, it also disables
all ops. But better let Andrew verify the latter.
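
To try it out (a sketch; it is set like any other cluster property):

crm configure property maintenance-mode="true"
# ... poke at resources manually, the cluster will not interfere ...
crm configure property maintenance-mode="false"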

Regards
Dominik


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] expected-quorum-votes

2009-03-23 Thread Dominik Klein
 crmd metadata tells me that expected-quorum-votes
 are used to calculate quorum in openais based clusters. Its default value is 
 2. Do I have to change this value if I have 3 or more nodes in a OpenAIS 
 based 
 cluster?

No. It is automatically adjusted by the cluster.
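
If you want to check what the cluster currently uses (just a query,
nothing to configure):

cibadmin -Q | grep expected-quorum-votes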

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat degrades drbd resource

2009-03-23 Thread Dominik Klein
You cannot use drbd in heartbeat the way you configured it.

Please refer to http://wiki.linux-ha.org/DRBD/HowTov2 and (if that
wasn't made clear enough on the page) make sure the first thing you do
is upgrade your cluster software. Read here on how to do that:
http://clusterlabs.org/wiki/Install

Regards
Dominik

Heiko Schellhorn wrote:
 Hi
 
 I installed drbd (8.0.14) together with heartbeat (2.0.8) on a Gentoo-system.
 
 I have following problem:
 Standalone the drbd resource works perfectly. I can mount/unmount it 
 alternate 
 on both nodes. Reading/writing works and /proc/drbd looks fine.
 
 But when I start heartbeat it degrades the resource step by step until it's 
 marked as unconfigured. An excerpt of the logfile is attached.
 Heartbeat itself starts up and runs. Two of the three resources configured up 
 to now are also working. Only drbd shows problems. (See the file  
 crm_mon-out)
 
 I don't think it's a problem of communication between the nodes because drbd 
 is working standalone and e.g. the IPaddr2 resource is also working within 
 heartbeat.
 I also tried several heartbeat-configurations. First I defined the resources 
 as single resources and then I combined the resources to a resource group.
 There was no difference.
 
 Has someone seen such an issue before? Any ideas ?
 I didn't find anything helpful in the list archive.
 
 If you need more informations I can provide a complete log and the config.
 
 Thanks
 
 Heiko
 
 
 
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat degrades drbd resource

2009-03-23 Thread Dominik Klein
Dominik Klein wrote:
 You cannot use drbd in heartbeat the way you configured it.
 
 Please refer to http://wiki.linux-ha.org/DRBD/HowTov2 

Sorry, copy/paste error. I meant to say

http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Beginner questions

2009-03-23 Thread Dominik Klein
 Is there some documentation available for openais?  I can't even find a
 good description of what it does or why you would use it.  Also, will
 this help with my 2nd question: having a few spares for a large number
 of servers?  While my objective with the squid cache is to proxy
 everything through one server to maximize the cache hits, I may switch
 to memcached on a group of machines and would like to have a standby or
 2 that could take over for any failing machine.

Well, there are man-pages and the mailing list. The install page even
has a configuration example. And I have found this thread to be
especially helpful:
https://lists.linux-foundation.org/pipermail/openais/2009-March/010894.html

openais will be the future platform for pacemaker clusters providing the
communication infrastructure and node failure detection.

Heartbeat will, for example, no longer be part of the next SUSE
enterprise Linux (SLES 11) HA solution. It will be based on openais. So
for new setups, this should be the way to go - at least IMHO.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] drbd RA issue in (heartbeat 2.1.4 + drbd-8.3.0)

2009-03-19 Thread Dominik Klein
Dejan Muhamedagic wrote:
 Hi,
 
 On Wed, Mar 18, 2009 at 11:37:27AM -0700, Neil Katin wrote:

 Dejan Muhamedagic wrote:
 Hi,

 On Tue, Mar 17, 2009 at 11:56:04AM +0530, Arun G wrote:
 Hi,
  I observed below error message when I upgraded drbd to drbd-8.3.0 in
 heartbeat 2.1.4 cluster on 2.6.18-8.el5xen.
 -- snip --

 Thanks for the patch. But do all supported drbd versions have the
 role command?

 Thanks,

 Dejan
 No, only 8.3 has the change.  8.2 supports the old state argument, but
 prints a warning message out, and this warning message upsets the drbd OCF
 scripts parting of drbdadm's output.
 
 Since versions before 8.3 don't have the role command, I suppose
 that 8.3 actually prints the warning.
 
 drbdadm doesn't support a --version argument, but it does support a status
 command, which has version info in it.  However, I am not sure if drbdadm 
 status
 is guaranteed to not block or not, so I didn't want to have the OCF script 
 depend
 on it.
 
 drbdadm  | grep Version
 
 works for 8.2.7 and 8.0.14, so I guess that it is available in
 other versions too.
 
 So, I see three alternatives: add a new script drbdadm8.3.  Add an extra 
 parameter
 saying use role instead of status.  Or call drbdadm status to 
 dynamically detect
 our version.

 Do you see other choices?  Do you have a preference for a particular 
 alternative?
 I'm willing to code and test the patch if we can decide what we want.
 
 Let's see if we can figure out the version. Adding new RA would
 be a maintenance issue. Adding new parameter would make
 configuration depend on particular release.
 
 We could do something like this:
 
 drbdadm  | grep Version | awk '{print $2}' |
 awk -F. '
   $1 != 8 { exit 2; }

This should also allow version 7. People may still use v7. The drbdadm |
grep thing also works. Tested with latest v7 in a vm.

It prints

# drbdadm  | grep Version | awk '{print $2}'
0.7.25

though.

Regards
Dominik

   $2  3 { exit 1; } # use status
   # otherwise use role
 '
 rc=$?
 if [ $rc -eq 2 ]; then
   error installed (unsupported version)
 elif [ $rc -eq 1 ]; then
   cmd=status
 else
   cmd=role
 fi
 
 Could you please try this out. I can't test this right now.
 
 Thanks,
 
 Dejan
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] pingnodes in openais

2009-03-18 Thread Dominik Klein
Michael Schwartzkopff wrote:
 Hi,
 
 As far as I know pingnodes have to be configured in heartbeat. heartbeat 
 pings 
 the nodes and updates the CIB.
 
 Where can I configure pingnodes, when I use OpenAIS as the cluster stack?

Create a pingd clone resource in the CIB. It's the preferred way of
running pingd anyway.

Something like

primitive pingd ocf:pacemaker:pingd \
params host_list=1.2.3.4 5.6.7.8 interval=5 dampen=10 \
op monitor interval=30 timeout=30
clone cl-pingd pingd

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Having issues with getting DRBD to work with Pacemaker

2009-03-09 Thread Dominik Klein
Hi

Jerome Yanga wrote:
 Dominik,
 
 As usual, you are right on the money.  I should have caught that myself.  
 Thank you for catching that for me.  What happened was that I used a 
 different server to compile DRBD and I had assumed that Nomen and Rubic (my 
 test nodes) were on the same kernel.
 
 Moreover, I had also combined Neil's suggestion to yours as he had mentioned 
 that pacemaker-1.0.1 and drbd-8.2 works.
 
 My current issues are as follows:
 1)  I cannot migrate the resource fs0 from Nomen to Rubric.  Running the 
 command  crm resource migrate fs0 just puts fs0 to offline state.  This 
 sounds like a config change.  NOTE:  I am planning to add fs0 into a Group 
 that will be able to migrate between the two nodes (Nomen and Rubric).  Help. 
  Please provide the crm(live) syntax as I have tried the ones below and crm 
 complains that the syntax is wrong.
 
 order ms-drbd0-before-fs0 mandatory: ms-drbd0:promote fs0:start
 colocation fs0-on-ms-drbd0 inf: fs0 ms-drbd0:Master

You need 1.0.2 for that. 1.0.1 packages' crm shell had a bug there.

 2)  Is there a documentation for what resources, constraints and the like I 
 can add into the cib.xml via crm(live)?  Moreover, their syntax to add them 
 via crm(live)?

http://clusterlabs.org/wiki/Documentation

--snip--

 cib.xml:
 
 cib admin_epoch=0 validate-with=pacemaker-1.0 crm_feature_set=3.0 
 have-quorum=1 epoch=153 num_updates=0 cib-last-written=Fri Mar  6 
 12:52:27 2009 dc-uuid=3a8b681c-a14b-4037-a8e6-2d4af2eff88e
   configuration
 crm_config
   cluster_property_set id=cib-bootstrap-options
 nvpair id=cib-bootstrap-options-dc-version name=dc-version 
 value=1.0.1-node: 6fc5ce8302abf145a02891ec41e5a492efbe8efe/
 nvpair id=cib-bootstrap-options-last-lrm-refresh 
 name=last-lrm-refresh value=1236213117/
   /cluster_property_set
 /crm_config
 nodes
   node id=3a8b681c-a14b-4037-a8e6-2d4af2eff88e uname=nomen.esri.com 
 type=normal/
   node id=a5e95310-f27d-418e-9cb9-42e50310f702 uname=rubric.esri.com 
 type=normal/
 /nodes
 resources
   master id=ms-drbd0
 meta_attributes id=ms-drbd0-meta_attributes
   nvpair id=ms-drbd0-meta_attributes-clone-max name=clone-max 
 value=2/
   nvpair id=ms-drbd0-meta_attributes-notify name=notify 
 value=true/
   nvpair id=ms-drbd0-meta_attributes-globally-unique 
 name=globally-unique value=false/
   nvpair id=ms-drbd0-meta_attributes-target-role 
 name=target-role value=Started/
 /meta_attributes
 primitive class=ocf id=drbd0 provider=heartbeat type=drbd
   instance_attributes id=drbd0-instance_attributes
 nvpair id=drbd0-instance_attributes-drbd_resource 
 name=drbd_resource value=r0/
   /instance_attributes
   operations id=drbd0-ops
 op id=drbd0-monitor-59s interval=59s name=monitor 
 role=Master timeout=30s/
 op id=drbd0-monitor-60s interval=60s name=monitor 
 role=Slave timeout=30s/
   /operations
 /primitive
   /master
   primitive class=ocf id=VIP provider=heartbeat type=IPaddr
 instance_attributes id=VIP-instance_attributes
   nvpair id=VIP-instance_attributes-ip name=ip 
 value=10.50.26.250/
 /instance_attributes
 operations id=VIP-ops
   op id=VIP-monitor-5s interval=5s name=monitor timeout=5s/
 /operations
   /primitive
   primitive class=ocf id=fs0 provider=heartbeat type=Filesystem
 instance_attributes id=fs0-instance_attributes
   nvpair id=fs0-instance_attributes-fstype name=fstype 
 value=ext3/
   nvpair id=fs0-instance_attributes-directory name=directory 
 value=/data/
   nvpair id=fs0-instance_attributes-device name=device 
 value=/dev/drbd0/
 /instance_attributes
   /primitive
 /resources
 constraints/

You don't have any constraints, so migrate fs0 will fail and not take
drbd into account.

   /configuration
 /cib
 
 
 
 messages:
 ==
 Mar  6 12:56:07 nomen lrmd: [14509]: info: Resource Agent output: []
 Mar  6 12:56:08 nomen crm_shadow: [1551]: info: Invoked: crm_shadow
 Mar  6 12:56:08 nomen crm_shadow: [1565]: info: Invoked: crm_shadow
 Mar  6 12:56:08 nomen crm_resource: [1566]: info: Invoked: crm_resource -M -r 
 fs0
 Mar  6 12:56:09 nomen cib: [14508]: info: cib_process_request: Operation 
 complete: op cib_delete for section constraints 
 (origin=local/crm_resource/3): ok (rc=0)
 Mar  6 12:56:09 nomen haclient: on_event:evt:cib_changed
 Mar  6 12:56:09 nomen crmd: [14603]: info: abort_transition_graph: 
 need_abort:60 - Triggered transition abort (complete=1) : Non-status change
 Mar  6 12:56:09 nomen crmd: [14603]: info: need_abort: Aborting on change to 
 epoch
 Mar  6 12:56:09 nomen crmd: [14603]: info: do_state_transition: State 
 transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
 origin=abort_transition_graph ]
 Mar  6 12:56:09 nomen 

Re: [Linux-HA] Having issues with getting DRBD to work with Pacemaker

2009-03-04 Thread Dominik Klein
Hi

Jerome Yanga wrote:
 Hi!  I am having issues with getting DRBD to work with Pacemaker.  I can get 
 Pacemaker and DRBD run individually but not DRBD managed by Pacemaker.  I 
 tried following the instruction in the site below but the resources will not 
 go online.
 
 http://clusterlabs.org/wiki/DRBD_HowTo_1.0
 
 Below is my configuration.
 
 Installed applications:
 ===
 kernel-2.6.18-128.el5

copy that

 drbd-8.3.0-3
 heartbeat-2.99.2-6.1
 pacemaker-1.0.1-3.1
 
 
 
 drbd.conf:
 ==
 global {
 usage-count no;
 }
 
 resource r0 {
   protocol C;
   handlers {
 pri-on-incon-degr echo o  /proc/sysrq-trigger ; halt -f;
 pri-lost-after-sb echo o  /proc/sysrq-trigger ; halt -f;
 local-io-error echo o  /proc/sysrq-trigger ; halt -f;
 outdate-peer /usr/lib/heartbeat/drbd-peer-outdater -t 5;
 pri-lost echo pri-lost. Have a look at the log files. | mail -s 'DRBD 
 Alert' root;
 out-of-sync /usr/lib/drbd/notify-out-of-sync.sh root;
   }
   startup {
  wfc-timeout  0;
   }
 
   disk {
 on-io-error   pass_on;
   }
   net {
  max-buffers 2048;
 after-sb-0pri disconnect;
 after-sb-1pri disconnect;
 after-sb-2pri disconnect;
 rr-conflict disconnect;
   }
   syncer {
 rate 100M;
 al-extents 257;
   }
   on nomen.esri.com {
 device /dev/drbd0;
 disk   /dev/sda5;
 address192.168.0.1:7789;
 meta-disk  internal;
   }
   on rubric.esri.com {
 device/dev/drbd0;
 disk  /dev/sda5;
 address   192.168.0.2:7789;
 meta-disk internal;
   }
 }
 
 
 
 Cib.xml:
 
 cib admin_epoch=0 validate-with=pacemaker-1.0 crm_feature_set=3.0 
 have-quorum=1 dc-uuid=a5
 e95310-f27d-418e-9cb9-42e50310f702 epoch=56 num_updates=0 
 cib-last-written=Wed Mar  4 14:27:59
  2009
   configuration
 crm_config
   cluster_property_set id=cib-bootstrap-options
 nvpair id=cib-bootstrap-options-dc-version name=dc-version 
 value=1.0.1-node: 6fc5ce830
 2abf145a02891ec41e5a492efbe8efe/
   /cluster_property_set
 /crm_config
 nodes
   node id=3a8b681c-a14b-4037-a8e6-2d4af2eff88e uname=nomen.esri.com 
 type=normal/
   node id=a5e95310-f27d-418e-9cb9-42e50310f702 uname=rubric.esri.com 
 type=normal/
 /nodes
 resources
   master id=ms-drbd0
 meta_attributes id=ms-drbd0-meta_attributes
   nvpair id=ms-drbd0-meta_attributes-clone-max name=clone-max 
 value=2/
   nvpair id=ms-drbd0-meta_attributes-notify name=notify 
 value=true/
   nvpair id=ms-drbd0-meta_attributes-globally-unique 
 name=globally-unique value=false
 /
   nvpair name=target-role 
 id=ms-drbd0-meta_attributes-target-role value=Started/
 /meta_attributes
 primitive class=ocf id=drbd0 provider=heartbeat type=drbd
   instance_attributes id=drbd0-instance_attributes
 nvpair id=drbd0-instance_attributes-drbd_resource 
 name=drbd_resource value=r0/
   /instance_attributes
   operations id=drbd0-ops
 op id=drbd0-monitor-59s interval=59s name=monitor 
 role=Master timeout=30s/
 op id=drbd0-monitor-60s interval=60s name=monitor 
 role=Slave timeout=30s/
   /operations
 /primitive
   /master
 /resources
 constraints/
   /configuration
 /cib
 
 
 /var/log/messages:
 ==
 Mar  4 14:27:58 nomen crm_resource: [30167]: info: Invoked: crm_resource 
 --meta -r ms-drbd0 -p target-role -v Started
 Mar  4 14:27:58 nomen cib: [29899]: info: cib_process_xpath: Processing 
 cib_query op for 
 //cib/configuration/resources//*...@id=ms-drbd0]//meta_attributes//nvpa...@name=target-role]
  (/cib/configuration/resources/master/meta_attributes/nvpair[4])
 Mar  4 14:27:59 nomen crmd: [29903]: info: do_lrm_rsc_op: Performing 
 key=5:5:0:d4b86e31-ca4a-4033-8437-6486622eb19f op=drbd0:0_start_0 )
 Mar  4 14:27:59 nomen haclient: on_event:evt:cib_changed
 Mar  4 14:27:59 nomen lrmd: [29900]: info: rsc:drbd0:0: start
 Mar  4 14:27:59 nomen cib: [30168]: info: write_cib_contents: Wrote version 
 0.56.0 of the CIB to disk (digest: 2365d9802f1b9c55e0ed87b8ebda5db3)
 Mar  4 14:27:59 nomen cib: [30168]: info: retrieveCib: Reading cluster 
 configuration from: /var/lib/heartbeat/crm/cib.xml (digest: 
 /var/lib/heartbeat/crm/cib.xml.sig)
 Mar  4 14:27:59 nomen cib: [29899]: info: Managed write_cib_contents process 
 30168 exited with return code 0.
 Mar  4 14:27:59 nomen modprobe: FATAL: Module drbd not found.
 Mar  4 14:27:59 nomen lrmd: [29900]: info: RA output: (drbd0:0:start:stdout)
 Mar  4 14:27:59 nomen mgmtd: [29904]: info: CIB query: cib
 Mar  4 14:27:59 nomen lrmd: [29900]: info: RA output: (drbd0:0:start:stdout) 
 Could not stat(/proc/drbd): No such file or directory do you need to load 
 the module? try: modprobe drbd Command 'drbdsetup /dev/drbd0 disk /dev/sda5 
 /dev/sda5 internal --set-defaults --create-device --on-io-error=pass_on' 
 terminated with exit code 20 drbdadm attach 

[Linux-HA] showscores.sh for pacemaker 1.0.2

2009-03-03 Thread Dominik Klein
Hi

I made the necessary changes to the showscores script to work with
pacemaker 1.0.2.

Please test and report problems. Has been reported to work by some
people and should go into the repository soon. Still, I'd like more
people to test and confirm.

Important changes:
* correctly fetch stickiness and migration-threshold for complex
resources (master and clone)
* adjust column-width according to the length of resources' and nodes' names

Regards
Dominik


showscores.sh
Description: Bourne shell script
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] showscores for pacamaker-1.0

2009-03-02 Thread Dominik Klein
 showscores gives me:
 ~# ./showscores.sh 
 ResourceScore NodeStickiness #FailFail-
 Stickiness 
 50  0 
 
 on  50
 
 on  50
 
 on  50
 
 on  50
 
 on  50
 
 on  50
 
 on  50
 
 on  50   
 
 not the result I expected originally.
 
 @Dominik: Any chance to fix the script?
 
 ptest -s
 
 That is the kind of program I was looking for. Is there any explanation how 
 group_color and native_color are defined? Is there any update the the
 linux-ha.org/ScoreCalculation
 page for pacemaker =1.0?
 
 Thanks for enlightening answers.

I have a version that deals with most of the new things. I will post it
here soon.

If you want to test now, send me a private email.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] HA debug message

2009-02-16 Thread Dominik Klein
Tears ! wrote:
 Dear members!
 I have first time Install heartbeat on Slackware 12.2. I have
 enable debugging in ha.cf
 
 Here is the some debug message i want to describe here.
 
 Feb 14 23:01:15 haServer1 heartbeat: [15131]: WARN: Core dumps could be lost
 if multiple dumps occur.
 Feb 14 23:01:15 haServer1 heartbeat: [15131]: WARN: Consider setting
 non-default value in /proc/sys/kernel/core_pattern (or equivalent) for
 maximum supportability
 Feb 14 23:01:15 haServer1 heartbeat: [15131]: WARN: Consider setting
 /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum
 supportability

What about "Consider setting /proc/sys/kernel/core_uses_pid (or
equivalent) to 1 for maximum supportability" do you not understand?

 Now i just want to ask you are these message are realy serious? if yes then
 what should i do?

They're not serious in the sense that you have done anything wrong or
that something is not working correctly. It's just a suggestion. Set
those parameters and the developers might be able to debug your
problems more easily in case you hit core dumps.
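
For example (the core file location is just an example, pick whatever
suits you; the directory must exist):

echo 1 > /proc/sys/kernel/core_uses_pid
echo "/var/lib/heartbeat/cores/core.%e.%p" > /proc/sys/kernel/core_pattern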

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-12 Thread Dominik Klein
Did you look into the return codes and possibly tell Linbit about it?

That would be a big issue.

Regards
Dominik

jayfitzpatr...@gmail.com wrote:
 (Big Cheers and celebrations from this end!!!)
 
 Finally figured out what the problem was, it seems that the kernel oops
 were being caused by the 8.3 version of DRBD, once downgraded to 8.2.7
 everything started to work as it should. primary / secondary automatic
 fail over is in place and resources are now following the DRBD master!
 
 Thanks a mill for all the help.
 
 Jason
 
 On Feb 12, 2009 8:48am, Jason Fitzpatrick jayfitzpatr...@gmail.com wrote:
 Hi Dominik

 thanks again for the feedback,

 I had noticed some kernel opps's since the last kernel update that i and 
 they seem to be pointing to DRBD, i will downgrade the kernel again and
 see if this improves things,


 re Stonith I Uninstalled as part of the move from heartbeat v2.1 to 2.9 
 but must have missed this bit.

 user land and kernel module all report the same version.

 I am on my way into the office now and I will apply the changes once
 there


 thanks again

 Jason

 2009/2/12 Dominik Klein d...@in-telegence.net


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Is it possible to cleanly take down a resource in a v1 config?

2009-02-12 Thread Dominik Klein
Hi

heartbeat in v1 mode does not do resource monitoring by itself. So if
you did not set up any custom resource monitoring, you can just stop
your application in whatever way you normally do that and re-start it
whenever you like.

v1 clusters will not notice. They only see node state changes.

Regards
Dominik

Malcolm Turnbull wrote:
 With a two node linux-ha cluster you can add an ip address to
 haresources and then do a hb_takeover, and it will bring the interface
 up cleanly.
 But is their a way of taking down an resource cleanly? (just removing
 it from haresources and re-booting not being a good answer)
 or do you need to do a manual ifconfig eth0:x down...
 and if so how do you evaluate which x it is? (awk I guess?)
 
 Thanks.
 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-ha-dev] Patch: RA anything

2009-02-11 Thread Dominik Klein
Hi

I fixed most of the things Lars mentioned in
http://hg.linux-ha.org/dev/rev/15bcf3491f9c and will explain why I did
not fix some of them. ocf-tester runs fine with the RA.

 # FIXME: This should use pidofproc

pidofproc is not available everywhere and is not able to get down to
command line options, e.g. it could not tell the difference between
"$process $option_a" and "$process $option_b", which I wanted to
support with this agent.
agent.

Example:
dktest3:~/src/linuxha/hg/dev # sleep 200 
[1] 5799
dktest3:~/src/linuxha/hg/dev # sleep 300 
[2] 5801
dktest3:~/src/linuxha/hg/dev # pidofproc sleep
5801 5799
dktest3:~/src/linuxha/hg/dev # pidofproc sleep 300

 # FIXME: use start_daemon

start_daemon is not available everywhere either.

 # FIXME: What about daemons which can manage their own pidfiles?

This agent is meant to be used for programs that are not actually
daemons by design. It is meant to be able to run something stupid in
the cluster, even something like /bin/sleep 1000.

 # FIXME: use killproc

This is also a problem with $process $option_a and $process
$option_b. You can't just killproc $process then.

 # FIXME: Attributes special meaning to the resource id

I tried to, but couldn't understand what you meant here.

I also talked to Dejan on IRC and we agreed that "anything" is a bad
name for the RA and the changeset description was probably bad, too.
This RA is not for (as the changeset stated) "arbitrary daemons", it is
more for daemonizing programs which were not meant to be daemons.

If a proper name comes to anyone's mind - please share.

Hopefully, now it is a bit clearer what I wanted to be able to do with
this RA. I agree the cmd= lines and pid file creation are very very
ugly, but I could not yet find a better way. Not that much of a shell
genius I guess :( Please share if you can improve things.

Regards
Dominik
exporting patch:
# HG changeset patch
# User Dominik Klein d...@in-telegence.net
# Date 1234350091 -3600
# Node ID 04533b37813c8be009814f52de7b14ff65bf9862
# Parent  90ff997faa7288248ac57583b0c03df4c8e41bda
RA: anything. Implement most of lmbs suggestions.

diff -r 90ff997faa72 -r 04533b37813c resources/OCF/anything
--- a/resources/OCF/anything	Wed Feb 11 11:31:02 2009 +0100
+++ b/resources/OCF/anything	Wed Feb 11 12:01:31 2009 +0100
@@ -32,6 +32,7 @@
 #   OCF_RESKEY_errlogfile
 #   OCF_RESKEY_user
 #   OCF_RESKEY_monitor_hook
+#   OCF_RESKEY_stop_timeout
 #
 # This RA starts $binfile with $cmdline_options as $user and writes a $pidfile from that. 
 # If you want it to, it logs:
@@ -47,18 +48,20 @@
 # Initialization:
 . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs
 
-getpid() { # make sure that the file contains a number
-	# FIXME: pidfiles could contain spaces
-	grep '^[0-9][0-9]*$' $1
+getpid() {
+grep -o '[0-9]*' $1
 }
 
 anything_status() {
-	# FIXME: This should use pidofproc
-	# FIXME: pidfile w/o process means the process died, so should
-	# be ERR_GENERIC
-	if test -f $pidfile  pid=`getpid $pidfile`  kill -0 $pid
+	if test -f $pidfile
 	then
-		return $OCF_RUNNING
+		if pid=`getpid $pidfile`  kill -0 $pid
+		then
+			return $OCF_RUNNING
+		else
+			# pidfile w/o process means the process died
+			return $OCF_ERR_GENERIC
+		fi
 	else
 		return $OCF_NOT_RUNNING
 	fi
@@ -66,8 +69,6 @@
 
 anything_start() {
 	if ! anything_status
-	# FIXME: use start_daemon
-	# FIXME: What about daemons which can manage their own pidfiles?
 	then
 		if [ -n $logfile -a -n $errlogfile ]
 		then
@@ -101,29 +102,48 @@
 }
 
 anything_stop() {
-	# FIXME: use killproc
+if [ -n $OCF_RESKEY_stop_timeout ]
+then
+stop_timeout=$OCF_RESKEY_stop_timeout
+elif [ -n $OCF_RESKEY_CRM_meta_timeout ]; then
+# Allow 2/3 of the action timeout for the orderly shutdown
+# (The origin unit is ms, hence the conversion)
+stop_timeout=$((OCF_RESKEY_CRM_meta_timeout/1500))
+else
+stop_timeout=10
+fi
 	if anything_status
 	then
-		pid=`getpid $pidfile`
-		kill $pid
-		i=0
-		# FIXME: escalate to kill -9 before timeout
-		while sleep 1 
-		do
-			if ! anything_status
-			then
-rm -f $pidfile  /dev/null 21
-return $OCF_SUCCESS
-			fi
-			let i++
-		done
+pid=`getpid $pidfile`
+kill $pid
+rm -f $pidfile
+i=0
+while [ $i -lt $stop_timeout ]
+do
+while sleep 1 
+do
+if ! anything_status
+then
+return $OCF_SUCCESS
+fi
+let i++
+done
+done
+ocf_log warn Stop with SIGTERM failed/timed out, now sending SIGKILL.
+kill -9 $pid
+if ! anything_status

Re: [Linux-HA] failed dependencies while installing heartbeat 2.99.2-6.1

2009-02-11 Thread Dominik Klein
Gerd König wrote:
 Hi Dominik,
 
 thanks for answering quickly, but there were no dependencies found:
 
 #zypper search openipmi
 * Lese installierte Pakete [100%]
 Keine möglichen Abhängigkeiten gefunden.
 
 Do I need some additional software repositories ?

I don't think so. The packages should be available in your distribution.

Since you're using openSUSE, you can also try YaST software management
and see whether you can find the packages. They should come on the DVD,
I think.

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Failovercluster considered one node down but state transition did not happen succesfully

2009-02-11 Thread Dominik Klein
Zemke, Kai wrote:
 Hi,
 
  
 
 I'm running a two node failover cluster. Yesterday the cluster tried to 
 manage a state transition. In the log files I found the following entries:
 
  
 
 heartbeat[6905]: 2009/02/10_21:45:55 WARN: node nagios-drbd2: is dead
 
 heartbeat[6905]: 2009/02/10_21:45:55 info: Link nagios-drbd2:eth1 dead.
 
  
 
 A few minutes later the node that was still alive tried to take over the 
 resources and created the following entries in the log file ( the resource 
 ipaddress is an example, there are a lot more entries for the other 
 resources that were running on the cluster ):
 
  
 
 pengine[7370]: 2009/02/10_21:45:59 WARN: custom_action: Action 
 resource_nagios_ipaddress_stop_0 on nagios-drbd2 is unrunnable (offline)
 
 pengine[7370]: 2009/02/10_21:45:59 WARN: custom_action: Marking node 
 nagios-drbd2 unclean
 
  
 
 Further more there a several entries telling:
 
  
 
 stonithd[6916]: 2009/02/10_21:46:30 ERROR: Failed to STONITH the node 
 nagios-drbd2: optype=RESET, op_result=TIMEOUT
 
  
 
 The stonith is running via ssh on a direct link between the to nodes. Since 
 Node2 was down the shutdown command never reached its destination.

Which is why ssh stonith is not meant for production.

 My Questions are:
 
 Why did the alive cluster try to stop resources on a cluster node that is 
 considered as dead?
 
 Why did STONITH try to shut down a node that is considered down? ( for safety 
 reasons I think )

It is considered dead, but that does not have to be a fact. By shooting
it, the cluster makes the assumption a fact (turn it off or reboot it).

 Shouldn't the resources just be started on the alive node without any further 
 action?

Not until the cluster knows the other node is dead. Who knows what's
going on there if it cannot be communicated with.

 Did I miss something in the default behaviour of heartbeat? Maybe a timeout?
 
 Would a hardware STONITH device solve such problems in the future?

Yes.
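
Just as a sketch of what such a device could look like (external/ipmi
here; all parameter values are placeholders you would have to adapt,
and the plugin must match your hardware):

primitive st-drbd2 stonith:external/ipmi \
  params hostname="nagios-drbd2" ipaddr="192.168.1.12" \
         userid="admin" passwd="secret" interface="lan" \
  op monitor interval="60m" timeout="60s"
# keep the device off the node it is supposed to shoot
location l-st-drbd2 st-drbd2 -inf: nagios-drbd2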

Regards
Dominik
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-11 Thread Dominik Klein
Hi Jason

Jason Fitzpatrick wrote:
 I have disabled the services and run
 drbdadm secondary all
 drbdadm detach all
 drbdadm down all
 service drbd stop
 
 before testing as far as I can see (cat /proc/drbd on both nodes) drbd is
 shutdown
 
 cat: /proc/drbd: No such file or directory

Good.

 I have taken the command that heartbeat is running (drbdsetup /dev/drbd0
 disk /dev/sdb /dev/sdb internal --set-defaults --create-device
 --on-io-error=pass_on') 

The RA actually runs drbdadm up, which translates into this.

 and run it against the nodes when heartbeat is not
 in control and this command will bring the resources online, but re-running
 this command will generate the error, so I am kind of leaning twords the
 command being run twice?

Never seen the cluster do that.

Please post your configuration and logs. hb_report should gather
everything needed and put it into a nice .bz2 archive :)
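
e.g. (pick a start time shortly before the failed start attempts):

hb_report -f 3pm /tmp/report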

Regards
Dominik

 Thanks
 
 Jason
 
 2009/2/11 Dominik Klein d...@in-telegence.net
 
 Hi Jason

 any chance you started drbd at boot or the drbd device was active at the
 time you started the cluster resource? If so, read the introduction of
 the howto again and correct your setup.

 Jason Fitzpatrick wrote:
 Hi Dominik

 I have upgraded to HB 2.9xx and have been following the instructions that
 you provided (thanks for those) and have added a resource as follows

 crm
 configure
 primitive Storage1 ocf:heartbeat:drbd \
 params drbd_resource=Storage1 \
 op monitor role=Master interval=59s timeout=30s \
 op monitor role=Slave interval=60s timeout=30s
 ms DRBD_Storage Storage1 \
 meta clone-max=2 notify=true globally-unique=false target-role=stopped
 commit
 exit

 no errors are reported and the resource is visable from within the hb_gui

 when I try to bring the resource online with

 crm resource start DRBD_Storage

 I see the resource attempt to come online and then fail, it seems to be
 starting the services, changing the status of the devices to attached
 (from
 detached) but not setting any device to master

 the following is from the ha-log

 crmd[8020]: 2009/02/10_17:22:32 info: do_lrm_rsc_op: Performing
 key=7:166:0:b57f7f7c-4e2d-4134-9c14-b1a2b7db11a7 op=Storage1:1_start_0 )
 lrmd[8016]: 2009/02/10_17:22:32 info: rsc:Storage1:1: start
 lrmd[8016]: 2009/02/10_17:22:32 info: RA output:
 (Storage1:1:start:stdout)
 /dev/drbd0: Failure: (124) Device is attached to a disk (use detach
 first)
 Command
  'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults
 --create-device --on-io-error=pass_on' terminated with exit code 10
 This looks like drbdadm up is failing because the device is already
 attached to the lower level storage device.

 Regards
 Dominik

 drbd[22270]:2009/02/10_17:22:32 ERROR: Storage1 start: not in
 Secondary
 mode after start.
 crmd[8020]: 2009/02/10_17:22:32 info: process_lrm_event: LRM operation
 Storage1:1_start_0 (call=189, rc=1, cib-update=380, confirmed=true)
 complete
 unknown e
 rror
 .

 I have checked the DRBD device Storage1 and it is in secondary mode after
 the start, and should I choose I can make it primary on either node

 Thanks

 Jason

 2009/2/10 Jason Fitzpatrick jayfitzpatr...@gmail.com

 Thanks,

 This was the latest version in the Fedora Repos, I will upgrade and see
 what happens

 Jason

 2009/2/10 Dominik Klein d...@in-telegence.net

 Jason Fitzpatrick wrote:
 Hi All

 I am having a hell of a time trying to get heartbeat to fail over my
 DRBD
 harddisk and am hoping for some help.

 I have a 2 node cluster, heartbeat is working as I am able to fail
 over
 IP
 Addresses and services successfully, but when I try to fail over my
 DRBD
 resource from secondary to primary I am hitting a brick wall, I can
 fail
 over the DRBD resource manually so I know that it does work at some
 level
 DRBD version 8.3 Heartbeat version heartbeat-2.1.3-1.fc9.i386
 Please upgrade. Thats too old for reliable master/slave behaviour.
 Preferrably upgrade to pacemaker and ais or heartbeat 2.99. Read
 http://www.clusterlabs.org/wiki/Install for install notes.

 and using
 heartbeat-gui to configure
 Don't use the gui to configure complex (ie clone or master/slave)
 resources.

 Once you upgraded to the latest pacemaker, please refer to
 http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 for drbd's cluster
 configuration.

 Regards
 Dominik

 DRBD Resource is called Storage1, the 2 nodes are connected via 2
 x-over
 cables (1 heartbeat 1 Replication)

 I have stripped down my config to the bare bones and tried every
 option
 that I can think off but know that I am missing something simple,

 I have attached my cib.xml but have removed domain names from the
 systems
 for privacy reasons

 Thanks in advance

 Jason

  cib admin_epoch=0 have_quorum=true ignore_dtd=false
 cib_feature_revision=2.0 num_peers=2 generated=true
 ccm_transition=22 dc_uuid=9d8abc28-4fa3-408a-a695-fb36b0d67a48
 epoch=733 num_updates=1 cib-last-written=Mon Feb  9 18:31:19
 2009
configuration
  crm_config

Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-11 Thread Dominik Klein
The archive only contains info for one node and the logfile is empty.
Did you use appropriate -f time and does ssh work between the nodes?

So far, nothing obvious to me except for the order between your FS and
DRBD lacking the role definition, but that's not what your problem is
about (yet *g*).

Regards
Dominik

Jason Fitzpatrick wrote:
 Hi Dominik
 
 Thanks for the follow up, please find the file attached
 
 Jason
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-11 Thread Dominik Klein
Right, this one looks better.

I'll refer to nodes as 1001 and 1002.

1002 is your DC.
You have stonith enabled, but no stonith devices. Disable stonith or get
and configure a stonith device (_please_ dont use ssh).
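
If you go the disable route for now, that is a one-liner (crm shell
syntax, assuming your version has it):

crm configure property stonith-enabled="false"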

1002 ha-log lines 926:939, node 1002 wants to shoot 1001, but cannot (l
978). Retries in l 1018 and fails again in l 1035.

Then, the cluster tries to start drbd on 1001 in l 1079, followed by a
bunch of kernel messages I don't understand (pretty sure _this_ is the
first problem you should address!), ending up in the drbd RA not able to
see the secondary state (1449) and considering the start failed.

The RA code for this is
if do_drbdadm up $RESOURCE ; then
    drbd_get_status
    if [ "$DRBD_STATE_LOCAL" != "Secondary" ]; then
        ocf_log err "$RESOURCE start: not in Secondary mode after start."
        return $OCF_ERR_GENERIC
    fi
    ocf_log debug "$RESOURCE start: succeeded."
    return $OCF_SUCCESS
else
    ocf_log err "$RESOURCE: Failed to start up."
    return $OCF_ERR_GENERIC
fi

The cluster then successfully stops drbd again (l 1508-1511) and tries
to start the other clone instance (l 1523).

Log says
RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device
is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0
disk /dev/sdb /dev/sdb internal --set-defaults --create-device
--on-io-error=pass_on' terminated with exit code 10
Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in
Secondary mode after start.

So this is interesting. Although stop (basically drbdadm down)
succeeded, the drbd device is still attached!

Please try:
stop the cluster
drbdadm up $resource
drbdadm up $resource #again
echo $?
drbdadm down $resource
echo $?
cat /proc/drbd

Btw: Does your userland match your kernel module version?

To bring this to an end: The start of the second clone instance also
failed, so both instances are unrunnable on the node and no further
start is tried on 1002.

Interestingly, then (could not see any attempt before), the cluster
wants to start drbd on node 1001, but it also fails and also gives those
kernel messages. In l 2001, each instance has a failed start on each node.

So: Find out about those kernel messages. Can't help much on that
unfortunately, but there were some threads about things like that on
drbd-user recently. Maybe you can find answers to that problem there.

And also: please verify the return codes of drbdadm in your case. Maybe
that's a drbd tools bug? (Can't say for sure; for me, "up" on an already
up resource gives 1, which is ok.)

Regards
Dominik

Jason Fitzpatrick wrote:
 it seems that I had the incorrect version of openais installed (from the
 fedora repo vs the HA one)
 
 I have corrected and the hb_report ran correctly using the following
 
  hb_report -u root -f 3pm /tmp/report
 
 Please see attached
 
 Thanks again
 
 Jason

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-11 Thread Dominik Klein
Dominik Klein wrote:
 Right, this one looks better.
 
 I'll refer to nodes as 1001 and 1002.
 
 1002 is your DC.
 You have stonith enabled, but no stonith devices. Disable stonith or get
 and configure a stonith device (_please_ dont use ssh).
 
 1002 ha-log lines 926:939, node 1002 wants to shoot 1001, but cannot (l
 978). Retries in l 1018 and fails again in l 1035.
 
 Then, the cluster tries to start drbd on 1001 in l 1079, 

s/1001/1002

sorry
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-10 Thread Dominik Klein
Jason Fitzpatrick wrote:
 Hi All

 I am having a hell of a time trying to get heartbeat to fail over my DRBD
 harddisk and am hoping for some help.

 I have a 2 node cluster, heartbeat is working as I am able to fail over IP
 Addresses and services successfully, but when I try to fail over my DRBD
 resource from secondary to primary I am hitting a brick wall, I can fail
 over the DRBD resource manually so I know that it does work at some level

 DRBD version 8.3 Heartbeat version heartbeat-2.1.3-1.fc9.i386 

Please upgrade. That's too old for reliable master/slave behaviour.
Preferably upgrade to pacemaker and ais or heartbeat 2.99. Read
http://www.clusterlabs.org/wiki/Install for install notes.

 and using
 heartbeat-gui to configure

Don't use the gui to configure complex (ie clone or master/slave) resources.

Once you upgraded to the latest pacemaker, please refer to
http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 for drbd's cluster
configuration.

Regards
Dominik

 DRBD Resource is called Storage1, the 2 nodes are connected via 2 x-over
 cables (1 heartbeat 1 Replication)

 I have stripped down my config to the bare bones and tried every option
 that I can think off but know that I am missing something simple,

 I have attached my cib.xml but have removed domain names from the systems
 for privacy reasons

 Thanks in advance

 Jason

  <cib admin_epoch="0" have_quorum="true" ignore_dtd="false"
       cib_feature_revision="2.0" num_peers="2" generated="true"
       ccm_transition="22" dc_uuid="9d8abc28-4fa3-408a-a695-fb36b0d67a48"
       epoch="733" num_updates="1" cib-last-written="Mon Feb  9 18:31:19 2009">
    <configuration>
      <crm_config>
        <cluster_property_set id="cib-bootstrap-options">
          <attributes>
            <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
                    value="2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3"/>
            <nvpair name="last-lrm-refresh"
                    id="cib-bootstrap-options-last-lrm-refresh" value="1234204278"/>
          </attributes>
        </cluster_property_set>
      </crm_config>
      <nodes>
        <node id="df707752-d5fb-405a-8ca7-049e25a227b7" uname="lpissan1001"
              type="normal">
          <instance_attributes id="nodes-df707752-d5fb-405a-8ca7-049e25a227b7">
            <attributes>
              <nvpair id="standby-df707752-d5fb-405a-8ca7-049e25a227b7"
                      name="standby" value="off"/>
            </attributes>
          </instance_attributes>
        </node>
        <node id="9d8abc28-4fa3-408a-a695-fb36b0d67a48" uname="lpissan1002"
              type="normal">
          <instance_attributes id="nodes-9d8abc28-4fa3-408a-a695-fb36b0d67a48">
            <attributes>
              <nvpair id="standby-9d8abc28-4fa3-408a-a695-fb36b0d67a48"
                      name="standby" value="off"/>
            </attributes>
          </instance_attributes>
        </node>
      </nodes>
      <resources>
        <master_slave id="Storage1">
          <meta_attributes id="Storage1_meta_attrs">
            <attributes>
              <nvpair id="Storage1_metaattr_target_role" name="target_role"
                      value="started"/>
              <nvpair id="Storage1_metaattr_clone_max" name="clone_max"
                      value="2"/>
              <nvpair id="Storage1_metaattr_clone_node_max"
                      name="clone_node_max" value="1"/>
              <nvpair id="Storage1_metaattr_master_max" name="master_max"
                      value="1"/>
              <nvpair id="Storage1_metaattr_master_node_max"
                      name="master_node_max" value="1"/>
              <nvpair id="Storage1_metaattr_notify" name="notify"
                      value="true"/>
              <nvpair id="Storage1_metaattr_globally_unique"
                      name="globally_unique" value="false"/>
            </attributes>
          </meta_attributes>
          <primitive id="Storage1" class="ocf" type="drbd"
                     provider="heartbeat">
            <instance_attributes id="Storage1_instance_attrs">
              <attributes>
                <nvpair id="273a1bb2-4867-42dd-a9e5-7cebbf48ef3b"
                        name="drbd_resource" value="Storage1"/>
              </attributes>
            </instance_attributes>
            <operations>
              <op id="9ddc0ce9-4090-4546-a7d5-787fe47de872" name="monitor"
                  description="master" interval="29" timeout="10"
                  start_delay="1m" role="Master"/>
              <op id="56a7508f-fa42-46f8-9924-3b284cdb97f0" name="monitor"
                  description="slave" interval="29" timeout="10"
                  start_delay="1m" role="Slave"/>
            </operations>
          </primitive>
        </master_slave>
      </resources>
      <constraints/>
    </configuration>
  </cib>


 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD in a 2 node cluster

2009-02-10 Thread Dominik Klein
Hi Jason

any chance you started drbd at boot or the drbd device was active at the
time you started the cluster resource? If so, read the introduction of
the howto again and correct your setup.
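
In that case, a minimal cleanup sketch (assuming the resource is called
Storage1 and the sysv init script is named drbd, as on Fedora):

  chkconfig drbd off      # keep the init script from starting drbd at boot
  drbdadm down Storage1   # take the boot-started resource down on both nodes
  cat /proc/drbd          # the device should now show up as Unconfigured

Only then let the cluster start the master/slave resource.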

Jason Fitzpatrick wrote:
 Hi Dominik
 
 I have upgraded to HB 2.9xx and have been following the instructions that
 you provided (thanks for those) and have added a resource as follows
 
 crm
 configure
 primitive Storage1 ocf:heartbeat:drbd \
 params drbd_resource=Storage1 \
 op monitor role=Master interval=59s timeout=30s \
 op monitor role=Slave interval=60s timeout=30s
 ms DRBD_Storage Storage1 \
 meta clone-max=2 notify=true globally-unique=false target-role=stopped
 commit
 exit
 
 no errors are reported and the resource is visable from within the hb_gui
 
 when I try to bring the resource online with
 
 crm resource start DRBD_Storage
 
 I see the resource attempt to come online and then fail. It seems to be
 starting the services and changing the status of the devices to attached
 (from detached), but not setting any device to master
 
 the following is from the ha-log
 
 crmd[8020]: 2009/02/10_17:22:32 info: do_lrm_rsc_op: Performing
 key=7:166:0:b57f7f7c-4e2d-4134-9c14-b1a2b7db11a7 op=Storage1:1_start_0 )
 lrmd[8016]: 2009/02/10_17:22:32 info: rsc:Storage1:1: start
 lrmd[8016]: 2009/02/10_17:22:32 info: RA output: (Storage1:1:start:stdout)
 /dev/drbd0: Failure: (124) Device is attached to a disk (use detach first)
 Command
  'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal --set-defaults
 --create-device --on-io-error=pass_on' terminated with exit code 10

This looks like drbdadm up is failing because the device is already
attached to the lower level storage device.
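
You can see that state before retrying (resource name Storage1 assumed):

  cat /proc/drbd            # a ds: value other than Diskless means a disk is still attached
  drbdadm detach Storage1   # what the error message asks for; drbdadm down also does it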

Regards
Dominik

 drbd[22270]: 2009/02/10_17:22:32 ERROR: Storage1 start: not in Secondary
 mode after start.
 crmd[8020]: 2009/02/10_17:22:32 info: process_lrm_event: LRM operation
 Storage1:1_start_0 (call=189, rc=1, cib-update=380, confirmed=true) complete
 unknown error.
 
 I have checked the DRBD device Storage1 and it is in secondary mode after
 the start, and should I choose I can make it primary on either node
 
 Thanks
 
 Jason
 
 2009/2/10 Jason Fitzpatrick jayfitzpatr...@gmail.com
 
 Thanks,

 This was the latest version in the Fedora Repos, I will upgrade and see
 what happens

 Jason

 2009/2/10 Dominik Klein d...@in-telegence.net

 Jason Fitzpatrick wrote:
 Hi All

 I am having a hell of a time trying to get heartbeat to fail over my
 DRBD
 harddisk and am hoping for some help.

 I have a 2 node cluster, heartbeat is working as I am able to fail over
 IP
 Addresses and services successfully, but when I try to fail over my
 DRBD
 resource from secondary to primary I am hitting a brick wall, I can
 fail
 over the DRBD resource manually so I know that it does work at some
 level
 DRBD version 8.3 Heartbeat version heartbeat-2.1.3-1.fc9.i386
 Please upgrade. That's too old for reliable master/slave behaviour.
 Preferably upgrade to pacemaker and ais or heartbeat 2.99. Read
 http://www.clusterlabs.org/wiki/Install for install notes.

 and using
 heartbeat-gui to configure
 Don't use the gui to configure complex (ie clone or master/slave)
 resources.

 Once you upgraded to the latest pacemaker, please refer to
 http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0 for drbd's cluster
 configuration.

 Regards
 Dominik

 DRBD Resource is called Storage1, the 2 nodes are connected via 2
 x-over
 cables (1 heartbeat 1 Replication)

 I have stripped down my config to the bare bones and tried every option
 that I can think off but know that I am missing something simple,

 I have attached my cib.xml but have removed domain names from the
 systems
 for privacy reasons

 Thanks in advance

 Jason

  <cib admin_epoch="0" have_quorum="true" ignore_dtd="false"
       cib_feature_revision="2.0" num_peers="2" generated="true"
       ccm_transition="22" dc_uuid="9d8abc28-4fa3-408a-a695-fb36b0d67a48"
       epoch="733" num_updates="1" cib-last-written="Mon Feb  9 18:31:19 2009">
    <configuration>
      <crm_config>
        <cluster_property_set id="cib-bootstrap-options">
          <attributes>
            <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
                    value="2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3"/>
            <nvpair name="last-lrm-refresh"
                    id="cib-bootstrap-options-last-lrm-refresh" value="1234204278"/>
          </attributes>
        </cluster_property_set>
      </crm_config>
      <nodes>
        <node id="df707752-d5fb-405a-8ca7-049e25a227b7" uname="lpissan1001"
              type="normal">
          <instance_attributes id="nodes-df707752-d5fb-405a-8ca7-049e25a227b7">
            <attributes>
              <nvpair id="standby-df707752-d5fb-405a-8ca7-049e25a227b7"
                      name="standby" value="off"/>
            </attributes>
          </instance_attributes>
        </node>
        <node id="9d8abc28-4fa3-408a-a695-fb36b0d67a48" uname="lpissan1002"
              type="normal">
          <instance_attributes id="nodes-9d8abc28-4fa3-408a-a695-fb36b0d67a48">
            <attributes>
              <nvpair id="standby-9d8abc28-4fa3-408a-a695-fb36b0d67a48"

Re: [Linux-HA] failed dependencies while installing heartbeat 2.99.2-6.1

2009-02-10 Thread Dominik Klein
Gerd König wrote:
 Hello list,
 
 I wanted to start with heartbeat using the latest sources for
 OpenSuse10.3 64bit.
 I've downloaded these rpm's:
 
 heartbeat-2.99.2-6.1.x86_64.rpm
 heartbeat-common-2.99.2-6.1.x86_64.rpm
 heartbeat-debuginfo-2.99.2-6.1.x86_64.rpm
 heartbeat-resources-2.99.2-6.1.x86_64.rpm
 libheartbeat2-2.99.2-6.1.x86_64.rpm
 libopenais2-0.80.3-12.2.x86_64.rpm
 libpacemaker3-1.0.1-3.1.x86_64.rpm
 libpacemaker-devel-1.0.1-3.1.x86_64.rpm
 openais-0.80.3-12.2.x86_64.rpm
 pacemaker-1.0.1-3.1.x86_64.rpm
 pacemaker-debuginfo-1.0.1-3.1.x86_64.rpm
 pacemaker-pygui-1.4-11.9.x86_64.rpm
 pacemaker-pygui-debuginfo-1.4-11.9.x86_64.rpm
 pacemaker-pygui-devel-1.4-11.9.x86_64.rpm
 
 and started to install them, but I'm stuck in installing
 heartbeat-common package. The command
 rpm -Uvh heartbeat-common-2.99.2-6.1.x86_64.rpm produces this error
 message:
 
 rpm -Uvh heartbeat-common-2.99.2-6.1.x86_64.rpm
 warning: heartbeat-common-2.99.2-6.1.x86_64.rpm: Header V3 DSA
 signature: NOKEY, key ID 1d362aeb
 error: Failed dependencies:
 libOpenIPMI.so.0()(64bit) is needed by
 heartbeat-common-2.99.2-6.1.x86_64
 libOpenIPMIposix.so.0()(64bit) is needed by
 heartbeat-common-2.99.2-6.1.x86_64
 libOpenIPMIutils.so.0()(64bit) is needed by
 heartbeat-common-2.99.2-6.1.x86_64
 

Well, looks like you need some openipmi library packages. Try zypper
search openipmi and install the appropriate packages.
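
Something along these lines should do (a sketch; the exact package
providing libOpenIPMI.so.0 may be named differently in your repos):

  zypper search openipmi   # see which OpenIPMI packages are available
  zypper install OpenIPMI  # then install whatever the search returned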

Regards
Dominik

 What I've installed so far:
 rpm -qa | egrep -i "heart|openai|pace"
 libheartbeat2-2.99.2-6.1
 heartbeat-resources-2.99.2-6.1
 openais-0.80.3-12.2
 libopenais2-0.80.3-12.2
 
 
 What's going wrong here?
 
 Any help appreciated. GERD.
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] OCF_ERROR_GENERIC

2009-02-04 Thread Dominik Klein
It is OCF_ERR_GENERIC, not OCF_ERROR_GENERIC. Read
/usr/lib/ocf/resource.d/heartbeat/.ocf-returncodes

You can also use ocf-tester to test your ocf script.
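
For illustration, a minimal sketch of a start action that reports failure
the OCF way (the daemon command and function name are purely hypothetical;
the shellfuncs path assumes the same directory as the returncodes file
mentioned above):

  #!/bin/sh
  # pulls in ocf_log and the OCF_* return codes (it sources .ocf-returncodes)
  . /usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs

  my_start() {
      if ! /usr/local/bin/mydaemon --start; then   # hypothetical daemon
          ocf_log err "mydaemon failed to start"
          return $OCF_ERR_GENERIC                  # numeric 1, i.e. the same as exit(1)
      fi
      return $OCF_SUCCESS
  }

A quick test run then looks something like
ocf-tester -n testrsc -o param=value /usr/lib/ocf/resource.d/yourprovider/youragent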

Regards
Dominik

lakshmipadmaja maddali wrote:
 Hi all,
 
 I have a strange issue: ocf_error_generic is being ignored at times.
 
For example, suppose I explicitly return ocf_error_generic in the start
function of the ocf script: as far as I know, the services should shift to
the secondary node, but that's not happening. Instead, heartbeat calls the
monitor function again and again.
But if I explicitly use exit(1) in the start function, then heartbeat
shifts its services to the secondary node.
 
Same is the case with any function in the ocf script.
 
So, why is this happening?
 
Please help me out.
 
Regards,
padmaja
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

