Re: [Linux-HA] Admin of heartbeat 2.13 on Debian Lenny is a PITA

2011-01-04 Thread Tobias Appel
On 01/04/2011 12:31 PM, Imran Chaudhry wrote:
> Hi List,
>
> Has anyone found a good solution to administering an established
> 2-node cluster running heartbeat 2.13 on Debian Lenny?
>
I have 2.1.4 on RHEL5 still running. It also has the GUI (although it 
can be dangerous).

> b) Save the CIB XML, delete the CIB, hand-hack the XML and then
> re-init the cluster from scratch.
>
Aren't you exaggerating a bit? I thought cibadmin -R or -U does not 
necessarily affect running resources - correct me if I'm wrong!
And you don't have to save the whole CIB; you can select the object 
type, with -o resources for example.
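For example, the section-scoped workflow would look something like this (a sketch; flags as in 2.1.x-era cibadmin, so verify against your version's man page):

```shell
# Dump only the resources section instead of the whole CIB
cibadmin -Q -o resources > resources.xml
# hand-edit resources.xml, then push back just that section;
# -R replaces the section, -U applies a (partial) update
cibadmin -R -o resources -x resources.xml
```

This should not touch resources whose definitions are unchanged, but trying it on a test cluster first is prudent.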

Bye,
Tobi


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Is 'resource_set' still experimental?

2011-01-04 Thread Tobias Appel
On 12/28/2010 06:46 PM, Dejan Muhamedagic wrote:

>
> 40 order constraints? A big cluster.
>

We currently have 40 VMs (Xen) on it. I can't put them in a group since 
they have to run independently and not necessarily on the same node(s). 
To make it worse, I also have location constraints and additional 
constraints for ping(d).

> Yes, the GUI should work. You can also use the crm shell to edit
> constraints. It should look sth like this:
>
> order o inf: lvm ( rsc1 rsc2 ... rscn )
>
> providing that rsc1 ... rscn don't depend on each other.
>
>> And is this feature now safe to use?
>
> Should be. But the number of resources you want to put into the
> set is unusually large. Try this first on a test cluster. It is
> also possible to use several constraints:
>
> order o1 inf: lvm ( rsc1 rsc2 ... rscm )
> order o2 inf: lvm ( rscm+1 rscm+2 ... )
> ...
>
I did it now similarly to the documentation, which is pretty sleek since 
in the GUI I now have only one entry :)


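A sketch of what such a single order constraint with resource sets can look like in pacemaker 1.1 XML (ids and resource names made up):

```xml
<rsc_order id="order-lvm-vms" score="INFINITY">
  <resource_set id="order-lvm-vms-set-1">
    <resource_ref id="lvm"/>
  </resource_set>
  <resource_set id="order-lvm-vms-set-2" sequential="false">
    <resource_ref id="vm01"/>
    <resource_ref id="vm02"/>
    <resource_ref id="vm03"/>
  </resource_set>
</rsc_order>
```

sequential="false" on the second set expresses that the VMs only depend on lvm, not on each other.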
And it looks good so far. But we have quite some problems with the 
cluster. A complete (friendly, i.e. I manually put one node into standby) 
failover from one node to the other takes a lot of time because all the 
VMs have to be live-migrated. Now it seems that pacemaker only migrates 
one resource at a time, and I remember Andrew telling me that I should 
add kind="Serialize" to my resource_set. So I would just set 
kind="Serialize" on that order constraint - correct?
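For reference, kind is an attribute of the order constraint element itself; a sketch (ids made up; check the pacemaker 1.1 docs for whether kind and score may be combined):

```xml
<rsc_order id="order-lvm-vms" kind="Serialize">
  <resource_set id="order-lvm-vms-set-1">
    <resource_ref id="lvm"/>
  </resource_set>
  <resource_set id="order-lvm-vms-set-2" sequential="false">
    <resource_ref id="vm01"/>
    <resource_ref id="vm02"/>
  </resource_set>
</rsc_order>
```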

Bye,
Tobi


[Linux-HA] Is 'resource_set' still experimental?

2010-12-28 Thread Tobias Appel
Hi,

I have purchased Dr. Schwartzkopff's latest book (in German) about 
pacemaker. In his chapter about constraints he introduces the 
resource_set feature, which sounds pretty neat. It would save me quite a 
few lines of configuration (for example, I have around 40 individual 
resources which all have to be started after LVM - so currently I have 
40 extra order constraints which I could fold into one using resource_set).

Though in his book he writes that this feature is quite new, might 
still be buggy, and is thus not recommended for production use - I don't 
know, though, which version of pacemaker he based this assessment on.

Currently we use 1.1.2. At least in the GUI I don't see any option to 
use resource_set within another constraint. Will the GUI still work if I 
manually add the required XML code to the cib?

And is this feature now safe to use?

Thanks in advance and happy new year to everyone.

Bye,
Tobi


Re: [Linux-HA] Question about SLES HAE License

2010-10-18 Thread Tobias Appel
Thanks for the quick info.

Just a theoretical question (I think we might subscribe anyway): what 
would keep me from downloading the newest trial version every few months?

On 10/18/2010 01:29 PM, Stratos Zolotas wrote:
> Yes that's correct. You can install the latest packages with the trial HAE
> extension of SLES, but you will not have support and updates if you don't
> subscribe.
>
> Stratos.
>
> On Mon, Oct 18, 2010 at 2:26 PM, Tobias Appel  wrote:
>
>> Hi all,
>>
>> I'm not sure this is the right place to ask, but I think you guys can
>> help me out the best.
>>
>> The thing is that we want to install a new cluster using Pacemaker.
>> Instead of using RHEL, we are thinking about installing SLES. From what I
>> understand, the only way to install pacemaker with a package manager
>> (YaST or zypper on SLES) is via the HAE. This comes only as a free
>> trial, but from what I gather, if I purchase this extension I pay for
>> technical support, and updates are also mentioned. Does this mean that
>> if I don't subscribe I will not be able to download updates via my
>> package manager?
>>
>> So I either have the choice of building from source by hand for free, or
>> using the extension (which only makes sense if I subscribe, I guess).
>>
>> Did I understand this correctly?
>>
>> Thanks for your answers.
>>
>> Bye,
>> Tobi


[Linux-HA] Question about SLES HAE License

2010-10-18 Thread Tobias Appel
Hi all,

I'm not sure this is the right place to ask, but I think you guys can 
help me out the best.

The thing is that we want to install a new cluster using Pacemaker. 
Instead of using RHEL, we are thinking about installing SLES. From what 
I understand, the only way to install pacemaker with a package manager 
(YaST or zypper on SLES) is via the HAE. This comes only as a free 
trial, but from what I gather, if I purchase this extension I pay for 
technical support, and updates are also mentioned. Does this mean that 
if I don't subscribe I will not be able to download updates via my 
package manager?

So I either have the choice of building from source by hand for free, or 
using the extension (which only makes sense if I subscribe, I guess).

Did I understand this correctly?

Thanks for your answers.

Bye,
Tobi


Re: [Linux-HA] Limit amount of resources migrating at the same time

2010-09-21 Thread Tobias Appel
On 09/21/2010 09:02 AM, Andrew Beekhof wrote:
> It's possible in pacemaker 1.1, but not very well documented.
> Basically you create an ordered set and set kind=Serialize

Thanks for the info. I found the documentation on ordered sets, but as 
you said, kind=Serialize is not yet included (well, the documentation 
covers pacemaker 1.0, duh).
Anyway, this is very good information for me.

Bye,
Tobi


[Linux-HA] Limit amount of resources migrating at the same time

2010-09-15 Thread Tobias Appel
Hi all,

it's been some time since I last worked with Heartbeat. Now a workmate 
asked me this question and hopefully I can get a short answer from you guys.

The problem is that we use Xen on a Heartbeat cluster with a lot of 
virtual machines. Now in certain cases there is a live migration of 
virtual machines from one node to another. This works fine. The only 
problem is that we don't know if we can limit the amount of resources 
(virtual machines) that get migrated at the same time.

If heartbeat tries to migrate all of the VMs at the same time, the RAM 
fills up and errors are thrown, resulting in the resources going into an 
unmanaged state.

So I just need to know if there is an option, or constraint to limit the 
migration of resources from one node to another.

We are still using Heartbeat 2.1.4 or something like that, so it would 
also be good to know if the (possible) option is only included in pacemaker.

If nothing of that sort is available I will do further research on the 
issue.

Thanks in advance.

Tobi


Re: [Linux-HA] Switching after reboot

2009-12-17 Thread Tobias Appel
On 12/16/2009 05:42 PM, Artur Kamiński wrote:
> Andrew Beekhof pisze:
>> On Wed, Dec 16, 2009 at 8:48 AM, artur.k  wrote:
>>
>>> I have built a cluster with two nodes on pacemacker 1.0.4 + DRBD (8.0.14). 
>>> If one machine is restarted after returning pacemacker trying to switch all 
>>> services to this server. How to prevent it?
>>>
>>
>> Set default-resource-stickiness to something higher than 200.
>> If you don't want it to move under any circumstances, set it to INFINITY.
>>

Will there still be a failover if it is set to INFINITY and one resource 
can't run on the current node (fails, is restarted but fails again and 
so on)?
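For reference, that property can be set from the shell; a sketch (pacemaker 1.0-era crm_attribute flags; value as an example):

```shell
# Set cluster-wide default stickiness so resources stay where they are
crm_attribute -t crm_config -n default-resource-stickiness -v INFINITY
# Read it back
crm_attribute -t crm_config -n default-resource-stickiness -G
```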

Bye,
Tobi

[Linux-HA] How to use Xinetd OCF Script?

2009-12-14 Thread Tobias Appel
Hi all,

I've been trying to get this to work for hours now. Thing is, I'm not 
really familiar with xinetd. All I can tell is that the LSB script is 
not heartbeat compatible!
Thus I wanted to use the ocf agent.
Well it only takes one parameter, the service name. Sounds easy enough.

From what I've gathered, I'm supposed to leave the xinetd service 
running all the time and heartbeat will just enable the specific service.

OK, I have a few configuration files under /etc/xinetd.d/; one of them 
is my NSCA service for Nagios. But this one usually starts as soon as I 
start xinetd, which should not be the case on the passive node.

So how should I configure xinetd? I tried to disable the service in 
xinetd.conf, but then heartbeat tells me the OCF agent configured with 
this service's name is running when in fact the service is still disabled.

Then I tried to set the default value to off in the nsca xinetd config 
file, but with the same result. The OCF agent always tells me the 
service is running when in fact it does nothing at all.
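For comparison, the per-service switch I toggled looks like this (a sketch; attributes other than disable are illustrative):

```
# /etc/xinetd.d/nsca -- only the "disable" line matters for this discussion
service nsca
{
        disable         = yes
        socket_type     = stream
        protocol        = tcp
        wait            = no
        user            = nagios
        server          = /usr/sbin/nsca
}
```

With disable = yes, xinetd keeps running but does not answer on that service's port until something (e.g. the resource agent) flips it back.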

Maybe I misunderstand the OCF agent and it also manages the xinetd 
service itself, and I don't need to touch any particular service inside 
- but then what should I give it as a parameter?

I hope someone can explain this to me as I have found nothing at all 
using google :(

Oh, by the way, I'm using the OCF agent from Heartbeat 2.1.4. It seems 
pretty old - maybe I just need a newer one?

Bye,
Tobi


[Linux-HA] Updating from Heartbeat to Pacemaker

2009-12-14 Thread Tobias Appel
Hi,

I'm thinking about doing an upgrade of my Heartbeat 2.1.4 installation 
to Pacemaker.
I had a look at this site: http://www.clusterlabs.org/wiki/Upgrade

And now I'm wondering if the information is still up-to-date.

It says for Rolling Upgrade:

This method is currently broken between Pacemaker 0.6.x and 1.0.x

So does this mean I upgrade my Cluster to Pacemaker 0.6.x and then I'm 
stuck and can't go to 1.0.x ?

Am I only left with the Disconnect & Reattach option then?

I don't think I need much from the current configuration anyway; 
basically I have only one group of resources and a few constraints for 
DRBD. Could I use the Disconnect & Reattach approach, go directly to 
Pacemaker 1.0.x, and just create a new configuration?

Thanks for your input.

Bye,
Tobi


[Linux-HA] What exactly is 'is-managed-default' used for?

2009-12-14 Thread Tobias Appel
Hi,

I'm still running Heartbeat 2 and according to this site: 
http://www.linux-ha.org/ciblint/crm_config

the option means the following:

is-managed-default [boolean: default=true]: Should the cluster 
start/stop resources as required

Can someone please elaborate this? The problem I'm facing now in my 
2-node cluster is the following:

- one node dies and gets STONITH'd (poweroff)
- the node gets powered on by us and joins the cluster again
- Heartbeat stops all resources on the current node and starts them up 
on the newly joined node

This is very inconvenient for us, as we have a short outage, and the 
DRBD disc is not yet synchronized on the failed node, so we sometimes 
run into trouble.

Can I stop this behaviour using the option mentioned above?
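If it can, I imagine toggling it during maintenance would look something like this (a sketch; flags as in 2.x crm_attribute, and unmanaged resources are no longer monitored or recovered, so use with care):

```shell
# Tell the cluster to leave resources alone while the node rejoins
crm_attribute -t crm_config -n is-managed-default -v false
# after the fenced node has rejoined and DRBD has resynced:
crm_attribute -t crm_config -n is-managed-default -v true
```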

Thanks in advance,

Tobi


Re: [Linux-HA] heartbeat - execute a script on a running node when the other node is back?

2009-11-16 Thread Tobias Appel
On 11/15/2009 09:09 PM, Tomasz Chmielewski wrote:
> I have two nodes, node_1 and node_2.
>
> node_2 was down, but is now up.
>
>
> How can I execute a custom script on node_1 when it detects that node_2
> is back?
>
>

This is a little off-topic for the heartbeat list, I guess, but we use 
Nagios to monitor our heartbeat clusters, and you can use Nagios event 
handlers to execute scripts when there is a state change.

I don't think that you would want to install nagios just for this, but 
in case you are running Nagios or something similar already you might 
try this approach.

Bye,
Tobi


Re: [Linux-HA] Lot of core dumps found - should I worry about it?

2009-11-16 Thread Tobias Appel
On 11/16/2009 11:27 AM, Dejan Muhamedagic wrote:

>>
>> Can I stop Heartbeat from creating those?
>
> Yes. Add "coredumps false" to ha.cf. Though if something really
> goes wrong and you don't have a coredump we'd probably ask you to
> reproduce :)
>
> Thanks,
>
> Dejan

Thanks, that's fine by me, since for the moment I just have to make 
sure that the root partition doesn't fill up.
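For the archives, the suggested ha.cf change is a one-liner (path as on a typical install):

```
# /etc/ha.d/ha.cf -- disable heartbeat core dump generation
coredumps false
```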



[Linux-HA] Lot of core dumps found - should I worry about it?

2009-11-16 Thread Tobias Appel
Hi,

well, Nagios informed me today that the root partition of my Heartbeat 
cluster is getting full. After a short investigation I found that this 
directory is over 2 GB in size:

/var/lib/heartbeat/cores/root/

Over 250 of those files were in there:

-rw---  1 root   root   8228864 Nov 16 11:08 core.8251

Heartbeat runs fine and stable, though. I know that one of the two 
Ethernet interfaces I use for hb (eth1 and eth3) crashes a lot due to a 
driver error (a problem with Sun/NVIDIA and RedHat, no fix yet), and I 
suppose that's why there are core dumps - because Heartbeat notices that 
the link is down.
Other than that, I don't think anything is wrong.

Also those core dumps happen only on the active node in our two-node 
cluster. None are on the passive node.

Can I stop Heartbeat from creating those?

Thanks in advance,
Tobi


Re: [Linux-HA] Constraints works for one resource but not for another

2009-08-17 Thread Tobias Appel
On 08/17/2009 01:08 PM, Dominik Klein wrote:
>
> The constraints look okay, but without logs, we cannot say why it does
> not do what you want.
>
> Also: look at the stonith device configuration: Is it okay for both
> primitives to have the same ip configured? I'd guess that will not
> successfully start the resource!? Maybe that's it already.
>
> I'd guess there was some failure before which brought up this situation
> (probably stonith start fail and stop fail)?
>
> It shouldn't turn off nodes at random. There's usually a pretty good
> reason when the cluster does this.
>
> Btw: the better way to make sure, a particular resource is only started
> on one node but never on the other is usually to configure -INFINITY for
> the "other" node instead of INFINITY for the node you want it to run on.
>
> Regards
> Dominik

Well I can certainly change the constraints to -INFINITY.
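A sketch of that -INFINITY variant in heartbeat 2.x CIB XML, pinning the resource away from the node it fences (ids made up):

```xml
<rsc_location id="loc-ipmi-nagios1" rsc="ipmi_nagios1">
  <rule id="loc-ipmi-nagios1-rule" score="-INFINITY">
    <expression id="loc-ipmi-nagios1-expr" attribute="#uname" operation="eq" value="nagios1"/>
  </rule>
</rsc_location>
```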

As for the IP address: like I said, it's confusing. Each IPMI card is 
directly attached to one NIC in each host; that's why I can use the same 
IP. The connection is fine as well - I can reach the IPMI web interface 
from a browser just fine.

I did monitor the logs and tried to start it again and the only thing I 
found was this:

Aug 17 11:07:13 nagios2 stonithd: [5768]: info: external_run_cmd: 
Calling '/usr/lib/stonith/plugins/external/ipmi status' returned 256

then it went into a failed state and tried to start on the other node, 
which of course failed as well.

Still, this seems to be the problem. I just found out that I can't 
connect using ipmitool from nagios2 to nagios1. Via web browser it works 
fine, and ipmitool works fine from nagios1 to nagios2. But this explains 
why the resource could not be started.

[r...@nagios2 ~]# ipmitool -H 11.0.0.2 -U root -P xxx  power status
Error: Unable to establish LAN session
Unable to get Chassis Power Status


So I first have to fix this, but then I have to look into why heartbeat 
tried to start it on the wrong node - maybe rewriting my constraints 
will help.

Thanks for the input Dominik - as always you are my saviour :)




[Linux-HA] Constraints works for one resource but not for another

2009-08-17 Thread Tobias Appel
Hi,

I have a very weird error with heartbeat version 2.14.

I have two IPMI resources for my two nodes. The configuration is posted 
here: http://pastebin.com/m52c1809c

node1 is named nagios1
node2 is named nagios2

now I have ipmi_nagios1 (which should run on nagios2 to shutdown nagios1)
and ipmi_nagios2 (which should run on nagios1 to shutdown nagios2).

It's confusing I know.

Now I set up two constraints which force, with score INFINITY, each 
resource to run only on its designated node.

For the resource ipmi_nagios2 it works without a problem. It only runs 
on nagios1 and is never started on nagios2. But the other resource, 
which is configured identically (just the hostname differs), does not 
work - heartbeat always wants to start it on nagios1 and only very 
seldom starts it on nagios2. Just now it failed to start on nagios1; I 
hit clean up resource, waited a bit, it failed again, and after 3 tries 
the cluster went haywire and turned off one of the nodes!

I even tried to set a constraint via the UI - it's then labeled 
cli-constraint-name - but even with this, heartbeat still tried to 
start it on the wrong node!

Now I'm really at a loss, maybe my configuration is wrong, or maybe it 
really is a bug in heartbeat.

Here is the link to the configuration again: http://pastebin.com/m52c1809c

I honestly don't know what to do anymore. I have to stop the ipmi 
service at the moment, because otherwise it might randomly turn off one 
of the nodes, but without it we don't have any fencing, so it's quite a 
delicate situation.

Any input is greatly appreciated.

Regards,
Tobias



Re: [Linux-HA] Command to see if a resource is started or not

2009-08-05 Thread Tobias Appel
On 08/05/2009 11:09 AM, Dominik Klein wrote:
> Tobias Appel wrote:
>> On 08/05/2009 10:30 AM, Dominik Klein wrote:
>>> Tobias Appel wrote:
>>>> So all I need is a command line tool to check whether a resource is
>>>> currently started or not. I tried to check the resources with the
>>>> failcount command, but it's always 0. And the crm_resource command is
>>>> used to configure a resource but does not seem to give me the status of
>>>> a resource.
>>>>
>>>> I know I can use crm_mon but I would rather have a small command since I
>>>> could include this in our monitoring tool (nagios).
>>> crm resource status
>>>
>>> Regards
>>> Dominik
>> Thanks for the fast reply Dominik,
>>
>> I forgot to mention that I still run Heartbeat version 2.1.4.
>> It seems crm_resource does not respond to the status flag. Or am I too
>> stupid?
>
> It is not crm_resource, I meant crm resource (notice the blank).
>
> But the crm command is not in 2.1.4
>
> Try crm_resource -W -r
>
> Regards
> Dominik


Thanks a lot - this is exactly what I needed!
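A sketch of how this could be wrapped for Nagios; the crm_resource -W output line is stubbed here with an assumed format, so check what 2.1.4 actually prints before relying on it:

```shell
# Classify one line of `crm_resource -W -r <resource>` output
check_output() {
    case "$1" in
        *"is running on"*) echo "OK" ;;
        *"NOT running"*)   echo "CRITICAL" ;;
        *)                 echo "UNKNOWN" ;;
    esac
}

# Real usage would be:  check_output "$(crm_resource -W -r ipmi_nagios1 2>&1)"
check_output "resource ipmi_nagios1 is running on: nagios1"
```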



Re: [Linux-HA] Command to see if a resource is started or not

2009-08-05 Thread Tobias Appel
On 08/05/2009 10:30 AM, Dominik Klein wrote:
> Tobias Appel wrote:
>>
>> So all I need is a command line tool to check whether a resource is
>> currently started or not. I tried to check the resources with the
>> failcount command, but it's always 0. And the crm_resource command is
>> used to configure a resource but does not seem to give me the status of
>> a resource.
>>
>> I know I can use crm_mon but I would rather have a small command since I
>> could include this in our monitoring tool (nagios).
>
> crm resource status
>
> Regards
> Dominik

Thanks for the fast reply Dominik,

I forgot to mention that I still run Heartbeat version 2.1.4.
It seems crm_resource does not respond to the status flag. Or am I too 
stupid?

Bye,
Tobi


[Linux-HA] Command to see if a resource is started or not

2009-08-05 Thread Tobias Appel
Hi,

I need a command to see if a resource is started or not. Somehow my 
IPMI resource does not always start, especially on one node (for example 
after I reboot the node or after a failover). There is no error or 
anything; it just does nothing at all.
Usually I have to clean up the resource and then it comes back by itself.
This is not really a problem, since it only occurs after a failover or 
reboot, and when that happens somebody usually takes a look at the 
cluster anyway. But some people forget to start it again, and when we do 
maintenance we have to turn it off on purpose, since otherwise it would 
wreak havoc and turn off one of the nodes.

So all I need is a command-line tool to check whether a resource is 
currently started or not. I tried checking the resources with the 
failcount command, but it's always 0. And the crm_resource command is 
used to configure a resource but does not seem to give me the status of 
a resource.

I know I can use crm_mon but I would rather have a small command since I 
could include this in our monitoring tool (nagios).

Thanks in advance,

Tobi


Re: [Linux-HA] Resources get restarted when a node joins the cluster

2009-05-29 Thread Tobias Appel
Well, exactly what I expected happened!
I set the 2nd node to standby - it had no resources running. We stopped 
Heartbeat on the 2nd node and did some maintenance. When we started 
Heartbeat again, it joined the cluster as Online-standby, and guess what!
The resources on node 01 were getting stopped and restarted by heartbeat!

Now why the hell did heartbeat do this and how can I stop heartbeat from 
doing this in the future?

Another very weird thing was that it did not stop all the resources.
We have configured one resource group only, containing 6 resources in 
the following order:
mount filesystem
virtual ip
afd
cups
nfs
mailto notification

it stopped the mailto and tried to stop NFS, which failed since NFS was 
in use; instead of going into an unmanaged state, heartbeat just left 
NFS running and started mailto again.
No error was shown in crm_mon and, luckily for us, the cluster kept 
running. But we did get 2 emails from mailto.

Now why did Heartbeat behave like this? We even had a constraint in 
place which forces the resource group onto node 01 (score INFINITY).

If anyone can shed any light on this matter, please do. This is 
essential for me.

Regards,
Tobi


Andrew Beekhof wrote:
> On Tue, May 26, 2009 at 2:56 PM, Tobias Appel  wrote:
>> Hi,
>>
>> In the past sometimes the following happened on my Heartbeat 2.1.4 cluster:
>>
>> 2-Node Cluster, all resources run one node - no location constraints
>> Now I restarted the "standby" node (which had no resources running but
>> was still active inside the cluster).
>> When it came back online and joined the cluster again 3 different
>> scenarios happened:
>>
>> 1. all resources failed over to the newly joined node
>> 2. all resources stay on the current node but get restarted!
> 
> Usually 1 and 2 occur when services are started by the node when it
> boots up (ie. not by the cluster).
> The cluster then detects this, stops them everywhere and starts them
> on just one node.
> 
> Cluster resources must never be started automatically by the node at boot 
> time.
> 
>> 3. nothing happens
>>
>> Now I don't know why 1. or 2. happen but I remember seeing a mail on the
>> mailing list from someone with a similar problem. Is there any way to
>> make sure heartbeat does NOT touch the resources, especially not
>> restarting or re-locating them?
>>
>> Thanks in advance,
>> Tobi


Re: [Linux-HA] Resources get restarted when a node joins the cluster

2009-05-28 Thread Tobias Appel
Thanks Andrew, I'll double check that nothing gets started automatically.

Wish me luck :)
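One way to do that double-check on a RHEL-style system (a sketch; the service names are just examples from our resource group):

```shell
# Show whether init scripts for cluster-managed services run at boot;
# anything reported "on" should be disabled so only Heartbeat starts it
for svc in nfs cups; do
    chkconfig --list "$svc"
    # to disable:  chkconfig "$svc" off
done
```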

Andrew Beekhof wrote:
> On Tue, May 26, 2009 at 2:56 PM, Tobias Appel  wrote:
>> Hi,
>>
>> In the past sometimes the following happened on my Heartbeat 2.1.4 cluster:
>>
>> 2-Node Cluster, all resources run one node - no location constraints
>> Now I restarted the "standby" node (which had no resources running but
>> was still active inside the cluster).
>> When it came back online and joined the cluster again 3 different
>> scenarios happened:
>>
>> 1. all resources failed over to the newly joined node
>> 2. all resources stay on the current node but get restarted!
> 
> Usually 1 and 2 occur when services are started by the node when it
> boots up (ie. not by the cluster).
> The cluster then detects this, stops them everywhere and starts them
> on just one node.
> 
> Cluster resources must never be started automatically by the node at boot 
> time.
> 




[Linux-HA] Resources get restarted when a node joins the cluster

2009-05-26 Thread Tobias Appel
Hi,

In the past, the following sometimes happened on my Heartbeat 2.1.4 cluster:

2-node cluster, all resources run on one node - no location constraints.
Now I restarted the "standby" node (which had no resources running but 
was still active inside the cluster).
When it came back online and joined the cluster again 3 different 
scenarios happened:

1. all resources failed over to the newly joined node
2. all resources stay on the current node but get restarted!
3. nothing happens

Now I don't know why 1 or 2 happen, but I remember seeing a mail on the 
mailing list from someone with a similar problem. Is there any way to 
make sure heartbeat does NOT touch the resources - especially not 
restart or relocate them?

Thanks in advance,
Tobi


Re: [Linux-HA] How to make LSB script Heartbeat compliant

2009-05-11 Thread Tobias Appel
Thanks - that was exactly the website I was looking for. Going to save 
it now right away :)

Michael Schwartzkopff wrote:

> 
> http://www.linux-ha.org/LSBResourceAgent
> 
> 



[Linux-HA] How to make LSB script Heartbeat compliant

2009-05-11 Thread Tobias Appel
Hi,

I know there was a document on linux-ha.org, but I just can't find it 
anymore.
It showed what the exit codes should be and how to run scripts from the 
shell and print their exit status to make them heartbeat compliant.
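From memory, the key part was the status exit codes; a minimal sketch of the convention (0 = running, 3 = stopped; the pidfile path is an example):

```shell
# LSB-style status(): heartbeat treats exit 0 as "running", 3 as "stopped"
status() {
    if [ -f "$1" ] && kill -0 "$(cat "$1")" 2>/dev/null; then
        return 0   # running
    else
        return 3   # stopped / not running
    fi
}

status /var/run/surely-not-there.pid
echo "status exit code: $?"
```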

Does anyone have a working link for this page? I really should have 
copied it earlier :(

Bye,
Tobias


Re: [Linux-HA] Broken web site documentation

2009-04-28 Thread Tobias Appel
I don't know why it's offline, but why don't you use Pacemaker?
It has really good documentation as well:

http://www.clusterlabs.org/wiki/Documentation


Philippe Meloche wrote:
> Hi,
> 
> I just started using the Heartbeat V2 project and suddenly, the web pages 
> about CIB configuration are gone.
> When I access http://www.linux-ha.org/ClusterInformationBase/UserGuide, I get 
> a page not found.
> 
> Does anybody know why these pages went offline?
> 
> Regards,
> 
> Philippe


[Linux-HA] Nodes shooting each other at the same time

2009-02-13 Thread Tobias Appel
Well, I have configured STONITH external/ipmi and have 2 resources set
up, with constraints to run each resource on its corresponding node, but
when I pull the cross-over cable between the nodes they shoot each other
and both power off.
Is this how it is intended?

I mean, if I delete one of those stonith resources then only one machine
can be powered off, and that's not really helpful. I guess a 2-node
cluster just doesn't work here, but then it should have been noted
somewhere in the documentation!

Bye,
Tobi



[Linux-HA] General STONITH resource configuration question

2009-02-10 Thread Tobias Appel
Hi,

sorry for so many posts about STONITH but the documentation (at least on
linux-ha.org) is somewhat lacking in this department.

Just to get it right: I'm using IPMI as the STONITH device. I have to 
set up two resources in a 2-node cluster and add constraints to force 
each resource to run on a separate node.

Is this approach correct? Anything else I need to worry about when
creating the constraints / resources? 
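To make the question concrete, what I have in mind for one of the two resources is roughly this (an untested sketch in 2.1.4-era CIB XML; all ids, node names, addresses and credentials are placeholders, and the second resource would be the mirror image, banned from the other node):

```xml
<!-- STONITH resource that can power-cycle node1; run it anywhere but node1 -->
<primitive id="st_node1" class="stonith" type="external/ipmi" provider="heartbeat">
  <instance_attributes id="st_node1_ia">
    <attributes>
      <nvpair id="st_node1_host" name="hostname" value="node1"/>
      <nvpair id="st_node1_ip"   name="ipaddr"   value="10.0.0.11"/>
      <nvpair id="st_node1_user" name="userid"   value="admin"/>
      <nvpair id="st_node1_pw"   name="passwd"   value="secret"/>
    </attributes>
  </instance_attributes>
</primitive>
<!-- keep the resource off the node it is responsible for shooting -->
<rsc_location id="st_node1_not_on_node1" rsc="st_node1">
  <rule id="st_node1_rule" score="-INFINITY">
    <expression id="st_node1_expr" attribute="#uname" operation="eq" value="node1"/>
  </rule>
</rsc_location>
```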

Thanks in advance,
Tobias

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] How to set up 2 PDU with 2 power circuits?

2009-02-10 Thread Tobias Appel
On Mon, 2009-02-09 at 15:25 +0100, Dejan Muhamedagic wrote:
> Hi,
> 
> On Mon, Feb 09, 2009 at 01:00:39PM +0100, Tobias Appel wrote:
> > Hi,
> > 
> > thanks to Dominik (he pointed me in the right direction) I want to order
> > two PDUs from APC. 
> > 
> > Right now our 2-node cluster is set up like this:
> > http://home.in.tum.de/~appel/eso/cluster.jpg
> > 
> > Is it possible to place each PDU on each power circuit and connect each
> > node to both at the same time? STONITH would have to log in to two
> > different PDU and turn off the corresponding port. 
> 
> No, that's not possible. But thanks for bringing this up.
> 
> > If it's not possible I will just hook up each server to one PDU.
> 
> If you have such an elaborate power scheme, you could use the
> lights-out devices, since it's not very probable that nodes will
> stay out of power.

thanks Dejan! 
You pointed me in the right direction! I totally forgot about the ILOM
Management port on the SUN Servers. This should be IPMI compatible. 

I will give it a try!

> 
> Thanks,
> 
> Dejan
> 
> > thanks in advance,
> > 
> > Tobias
> > 
> > ___
> > Linux-HA mailing list
> > Linux-HA@lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Anyone ever used STONITH with Rittal CMC-TC ?

2009-02-10 Thread Tobias Appel
Hi,

I was just wondering if anyone used STONITH agents with products from
Rittal? Namely the CMC-TC, which supports programmable power supply units,
similar to those from APC.

If anyone used these products and can let me know if there are any known
problems I'd appreciate it.

Thanks,

Tobias

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] How to set up 2 PDU with 2 power circuits?

2009-02-09 Thread Tobias Appel
Hi,

thanks to Dominik (he pointed me in the right direction) I want to order
two PDUs from APC. 

Right now our 2-node cluster is set up like this:
http://home.in.tum.de/~appel/eso/cluster.jpg

Is it possible to place each PDU on each power circuit and connect each
node to both at the same time? STONITH would have to log in to two
different PDU and turn off the corresponding port. 

If it's not possible I will just hook up each server to one PDU.

thanks in advance,

Tobias

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Nodes shooting each other over and over (STONITH)

2009-02-09 Thread Tobias Appel
Hi,

I've configured stonith/ssh on my 2-node cluster (Heartbeat 2.1.4). This
is still in testing since I haven't happen to find a good STONITH device
yet.
Anyway, I ran some tests this morning and pulled the cross-over cable.
After a few seconds one node rebooted the other one. It came back and
could not connect to the cluster so it rebooted the active one. Once
this came back on, it did the same thing again. Both nodes kept
rebooting each other over and over.

Is this a configuration issue or is it working like intended? If it's
the latter one, how is it handled when using something like a
programmable PDU from APC?

If it's a configuration issue, I will have a closer look at my
configuration and post it here or in IRC.

Thanks in advance,

Tobias

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] STONITH / ssh not working in a 2-node cluster

2009-02-06 Thread Tobias Appel
On Fri, 2009-02-06 at 14:52 +0100, Dominik Klein wrote:
> Just guessing, but your cluster does not know about those "external"
> addresses, does it?
> 
Well, I have an internal IP added to /etc/hosts and a shortcut for the
hostname, like nagios1 192.168.0.1, and then there is the full domain
name like nagios1.hq.xxx.xxx 134.xxx.xxx.xxx and so on. 

Usually the cluster communication uses eth1 and the private IP addresses
only! So when I pull that cable, network connectivity is still there, but
there is no communication between the cluster nodes unless you use the
external IP address. 

So I gave stonith_ssh the full domain name as its hostlist.
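The resource definition itself looks roughly like this (a sketch only; the ids are shortened here, the hostlist value is the one mentioned above):

```xml
<primitive id="stonith_ssh" class="stonith" type="ssh" provider="heartbeat">
  <instance_attributes id="stonith_ssh_ia">
    <attributes>
      <!-- hostlist: space-separated names this agent is allowed to shoot -->
      <nvpair id="stonith_ssh_hosts" name="hostlist" value="nagios1.hq.xxx"/>
    </attributes>
  </instance_attributes>
</primitive>
```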

> Sounds like you're only using one connection between the nodes for
> cluster communication, pull that, see split-brain and want the cluster
> to use a connection it does not know about.

True, the cluster doesn't really know about the other connection, but as
long as it uses the full domain name it should be able to connect to the
other node using the external IP - at least when I do this manually it
works.

In the logfiles it looks like they are trying to shoot each other, but like
I said nothing is happening; at the same time only one node had
active resources - before implementing stonith_ssh both nodes would have
run all resources at the same time.

And yeah, atd is running (checked with ps aux | grep atd).


> But even if you configured the cluster to also use the "external" path,
> you'd have to unplug both to force them into a split-brain scenario, and
> ssh stonith wouldn't help again (no usable, known connection left).
> 
> That's exactly why ssh stonith is bad.
> 
> The only way stonith ssh "works" is for cluster software failure (like
> pkill -9 heartbeat).
> 
OK, that makes some sense, but still - it should work in a 2-node setup,
and it shouldn't be waiting for more members to vote?

> Regards
> Dominik
> 
> Tobias Appel wrote:
> > Hi,
> > 
> > I know I should use it only for development purposes and that's what I
> > am doing right now.
> > 
> > I have configured my cluster and a resource stonith_ssh. I configured
> > stonith_ssh to use the external IP address to connect to the other
> > server. When I ran: 'stonith -t ssh -p "nagios1.hq.xxx" -T reset
> > nagios1.hq.xxx' it worked. The other node rebooted immediately. 
> > 
> > But when I pulled the cross-over cable the results were weird. Each node
> > had the same entry in the logfile which looked like this:
> > 
> > Feb  6 12:57:59 nagios2 tengine: [27786]: info: te_fence_node: Executing
> > reboot fencing operation (63) on nagios1 (timeout=3)
> > Feb  6 12:57:59 nagios2 stonithd: [27368]: info: client tengine [pid:
> > 27786] want a STONITH operation RESET to node nagios1.
> > Feb  6 12:57:59 nagios2 stonithd: [27368]: info: Broadcasting the
> > message succeeded: require others to stonith node nagios1.
> > 
> > Then later it said:
> > Feb  6 12:58:29 nagios2 stonithd: [27368]: ERROR: Failed to STONITH the
> > node nagios1: optype=RESET, op_result=TIMEOUT
> > 
> > I'm not sure why it did not reboot the other machine. Does 'require
> > others' mean that it won't work in a 2-node setup? 
> > 
> > The really weird part is that even though no node was rebooted, only one
> > node continued to run the resources, whereas the other was just doing
> > nothing at all. When I reconnected the cross-over cable both nodes went
> > into 'standby' state for a couple of seconds but everything was working
> > fine.
> > 
> > Regards,
> > Tobi
> > 
> > ___
> > Linux-HA mailing list
> > Linux-HA@lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> > 
> 
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] STONITH / ssh not working in a 2-node cluster

2009-02-06 Thread Tobias Appel
Hi,

I know I should use it only for development purposes and that's what I
am doing right now.

I have configured my cluster and a resource stonith_ssh. I configured
stonith_ssh to use the external IP address to connect to the other
server. When I ran: 'stonith -t ssh -p "nagios1.hq.xxx" -T reset
nagios1.hq.xxx' it worked. The other node rebooted immediately. 

But when I pulled the cross-over cable the results were weird. Each node
had the same entry in the logfile which looked like this:

Feb  6 12:57:59 nagios2 tengine: [27786]: info: te_fence_node: Executing
reboot fencing operation (63) on nagios1 (timeout=3)
Feb  6 12:57:59 nagios2 stonithd: [27368]: info: client tengine [pid:
27786] want a STONITH operation RESET to node nagios1.
Feb  6 12:57:59 nagios2 stonithd: [27368]: info: Broadcasting the
message succeeded: require others to stonith node nagios1.

Then later it said:
Feb  6 12:58:29 nagios2 stonithd: [27368]: ERROR: Failed to STONITH the
node nagios1: optype=RESET, op_result=TIMEOUT

I'm not sure why it did not reboot the other machine. Does 'require
others' mean that it won't work in a 2-node setup? 

The really weird part is that even though no node was rebooted, only one
node continued to run the resources, whereas the other was just doing
nothing at all. When I reconnected the cross-over cable both nodes went
into 'standby' state for a couple of seconds but everything was working
fine.

Regards,
Tobi

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Looking for efficient STONITH device - must not be single-point-of-failure

2009-02-06 Thread Tobias Appel
On Fri, 2009-02-06 at 13:28 +0100, Michael Schwartzkopff wrote:
> On Friday, 6 February 2009 12:53, Tobias Appel wrote:
> > Hi all,
> >
> > well to be honest I don't know that much about STONITH yet, but I know
> > what it's supposed to do.
> > In our setup we have only 2 nodes, running Heartbeat 2.1.4, connected via
> > a cross-over cable (side question: is it possible to use more than one
> > NIC to connect the two servers, so that when one fails a backup NIC is
> > always available?).
> 
> Yes. You also could use bonging. Good for DRBDs.
> 
> Michael.

You did mean bonding, right?
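(For the archive: ha.cf itself can also take more than one heartbeat medium, with no bonding device required. The interface names below are only examples:)

```text
# /etc/ha.d/ha.cf - two independent heartbeat paths
bcast eth1
bcast eth2
# or, equivalently, on one line:
# bcast eth1 eth2

# a serial link also makes a cheap second path:
# serial /dev/ttyS0
```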

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Looking for efficient STONITH device - must not be single-point-of-failure

2009-02-06 Thread Tobias Appel
Hi all,

well to be honest I don't know that much about STONITH yet, but I know
what it's supposed to do. 
In our setup we have only 2 nodes, running Heartbeat 2.1.4, connected via
a cross-over cable (side question: is it possible to use more than one
NIC to connect the two servers, so that when one fails a backup NIC is
always available?).

Right now when the link is severed we have a split-brain state, since
each node thinks that the other one is dead.

Now I did some research about STONITH devices, but there is not that
much information. I know that we could use a UPS from APC, for example:
if a split-brain occurs, each server tries to telnet into the UPS and
turn off the power for the other machine.

Now my boss doesn't really like this idea, because if we connect both
servers to the same UPS we have a single point of failure again. Is
there any other solution for a STONITH device which is not a SPOF?

Regards,
Tobi

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Alternative to monitor network instead of using pingd?

2009-02-02 Thread Tobias Appel
On Mon, 2009-02-02 at 13:45 +0100, Michael Schwartzkopff wrote:

> 
> Since you have only one ping node make DRBD run on a node with pingd points:
> 
>   
>attribute="pingd" operation="not_defined"/>
>attribute="pingd" operation="lte" value="0"/>
>   
> 
> 
> in this case it doesn't matter if your DRBD is master or slave.
> 
> Please have a look at the real score calculation with the showscores script.

Oh my god! This looks promising - it looks nothing like any of the
examples I have been looking at. And what a surprise it worked! It
stopped the DRBD service as well and the failover worked within seconds!

Thank you so much. I would have never been able to write a constraint
like this myself!
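For anyone finding this in the archive: with the XML tags that the archiver stripped restored, Michael's constraint was presumably along these lines - a rule that bans ms_drbd from any node where the pingd attribute is undefined or zero (ids are placeholders):

```xml
<rsc_location id="ms_drbd_connected" rsc="ms_drbd">
  <rule id="ms_drbd_connected_rule" score="-INFINITY" boolean_op="or">
    <expression id="pingd_undef" attribute="pingd" operation="not_defined"/>
    <expression id="pingd_zero"  attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>
```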

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Alternative to monitor network instead of using pingd?

2009-02-02 Thread Tobias Appel
On Mon, 2009-02-02 at 13:16 +0100, Michael Schwartzkopff wrote:
> On Monday, 2 February 2009 12:57:07, Tobias Appel wrote:
> (...)
> > It's 90% configured by hand through XML files. I would need a constraint
> > that kills the resource group, demotes DRBD, promotes the other node
> > to DRBD master and then starts the resource group over there - but this
> > seems to be not possible with heartbeat 2.1.4. I really tried every
> > combination of constraint for pingd, but it just does not work the way it
> > has to be.
> > Either it tries to promote the other node to DRBD master while the
> > current node still runs all the resource and returns an error or it does
> > nothing at all! It does not even stop the resource group.
> >
> > I've put one year work into this project and now I fail on the last and
> > final step to making this HA cluster. I can't tell my boss that it will
> > failover in 90% of the scenarios but not when the ethernet connection is
> > down.
> 
> hi,
> 
> That setup is quite typical. I did this in a lot of scenarios and there 
> should not be any problems. If you send your CIB, perhaps we can help.
> 
I posted it in the IRC channel already; Andrew told me that my
constraint is faulty - well, I expected that. 
I then read the documentation from Pacemaker. Andrew really did a
wonderful job putting those PDFs together! There are a lot of examples
for pingd, but nothing worked out for me.
Here is my complete cib.xml: http://pastebin.com/d58a375a1


All examples I have found either have only one drbd resource or one
resource group without drbd.



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Alternative to monitor network instead of using pingd?

2009-02-02 Thread Tobias Appel
On Mon, 2009-02-02 at 12:05 +0100, Michael Schwartzkopff wrote:
> On Monday, 2 February 2009 11:15:35, Tobias Appel wrote:
> > Well I've given up on pingd, I just can't get it to work with DRBD and a
> > resource group. The constraints do shit and the cluster does nothing if
> > I pull the ethernet cable.
> > Can I do anything else to get it work? Maybe use STONITH to kill the
> > node which ethernet's cable has been pulled out? Or have it kill
> > itself.
> >
> > I'm really going crazy over this. My last resort would be to use another
> > software which monitors the network interface and which then would stop
> > the heartbeat service, but maybe, just maybe there is a simple solution
> > somewhere to be found within heartbeat.
> >
> > Best Regards,
> > Tobi
> 
> hi,
> 
> what time did you wait until you want the cluster to react?
I gave it a couple of minutes, and I saw in the log files that it tried
to make the other node DRBD master, but it said there can only be one
master (of course - that's how the M/S resource has been configured).
> 
> deadping option of ha.cf is 20 secs by default. Add 5 secs for the dampening and 
> you get in the range of 30 secs. Did you wait that long? If you want faster 
> reaction -> change deadping in ha.cf
> 
> Does the logfile show the changes? What does
> cibadmin -Q -o status | grep pingd
> say on BOTH nodes?
> 
It registers that the ping group is dead. Now I changed my constraint
because Andrew gave a hint (he has some nice examples in the Pacemaker
configuration which should work for heartbeat 2.1.4 as well).
I know that my constraint is wrong, but I just can't get it to work! I
tried all possible combinations now; I tried every example from your
book and from Andrew's documentation - of course, none of these scenarios
is similar to mine - maybe it's just not possible to get it working in
my scenario. 

> In my setups I saw that DRBD sometimes reacts quite slow. Just configure 
> everything with the CLI. That prevents from errors people make clicking too 
> fast in GUIs ;-)

It's 90% configured by hand through XML files. I would need a constraint
that kills the resource group, demotes DRBD, promotes the other node
to DRBD master and then starts the resource group over there - but this
seems to be not possible with heartbeat 2.1.4. I really tried every
combination of constraint for pingd, but it just does not work the way it
has to be.
Either it tries to promote the other node to DRBD master while the
current node still runs all the resource and returns an error or it does
nothing at all! It does not even stop the resource group.

I've put one year work into this project and now I fail on the last and
final step to making this HA cluster. I can't tell my boss that it will
failover in 90% of the scenarios but not when the ethernet connection is
down. 

bye,
tobi

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Alternative to monitor network instead of using pingd?

2009-02-02 Thread Tobias Appel
Well I've given up on pingd, I just can't get it to work with DRBD and a
resource group. The constraints do shit and the cluster does nothing if
I pull the ethernet cable. 
Can I do anything else to get it work? Maybe use STONITH to kill the
node which ethernet's cable has been pulled out? Or have it kill
itself. 

I'm really going crazy over this. My last resort would be to use another
software which monitors the network interface and which then would stop
the heartbeat service, but maybe, just maybe there is a simple solution
somewhere to be found within heartbeat.

Best Regards,
Tobi

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Can not add notes to cluster

2009-02-02 Thread Tobias Appel
Hi Christian,

the nodes are only configured in the ha.cf file.
Be sure to have the /etc/hosts file configured correctly, and I don't see
any broadcast interface in your ha.cf? Maybe that's why it can't find
the other node. Have a look at the sample configuration (usually found
in /usr/share/doc/packages/heartbeat/ha.cf).

Looks something like this (you can decide to use unicast, broadcast,
multicast or serial for connection between the nodes):

#       What UDP port to use for bcast/ucast communication?
#
#udpport        694
#
#       Baud rate for serial ports...
#
#baud   19200
#
#       serial  serialportname ...
#serial /dev/ttyS0      # Linux
#serial /dev/cuaa0      # FreeBSD
#serial /dev/cuad0      # FreeBSD 6.x
#serial /dev/cua/a      # Solaris
#
#
#       What interfaces to broadcast heartbeats over?
#
bcast   eth1            # Linux
#bcast  eth1 eth2       # Linux
#bcast  le0             # Solaris
#bcast  le1 le2         # Solaris
#
#       Set up a multicast heartbeat medium
#       mcast [dev] [mcast group] [port] [ttl] [loop]
#
#       [dev]           device to send/rcv heartbeats on
#       [mcast group]   multicast group to join (class D multicast address
#                       224.0.0.0 - 239.255.255.255)
#       [port]          udp port to sendto/rcvfrom (set this value to the
#                       same value as "udpport" above)
#       [ttl]           the ttl value for outbound heartbeats.  this effects
#                       how far the multicast packet will propagate.  (0-255)
#                       Must be greater than zero.
#       [loop]          toggles loopback for outbound multicast heartbeats.
#                       if enabled, an outbound packet will be looped back and
#                       received by the interface it was sent on. (0 or 1)
#                       Set this value to zero.
#
#
#mcast eth0 225.0.0.1 694 1 0
#
#       Set up a unicast / udp heartbeat medium
#       ucast [dev] [peer-ip-addr]
#
#       [dev]           device to send/rcv heartbeats on
#       [peer-ip-addr]  IP address of peer to send packets to
#
#ucast eth0 192.168.1.2
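A matching /etc/hosts (identical on both nodes - the addresses here are only examples, except the ping address taken from your ha.cf) would look like:

```text
192.168.1.1     amd64
192.168.1.2     cs
192.168.1.100   pingnode    # the address used for "ping" in ha.cf
```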



Regards,
Tobi




On Sat, 2009-01-31 at 16:14 +0100, Christian Schoepplein wrote:
> Hi,
> 
> I try to build a 2 node cluster with heartbeat 2.1.4-4 on Debian. No 
> ressources and constrains configured yet.
> 
> --- /etc/ha.d/authkeys ---
> auth 1
> 1 md5 Hello!
> -
> 
> -- /etc/ha.d/ha.cf
> logfacility local7
> logfile /var/log/ha-log
> debugfile /var/log/ha-debug
> #use_logd on
> udpport 694
> keepalive 1 # 1 second
> deadtime 10
> initdead 80
> bcast eth0
> node amd64
> node cs
> ping 192.168.1.100
> crm yes
> auto_failback yes
> autojoin any
> ---
> 
> The files above reside on both nodes amd64 and cs.
> 
> But on every node only the local machine shows up as node in the 
> cluster, they can't find each other. From the logs on amd64:
> 
> crmd[11414]: 2009/01/31_15:58:37 WARN: get_uuid: Could not calculate 
> UUID for cs
> crmd[11414]: 2009/01/31_15:58:37 WARN: populate_cib_nodes: Node cs: no 
> uuid found
> crmd[11414]: 2009/01/31_15:58:38 notice: populate_cib_nodes: Node: amd64 
> (uuid: ee86bba2-24b3-4b21-9cbd-1a0a6c75bdaa)
> crmd[11414]: 2009/01/31_15:58:38 info: do_ha_control: Connected to 
> Heartbeat
> crmd[11414]: 2009/01/31_15:58:38 info: do_ccm_control: CCM connection 
> established... waiting for first callback
> crmd[11414]: 2009/01/31_15:58:38 info: do_started: Delaying start, CCM 
> (0010) not connected
> crmd[11414]: 2009/01/31_15:58:38 info: crmd_init: Starting crmd's 
> mainloop
> crmd[11414]: 2009/01/31_15:58:38 notice: crmd_client_status_callback: 
> Status update: Client amd64/crmd now has status [online]
> 
> The cib.xml file looks as follows, really only one node appears :(:
> 
>  have_quorum="true" ignore_dtd="false" num_peers="1" 
> cib_feature_revision="2.0" 
> crm_feature_set="2.0" cib-last-written="Sat Jan 31 16:00:00 2009" 
> ccm_transition="1" dc_uuid="ee86bba2-24b3-4b21-9cbd-1a0a6c75bdaa">
>
>  
>
>  
> name="dc-version" value="2.1.4-node: 
> aa909246edb386137b986c5773344b98c696"/>
>  
>
>  
>  
> type="normal"/>
>  
>  
>  
>
> 
> 
> The file has been autogenerated during the startup of heartbeat.
> 
> I tryed to add the second node with the following command:
> 
> amd64:~# /usr/lib/heartbeat/hb_addnode cs
> 
> The following shows up in the log:
> 
> heartbeat[11396]: 2009/01/31_16:07:41 info: hb_add_one_node: Adding new 
> node[cs] to configuration.
> heartbeat[11396]: 2009/01/31_16:07:41 ERROR: hb_add_one_node: node(cs) 
> already exists
> heartbeat[11396]: 2009/01/31_16:07:41 ERROR: Add node cs failed
> 
> What's going wrong there? Why isn't the node found during startup, or how 
> can I add the node permanently to the cib.xml file? I can't use the GUI 
> unfortunately, so any other text-based solution would be great!
> 
> 
> Cheers and sorry for the 

Re: [Linux-HA] Can't get the ping_group to work - problem with location constraints and DRBD

2009-01-30 Thread Tobias Appel
On Fri, 2009-01-30 at 12:16 +0100, Dejan Muhamedagic wrote:
> Hi,
> 
> On Fri, Jan 30, 2009 at 11:35:54AM +0100, Tobias Appel wrote:
> > Hi,
> > 
> > I'm using Heartbeat 2.1.4 and I tried to setup a ping_group so that if I
> > pull the ethernet cables on one of the two nodes a failover will occur.
> > 
> > I have configured a ping_group group 1 in ha.cf and created a clone
> > resource with pingd.
> > 
> > The problem is, I have a resource group plus a master/slave resource for
> > DRBD. The resource group only runs on the node which is currently
> > master. Now I added another constraint for pingd, so that the failover
> > occurs, but it does not work - nothing is happening. I think it has to
> > do with the DRBD resource. Here are my current constraints:
> > 
> >  
> > > to="ms_drbd" to_action="promote"/>
> > > from="nagios" to="ms_drbd" score="INFINITY"/>
> >
> >  
> > > id="66155d1f-2210-45ad-9010-a14e48825ead" operation="defined"/>
> >  
> >
> >  
> > 
> > I'm not quite sure how to change the location constraint to the
> > following:
> > -if you are the active node and the ping group is not reachable, stop
> > resource group "nagios"
> > - promote the other node to master for ms_drbd and start resource group
> > nagios there
> 
> Add role=Master to the pingd location constraint. I think this
> was also described in the drbd howto.

I have read the DRBD HOWTOs on linux-ha.org, but they only cover the
basic setup - with a resource group, of course, but not with pingd and
DRBD and a resource group. I added role="Master" to the constraint, but
it still will not work:


 
   
 
   

Btw, I did find some useful info here (even if it is for pacemaker):
http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0
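With the tags that the archive stripped restored, the constraint with role="Master" presumably read roughly like this (ids are placeholders):

```xml
<rsc_location id="ms_drbd_connected" rsc="ms_drbd">
  <rule id="ms_drbd_connected_rule" role="Master" score="-INFINITY" boolean_op="or">
    <expression id="pingd_undef" attribute="pingd" operation="not_defined"/>
    <expression id="pingd_zero"  attribute="pingd" operation="lte" value="0"/>
  </rule>
</rsc_location>
```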

But like I said, it still does not work. Heartbeat does not stop the
resource group and does not promote the other node to master. Logfile
still says:

Jan 30 12:07:07 nagios2 pengine: [3758]: info: master_color: Promoting
resource_drbd:0 (Master nagios2)
Jan 30 12:07:07 nagios2 pengine: [3758]: info: master_color: ms_drbd:
Promoted 1 instances of a possible 1 to master
Jan 30 12:07:07 nagios2 pengine: [3758]: info: master_color: ms_drbd:
Promoted 1 instances of a possible 1 to master
Jan 30 12:07:07 nagios2 pengine: [3758]: notice: NoRoleChange: Leave
resource resource_filesys  (Started nagios2)

and so on...

One would not believe how complicated the most basic check for network
connectivity can be. *sigh*

> 
> > The logfile looks like this (especially line 3 mentions the problem with
> > DRBD master):
> > 
> > Jan 30 10:24:12 nagios2 pengine: [3758]: WARN: text2task: Unsupported
> > action: status
> 
> You should use monitor, not status. Don't know what does your
> configuration look like.
> 
I know, but I don't even know what text2task is. I don't have any
resource by that name, and Google didn't help much, except that many
people apparently have the same error message in their logfiles; I
haven't figured out what piece of software it comes from.

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] stonith suicide - still valid ?

2009-01-30 Thread Tobias Appel
Hi,

well, I'm running a 2-node cluster and I don't have a STONITH device like
a UPS or anything. The problem is, if I pull the ethernet cable which
directly connects those 2 servers, both think the other node is dead and
they both start all the resources at the same time - when I reconnect
them later I have a split-brain scenario.

Now I read in Dr. Schwartzkopff's book that the suicide script for
stonith is no longer valid because heartbeat can't shoot itself anymore.
Is this true, or can I still use it with Heartbeat 2.1.4?

Bye,
Tobi

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Can't get the ping_group to work - problem with location constraints and DRBD

2009-01-30 Thread Tobias Appel
Hi,

I'm using Heartbeat 2.1.4 and I tried to setup a ping_group so that if I
pull the ethernet cables on one of the two nodes a failover will occur.

I have configured a ping_group group 1 in ha.cf and created a clone
resource with pingd.

The problem is, I have a resource group plus a master/slave resource for
DRBD. The resource group only runs on the node which is currently
master. Now I added another constraint for pingd, so that the failover
occurs, but it does not work - nothing is happening. I think it has to
do with the DRBD resource. Here are my current constraints:

 
   
   
   
 
   
 
   
 

I'm not quite sure how to change the location constraint to the
following:
-if you are the active node and the ping group is not reachable, stop
resource group "nagios"
- promote the other node to master for ms_drbd and start resource group
nagios there
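The constraint XML pasted above was stripped of its tags by the archive; based on the attributes that survived in quoted copies of this mail, the three constraints were presumably of this shape (ids shortened, exact values unknown):

```xml
<constraints>
  <!-- start the group only after DRBD has been promoted -->
  <rsc_order id="nagios_after_drbd" from="nagios" type="after"
             to="ms_drbd" to_action="promote"/>
  <!-- keep the group on the DRBD master node -->
  <rsc_colocation id="nagios_with_drbd" from="nagios"
                  to="ms_drbd" score="INFINITY"/>
  <!-- the pingd location rule; the surviving fragment used
       operation="defined" on the pingd attribute -->
  <rsc_location id="nagios_connectivity" rsc="nagios">
    <rule id="nagios_connectivity_rule" score="INFINITY">
      <expression id="pingd_defined" attribute="pingd" operation="defined"/>
    </rule>
  </rsc_location>
</constraints>
```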

The logfile looks like this (especially line 3 mentions the problem with
DRBD master):

Jan 30 10:24:12 nagios2 pengine: [3758]: WARN: text2task: Unsupported
action: status
Jan 30 10:24:12 nagios2 pengine: [3758]: info: master_color: Promoting
resource_drbd:0 (Master nagios2)
Jan 30 10:24:12 nagios2 pengine: [3758]: info: master_color: ms_drbd:
Promoted 1 instances of a possible 1 to master
Jan 30 10:24:12 nagios2 pengine: [3758]: info: master_color: ms_drbd:
Promoted 1 instances of a possible 1 to master
Jan 30 10:24:12 nagios2 pengine: [3758]: notice: NoRoleChange: Leave
resource resource_filesys  (Started nagios2)
Jan 30 10:24:12 nagios2 pengine: [3758]: notice: NoRoleChange: Leave
resource nagios-vip(Started nagios2)
Jan 30 10:24:12 nagios2 pengine: [3758]: notice: NoRoleChange: Leave
resource resource_http (Started nagios2)
Jan 30 10:24:12 nagios2 pengine: [3758]: notice: NoRoleChange: Leave
resource resource_mysql(Started nagios2)
Jan 30 10:24:12 nagios2 pengine: [3758]: notice: NoRoleChange: Leave
resource nagios-core   (Started nagios2)
Jan 30 10:24:12 nagios2 pengine: [3758]: WARN: text2task: Unsupported
action: status
Jan 30 10:24:12 nagios2 pengine: [3758]: notice: NoRoleChange: Leave
resource MailNotify(Started nagios2)
Jan 30 10:24:12 nagios2 pengine: [3758]: notice: NoRoleChange: Leave
resource resource_drbd:0   (Master nagios2)
Jan 30 10:24:12 nagios2 pengine: [3758]: notice: NoRoleChange: Leave
resource resource_drbd:1   (Slave nagios1)
Jan 30 10:24:12 nagios2 pengine: [3758]: notice: NoRoleChange: Leave
resource resource_drbd:0   (Master nagios2)
Jan 30 10:24:12 nagios2 pengine: [3758]: notice: NoRoleChange: Leave
resource resource_drbd:1   (Slave nagios1)
Jan 30 10:24:12 nagios2 pengine: [3758]: notice: NoRoleChange: Leave
resource resource_pingd:0  (Started nagios1)
Jan 30 10:24:12 nagios2 pengine: [3758]: notice: NoRoleChange: Leave
resource resource_pingd:1  (Started nagios2)


On a side note, I wonder what text2task is... this error appears frequently
in my /var/log/messages...


any help is appreciated.

bye,
tobi

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] cibadmin won't parse input file

2009-01-30 Thread Tobias Appel
On Mon, 2009-01-26 at 13:12 -0500, Karl W. Lewis wrote:
> On Mon, Jan 26, 2009 at 9:24 AM, Dominik Klein  wrote:
> 
> > Tobias Appel wrote:
> > > Hi,
> > >
> > > cibadmin just won't parse my input file. I've rewritten it twice now and
> > > can't spot the error - maybe I haven't had enough coffee yet, but this
> > > feels like one of those games where you have two pictures and need to
> > > spot the differences... sigh
> > >
> > > my xml looks like this:
> > >
> > > 
> > >  
> > >
> >
> > closing ' vs. opening "
> >
> > >  
> > > 
> > >
> > >
> > > who finds the error gets a cookie.
> >
> > Regards
> > Dominik

*hands out a cookie* Thanks a lot - I really hadn't had enough coffee
that day.
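For future readers: a cheap way to catch this class of quoting bug before cibadmin ever sees the file is xmllint (from libxml2, assuming it is installed); the nvpair below is a made-up example of the closing-' vs. opening-" mistake:

```shell
# An attribute opened with " but closed with ' - not well-formed XML
cat > bad.xml <<'EOF'
<nvpair id="opt1" name="target_role" value="started'/>
EOF

# The corrected version
cat > good.xml <<'EOF'
<nvpair id="opt1" name="target_role" value="started"/>
EOF

xmllint --noout bad.xml  && echo "bad.xml: well-formed" || echo "bad.xml: rejected"
xmllint --noout good.xml && echo "good.xml: well-formed" || echo "good.xml: rejected"
```

xmllint rejects the first file and accepts the second, which is exactly the kind of report cibadmin's terse parse error does not give you.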

> 
> 
> Well, if no one else is going to say it, I will.  That, sir, is a nice
> catch.  Even after reading your response it took me some time to find it.
> 
> KWL


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] resource_stickiness and groups - how it is calculated?

2009-01-26 Thread Tobias Appel
On Mon, 2009-01-26 at 13:04 +0100, Andrew Beekhof wrote:
> On Mon, Jan 26, 2009 at 12:10, Tobias Appel  wrote:
> > Well I've got a lot of questions today as you can see :)
> >
> > I have a group of resources which is ordered and colocated (due to drbd
> > master / slave constraint). I added a monitor operation to nearly all
> > members of this group with a on_fail restart setting.
> > Now I'm wondering do I have to add resource_stickiness for each resource
> > or just for the group?
> > And how is it calculated if I add it to the group. Right now I have a
> > resource_stickiness with a value of 100 and a failure_stickiness
> 
> Don't go there.  Seriously.  We made a big mistake implementing it like that.
> Get the latest version of Pacemaker, look up migration-threshold, and
> pretend you've never heard the phrase "failure stickiness".

Well, now you've got me. I'm running Heartbeat 2.1.4. I finally
understood most of the configuration, but I did try to move to Pacemaker
once, and it was horrible because I could not even get a single resource
to work. I really don't want to upgrade now, unless you are telling me
that it will not work with my current version. The link Michael sent me
made sense to me.
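For anyone who does upgrade later: the Pacemaker replacement Andrew is
pointing at looks roughly like this in crm shell syntax (resource name
and values illustrative; this is not valid on Heartbeat 2.1.4):

```
primitive nagios-core lsb:nagios \
    op monitor interval="30s" on-fail="restart" \
    meta migration-threshold="3" failure-timeout="60s"
```

With migration-threshold, the resource is simply moved away from a node
after that many failures there, instead of juggling negative stickiness
scores.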



Re: [Linux-HA] Monitor Operation should restart but resource goes into failed state?

2009-01-26 Thread Tobias Appel
On Mon, 2009-01-26 at 13:08 +0100, Michael Schwartzkopff wrote:

> 
> Did you check if your lsb RA (aka init script) behaves like defined in the 
> LSB?
> See:
> http://www.linux-ha.org/LSBResourceAgent
> 
> MOST (!) of the scripts provided by the distributions do not behave as defined.
> 
> 
Ah I see!

This is quite an important link. I did a quick test with Nagios and when
it is stopped and I run the script with status it returns:

[r...@nagios2 ~]# /etc/init.d/nagios status ; echo "result: $?"
No lock file found in /usr/local/nagios/var/nagios.lock
result: 1

But heartbeat is expecting result: 3! 

Well that explains a lot. I will get right on it!
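One way to get right on it without patching the vendor script is a thin
wrapper whose status action remaps the non-LSB exit code; a sketch (the
nagios path is illustrative):

```shell
#!/bin/sh
# Idea for a thin wrapper: LSB requires the status action to exit 3 when
# the program is not running, but this init script exits 1. Remap only
# that case and pass every other code through unchanged.

map_status_code() {
    # $1 = raw exit code from the distro init script's status action
    if [ "$1" -eq 1 ]; then
        echo 3      # LSB: "program is not running"
    else
        echo "$1"
    fi
}

# In the wrapper's status branch this would be used roughly as:
#   /etc/init.d/nagios status >/dev/null 2>&1
#   exit "$(map_status_code $?)"
```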

Thanks again!



Re: [Linux-HA] cibadmin won't parse input file

2009-01-26 Thread Tobias Appel
On Mon, 2009-01-26 at 13:01 +0100, Andrew Beekhof wrote:
> what version of the software are you running, what does the rest of
> the config look like and what was the error

I'm running heartbeat 2.1.4 and this is the only part of the xml file. I
tried to create this new location constraint so that my drbd master gets
relocated to another node should pingd fail (I have another constraint
which lets my resource group run only where the DRBD master is located,
so that should fail over as well).

So I created the xml file below and tried to: 'cibadmin -C -x loc.xml'
the output is: 'Could not parse input file: loc.xml' 
So I'm guessing there is a typo somewhere. Of course the file is
readable for every user. 

p.s. sorry for replying directly to you Andrew.
> 
> On Mon, Jan 26, 2009 at 11:59, Tobias Appel  wrote:

> > my xml looks like this:
> >
> > 
> >  
> >   
> >  
> > 
> >
> >
> > who finds the error gets a cookie.
> >
> > Thanks,
> > Tobi
> >



[Linux-HA] Monitor Operation should restart but resource goes into failed state?

2009-01-26 Thread Tobias Appel
Hi,

I've added Monitor Operations to most of my resources and status
operations to the ones I only have a lsb script for.

I then stopped the resource not via the cluster but just via its init
script. I thought heartbeat would try to restart the resource, but
instead it said: (unmanaged) failed and did nothing.

I tried the same for apache webserver which runs via ocf script, here it
also went into a failed state and did not restart but it was not
unmanaged.

I manually had to intervene then and cleanup the resource before it
worked again.

Here are parts of my resource configuration in the cib, where did I mess
up? On another machine I did exactly the same and iirc it worked.

 
   
 
   
   
 
   
   
 
   
 

 
   
 
   
 


Bye,
Tobi



[Linux-HA] resource_stickiness and groups - how it is calculated?

2009-01-26 Thread Tobias Appel
Well I've got a lot of questions today as you can see :)

I have a group of resources which is ordered and colocated (due to drbd
master / slave constraint). I added a monitor operation to nearly all
members of this group with a on_fail restart setting. 
Now I'm wondering do I have to add resource_stickiness for each resource
or just for the group?
And how is it calculated if I add it to the group. Right now I have a
resource_stickiness with a value of 100 and a failure_stickiness with a
value of -30 for the whole group. But if one resource fails I'm not sure
if only this particular resource is restarted or all members of the
group (thus multiplying the failure_stickiness by the amount of members
in the group).

Thanks in advance for any input.

Regards,
Tobi



[Linux-HA] cibadmin won't parse input file

2009-01-26 Thread Tobias Appel
Hi,

cibadmin just won't parse my input file, I've rewritten it twice now and
can't spot the error - maybe I haven't had enough coffee yet but this
feels like one of those games where you have two pictures and need to
spot the errors...sigh

my xml looks like this:


 
   
 



who finds the error gets a cookie.

Thanks,
Tobi



Re: [Linux-HA] Short question about monitor operation

2009-01-26 Thread Tobias Appel
On Mon, 2009-01-26 at 11:06 +0100, Michael Schwartzkopff wrote:

> > Now I'm not sure if I still need a monitor operation. Does the ocf
> > script monitor automatically?
> No.
> 
> > With a lsb script I always added a status operation. So do I have to add
> > a monitor operation now?
> Yes. Heartbeat is even clever enough to translate a monitor operation
> for an lsb resource into a status call on the RA.
> 
> > What about the ocf script for mysql? I've put a test user and password
> > there, do I still need to add an operation?
> Yes. Only operations update the resource status in heartbeat; if
> heartbeat never learns that a resource failed, it cannot react.
> On the other hand: a failed monitor operation tells heartbeat to do
> something with that resource, i.e. restart, fence, ... This reaction
> is up to your config.

Thanks for the clarifications. I'm going to add some monitor operations
then :)




[Linux-HA] Short question about monitor operation

2009-01-26 Thread Tobias Appel
Hi all,

I had problems with some ocf scripts in the past, so I used the lsb
script for apache. Today I thought I would give it another try, and it
worked perfectly on my test machine; I had a bit of trouble on another
machine even though it was the same setup, but I still want to go with
the ocf.

Now I'm not sure if I still need a monitor operation. Does the ocf
script monitor automatically?
With a lsb script I always added a status operation. So do I have to add
a monitor operation now?

What about the ocf script for mysql? I've put a test user and password
there, do I still need to add an operation?

Thanks in advance.

Regards,
Tobi



Re: [Linux-HA] Re: Monitoring Apache

2009-01-09 Thread Tobias Appel
> Hi,
> 
> On Fri, Jan 09, 2009 at 08:47:02AM +, Thomas Mueller wrote:
> > 
> > > crmd[11077]: 2008/12/27_13:09:16 info: do_lrm_rsc_op: Performing
> > > op=WebServerApache_start_0 key=7:3:e4fc23e6-92ab-4683-99ba-df387c448e32)
> > > lrmd[11074]: 2008/12/27_13:09:16 info: rsc:WebServerApache: start
> > > apache[11106][4]: 2008/12/27_13:09:16 ERROR: Cannot parse config
> > > file [/etc/apache2/httpd.conf]
> > > 
> > > There is no /etc/apache2/httpd.conf for it to find and parse.  So that
> > > is coded somewhere, and I'll need to track that down and start again.
> > > 
> > 

You said you were on RHEL4 iirc ? Well I'm using RHEL5 and this is my
quick and dirty setup for apache:

 
   
 
   
 

Instead of using the supplied OCF script (which is recommended, btw) I
just call the standard init script for apache, which is httpd on RHEL5.
And I added the monitor operation, which just checks that the process is
alive. In my case it worked when the apache process died, but it will
not check the webserver to see if it really returns an html page (I use
Nagios for this anyway, so I don't really need it).

I also had trouble with the OCF script on RHEL5 that's why I went for
the lsb script. But again, the OCF is recommended as it has more
functionality and lets you monitor the resource better.

Bye,
Tobi



Re: [Linux-HA] Resource becomes unmanaged when trying to reboot server

2009-01-09 Thread Tobias Appel
On Mon, 2008-12-22 at 19:42 +0100, Dejan Muhamedagic wrote:
> Hi,
> 
> On Mon, Dec 22, 2008 at 12:18:06PM +0100, Tobias Appel wrote:
> > Hi,
> > 
> > sorry to bug you guys again before christmas but I have a very weird
> > error.
> > I have a 2 node setup with drbd and Heartbeat 2.1.4. One resource group
> > which contains Nagios (something like BigBrother).
> > 
> > Now I configured everything and did some tests with starting and stopping
> > heartbeat service on the servers - the failover worked. 
> > 
> > But if I run 'shutdown -r now' on the active node the server will not
> > reboot and the resource group will not be moved to the passive node. 
> > When I run crm_mon I can see:
> >  nagios-core (lsb:nagios):   Started node01 (unmanaged) FAILED
> > 
> > The server will do nothing then. It will not reboot, the rest of the
> > resource group is still running! The log file from nagios tells me it
> > correctly shutdown. I did browse through the big big ha-log but I
> > couldn't find anything that would help me.
> > 
> > pengine[27246]: 2008/12/22_11:47:11 WARN: unpack_rsc_op: Processing
> > failed op nagios-core_stop_0 on node01: Error
> > 
> > I really have no idea what to look for or what to do. 
> 
> A resource failed to stop. That's typically a reason to kill the
> node, but you probably don't have stonith setup. If a resource
> can't be stopped and there's no stonith enabled, then that
> resource can't be started anywhere.
> 
> Thanks,
> 
> Dejan

Hi,

and happy new year everybody - just came back from holiday.

You are right I don't have stonith enabled because I don't really
understand it fully yet. I know what it means and what it should do but
I thought it works as fencing in conjunction with a UPS or fibre-channel
switch device.

It is correct that the problem is that the resource can not be stopped -
or at least the CRM thinks it can not be stopped. I had the same problem
with the RedHat Cluster Software on the same server - it also could not
stop the nagios resource and the cluster was in a failed state.

Now what you are saying is that stonith would be my solution. When I
turn off one cluster node and the resource goes into an unmanaged state,
the other node could declare it dead and go online?

Can anyone please point me to a stonith how-to which is not based on a
UPS or something like that? I also couldn't find much about it in the
book from Dr. Schwartzkopff :(
This would be really helpful.

Thanks in advance,

Tobias



[Linux-HA] Resource becomes unmanaged when trying to reboot server

2008-12-22 Thread Tobias Appel
Hi,

sorry to bug you guys again before christmas but I have a very weird
error.
I have a 2 node setup with drbd and Heartbeat 2.1.4. One resource group
which contains Nagios (something like BigBrother).

Now I configured everything and did some tests with starting and stopping
heartbeat service on the servers - the failover worked. 

But if I run 'shutdown -r now' on the active node the server will not
reboot and the resource group will not be moved to the passive node. 
When I run crm_mon I can see:
 nagios-core (lsb:nagios):   Started node01 (unmanaged) FAILED

The server will do nothing then. It will not reboot, the rest of the
resource group is still running! The log file from nagios tells me it
correctly shutdown. I did browse through the big big ha-log but I
couldn't find anything that would help me.

pengine[27246]: 2008/12/22_11:47:11 WARN: unpack_rsc_op: Processing
failed op nagios-core_stop_0 on node01: Error

I really have no idea what to look for or what to do. 

Best Regards,
Tobi



[Linux-HA] Heartbeat restarts all resources when an offline node comes back online

2008-12-22 Thread Tobias Appel
Hi,

I'm not sure if this has been asked before but is it normal that
Heartbeat stops and starts all resources if an offline node comes back
online?

I have a 2 node cluster as test system right now with no location
constraints and no auto-fallback. I have one resource group and one
master / slave resource for drbd. All this is running on Heartbeat 2.1.4
(from SUSE repo for RHEL5)

Now let's assume node01 is the master and thus running the resource
group. I turn off heartbeat on node02 for maintenance, but when I start
heartbeat again on node02 and it comes back online, the complete
resource group is stopped and started on node01 again. Of course this
only takes seconds, but is this really the default behavior?

Best Regards and merry X-mas to the whole list,

Tobias



Re: [Linux-HA] Updated MailTo OCF script

2008-12-19 Thread Tobias Appel
Hi,

I've wanted to test out your MailTo script, so I placed it
under /usr/lib/ocf/resource.d/heartbeat/MailTo2
but when I open the GUI now and try to add a native resource I get the
following message:
Failed to parse the metadata of resource agent: MailTo2

I've had a quick look at the metadata but it looks identical to the
original one. 

Anyone else had this problem?

Regards,
Tobi


On Fri, 2008-12-12 at 12:29 +, Todd, Conor (PDX) wrote:
> Hi everyone.
> 
> I've modified the MailTo OCF script to accept a 'sender' attribute, such 
> that, if given, the 'From' field on the email to be sent is set to whatever 
> you set the sender attribute to.
> 
> This was necessary for me because my nodes were putting invalid domains on 
> the sender address (r...@hostname.local), even 
> though each node has a valid hostname.  Additionally, for me, it's more 
> useful to have emails appear to come from the DNS address which presents the 
> clustered service which is generating the message instead of an arbitrary 
> cluster node (which appears in the title of the email anyway).
> 
> I've attached the updated MailTo script for your enjoyment.
> BTW, the original came from a Heartbeat-2.1.4 install.
> 
> Thanks!
> 
>- Conor



Re: [Linux-HA] Notify when a takeover happens

2008-12-15 Thread Tobias Appel
Hey Mario,

in case you have just recently subscribed to the list, Conor Todd sent
us an updated version of the MailTo script - see his notes below. I
haven't tested it personally yet, but I'm planning to do this today.
Maybe this is helpful to you as well.

Regards,
Tobias

On Monday, 15 December 2008 at 10:18, m...@bortal.de wrote:
> Hello,
>
> is there a way to send out a notification mail if a takeover takes place?
> I could write a script which gets executed by HA, but maybe this is
> already implemented?
>
> Cheers,
> Mario

On Fri, 2008-12-12 at 12:29 +, Todd, Conor (PDX) wrote:
> Hi everyone.
> 
> I've modified the MailTo OCF script to accept a 'sender' attribute, such 
> that, if given, the 'From' field on the email to be sent is set to whatever 
> you set the sender attribute to.
> 
> This was necessary for me because my nodes were putting invalid domains on 
> the sender address (r...@hostname.local), even 
> though each node has a valid hostname.  Additionally, for me, it's more 
> useful to have emails appear to come from the DNS address which presents the 
> clustered service which is generating the message instead of an arbitrary 
> cluster node (which appears in the title of the email anyway).
> 
> I've attached the updated MailTo script for your enjoyment.
> BTW, the original came from a Heartbeat-2.1.4 install.
> 
> Thanks!
> 
>- Conor



Re: [Linux-HA] DRBD not syncing

2008-12-12 Thread Tobias Appel
Thanks, it really was a split-brain. I don't know how I ended up with
it, but the DRBD documentation helped me out.

drbdadm -- --discard-my-data connect resource
drbdadm connect resource

These two commands helped me out. Now it's working again.
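For the record, the recovery sequence in the DRBD documentation also
demotes the victim first; a sketch, with `resource` standing for the
resource name from drbd.conf:

```
# On the node whose changes will be discarded (the split-brain "victim"):
drbdadm secondary resource
drbdadm -- --discard-my-data connect resource

# On the surviving node (only needed if it also dropped to StandAlone):
drbdadm connect resource
```

After the reconnect, the survivor resynchronizes its data to the victim
automatically.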

Thanks for the hint.

Bye,
Tobias

On Fri, 2008-12-12 at 13:09 +0100, Andreas Haase wrote:
> Hi,
> 
> > The active node says:  0: cs:WFConnection st:Primary/Unknown
> > ds:UpToDate/DUnknown C r---
> >
> > Whereas the passive node tells me:  0: cs:StandAlone
> > st:Secondary/Unknown ds:UpToDate/DUnknown   r---
> 
> sounds like split-brain. Take a look into your systems log files to check
> this.
> 
> Bye,
> Andreas
> 



[Linux-HA] DRBD not syncing

2008-12-12 Thread Tobias Appel
Hi,

I have a weird problem with DRBD.
I've created my Master / Slave Resource for the DRBD Service and got
another resource group which will mount the filesystem. I'm also using
the correct constraints and so on.
If I have a failover Heartbeat will set the active node to primary and
then mount the filesystem. So far so good.
But DRBD is not syncing between those nodes.
The active node says:  0: cs:WFConnection st:Primary/Unknown
ds:UpToDate/DUnknown C r---

Whereas the passive node tells me:  0: cs:StandAlone
st:Secondary/Unknown ds:UpToDate/DUnknown   r---


I haven't touched the DRBD config file but before I added it to
Heartbeat I tested it manually and it worked and was syncing just fine.

I don't even know where to start to look for an error. drbd.conf ? The
configuration of my constraints or the m/s resource?

Any advice is much appreciated!

Regards,
Tobias 



[Linux-HA] Change the order within a resource group

2008-12-12 Thread Tobias Appel
Hi,

how can I change the order within a resource group?
The UI has buttons for this but they do nothing at all.
Should I save my resources in a temp xml file and the rearrange them
manually and then update my CIB, or even drop my resource section before
doing it?
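For reference, one workaround is a round trip through cibadmin, scoped
to the resources section so the rest of the CIB stays untouched; a
sketch (flags as in Heartbeat 2.x cibadmin, file name illustrative):

```
# Dump only the resources section of the CIB:
cibadmin -Q -o resources > resources.xml
# Reorder the <primitive> children of the <group> element in the file,
# then replace the resources section with the edited version:
cibadmin -R -o resources -x resources.xml
```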

Regards,
Tobias



Re: [Linux-HA] Need help with DRBD constraints please

2008-12-08 Thread Tobias Appel
On Mon, 2008-12-08 at 10:07 +0100, Dominik Klein wrote:

> change the colocation and order to point to the group, not the fs.
> 

Thanks a lot, the above was the important part. It seems to work fine
now!

Bye,
Tobias



Re: [Linux-HA] Need help with DRBD constraints please

2008-12-08 Thread Tobias Appel
The Problem with my constraints can be seen quite clear in crm_mon in
this case:

Node: node02 (89cd12ec-240b-472e-acac-514125719494): online
Node: node01 (64a97ed3-da5a-481d-b7d8-fe6cf26544f9): online

Resource Group: nagios
httpd   (lsb:httpd):Started node02
nagios-vip  (ocf::heartbeat:IPaddr2):   Started node02
nagios-core (lsb:nagios):   Started node02
mysql   (ocf::heartbeat:mysql): Started node02
drbd_fs (ocf::heartbeat:Filesystem):Stopped
Master/Slave Set: ms_drbd
drbd:0  (ocf::heartbeat:drbd):  Started node02
drbd:1  (ocf::heartbeat:drbd):  Master node01

node01 is set to master for DRBD, but the resource group runs on node02,
so it cannot mount the filesystem. Somehow it seems like I have an
auto-failback: even though I did not set any constraints for a preferred
node, the cluster still wants the resource group to run only on node02
as soon as it is online, for some reason.

I'm not quite sure how I can tell it to leave the group where it is (the
score system is quite confusing to me).

If I shutdown node02 now the resource group will come up on node01 but
as soon as node02 goes online again the resource group fails back to
node02 but without the drbd setting node02 to master.


On Mon, 2008-12-08 at 09:37 +0100, Tobias Appel wrote:
> Hi,
> 
> I still have trouble setting up DRBD v8. I'm not sure if it is related
> to the OCF bug found by Marc Cousin. I think it has something to do with
> my constraints not working properly.
> 
> I have one resource group which should start my services in the
> following order: set IP adress, mount DRBD Filesystem, start webservices
> and so on (the order in my CIB is currently mixed up and I can't seem to
> change it via the GUI but I will work on that later, right now the
> Filesystem does NOT need to be mounted for any of the other services but
> later on it will need to be).
> 
> Then I have created my Master / Slave resource for DRBD which I could
> not add to the group. Afaik a M/S resource can not be part of a group.
> 
> Now I need a constraint that the resource group runs on the same node
> which is the DRBD master. I tried it with the setup found in Dr.
> Schwartzkopffs book but I could not get it to work.
> 
> Maybe it does not work because I'm using the resource to mount the
> filesystem inside the resource group? Do I need to have a single
> resource for the filesystem and then add a 2nd order constraints to
> start the group after the filesystem?
> 
> Right now it does not even mount the filesystem anymore - without any
> constraints I at least got it working once (was a 50:50 chance,
> depending which node is currently the master).
> 
> I have attached my current CIB.
> 
> Thanks in advance for your help,
> 
> Tobias
> 
> 



[Linux-HA] Need help with DRBD constraints please

2008-12-08 Thread Tobias Appel
Hi,

I still have trouble setting up DRBD v8. I'm not sure if it is related
to the OCF bug found by Marc Cousin. I think it has something to do with
my constraints not working properly.

I have one resource group which should start my services in the
following order: set IP address, mount the DRBD filesystem, start
webservices and so on (the order in my CIB is currently mixed up and I
can't seem to change it via the GUI, but I will work on that later;
right now the filesystem does NOT need to be mounted for any of the
other services, but later on it will need to be).

Then I have created my Master / Slave resource for DRBD which I could
not add to the group. Afaik a M/S resource can not be part of a group.

Now I need a constraint that the resource group runs on the same node
which is the DRBD master. I tried it with the setup found in Dr.
Schwartzkopffs book but I could not get it to work.

Maybe it does not work because I'm using the resource to mount the
filesystem inside the resource group? Do I need to have a single
resource for the filesystem and then add a 2nd order constraints to
start the group after the filesystem?

Right now it does not even mount the filesystem anymore - without any
constraints I at least got it working once (was a 50:50 chance,
depending which node is currently the master).

I have attached my current CIB.

Thanks in advance for your help,

Tobias


 
   
 
   
 
 
 
 
 
   
 
 
   
 
   
 
 
   
 
   
 
   
   
 
   
 
 
   
 
   
 
 
   
 
   
   
   
   
   
 
   
 
 
   
 
   
   
   
 
   
   
 
   
 
   
   
 
   
 
 
 
 
 
 
 
   
 
 
   
 
   
   
   
   
   
   
 
   
 
   
 

 
   
 
   
 
   
   
   
 


[Linux-HA] DRBD 8 with Heartbeat V2

2008-12-05 Thread Tobias Appel
Hi,

I remember the survey on the mailinglist not long ago but I don't really
remember the results. 
Is DRBD 8 currently working with Heartbeat 2.1.4?
I got the book from Dr. Schwartzkopff in which he explains an example
with DRBD 0.7 and I tried to use this example with DRBD 8 but it fails
for (yet) unknown reasons. 

I got DRBD 8 to work with Heartbeat V1 easily. The only difference is
that I had the DRBD service started automatically on both nodes at
startup, and Heartbeat just mounted the filesystem and made the active
node primary. I used drbddisk and the Filesystem script in that
environment.

Now for Heartbeat V2 I created a Master/Slave resource for the drbd
service, then a primitive for the Filesystem within my resource group,
and an order and colocation constraint, just like in the book. But right
now it fails to start and I can't start my resource group anymore.

So if there are known problems with the OCF Agent and DRBD 8 please let
me know. I don't know yet how to 'downgrade' to DRBD 0.7 because I used
yum to install it on RHEL5 but I guess I will have to manually compile
it then (unless there is a way to tell yum to install version 0.7
somehow).

Thanks in advance,

Tobias



[Linux-HA] How to send an email after failover?

2008-12-05 Thread Tobias Appel
Hi,

I wonder how to implement a notification when a failover of a resource
(group) has occurred. In Heartbeat 1 this was very simply done in the
haresources file.
But how can I do this in Heartbeat V2? I just want one notification sent
out if any resource fails over to another node.
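One common answer is to add the MailTo OCF agent as a member of the
resource group, so a mail goes out whenever the group is started or
stopped on a node; a sketch in Heartbeat 2 CIB XML (ids and address
illustrative):

```
<primitive id="notify_admin" class="ocf" provider="heartbeat" type="MailTo">
  <instance_attributes id="notify_admin_ia">
    <attributes>
      <nvpair id="notify_admin_email" name="email" value="admin@example.com"/>
      <nvpair id="notify_admin_subject" name="subject" value="Cluster failover"/>
    </attributes>
  </instance_attributes>
</primitive>
```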

Thanks in advance,

Tobias



[Linux-HA] Can not restart a node while it's connected to the cluster

2008-12-01 Thread Tobias Appel
Hi,

I have installed Heartbeat on 2 test machines running inside VMware
(running RHEL5) and configured a native / primitive resource (virtual IP
and Apache). But when I try to restart a node with 'shutdown -r now' it
does not restart but just halts with the error message attached as a
screenshot (I can't copy & paste it).
This happened to me when using the latest Heartbeat release 2.99 with
Pacemaker 1.0 but also with Heartbeat 2.1.4 - same error message and the
machine just stops without restarting.
I have to press the virtual reset button in order to get it working
again. Of course other services are stopped correctly so I can't ssh
into it anymore.
If I manually stop the heartbeat service it does work and all the
services fail over correctly and I can then reboot the machine.

Any suggestions what I can try to avoid this? In a real environment I
can't always run to the datacentre to physically reset the machine.

Thanks in advance,

Tobias

Re: [Linux-HA] Need some help with environmental variables please ($OCF_ROOT not found)

2008-11-24 Thread Tobias Appel
Thanks a lot! That did work.
If it is set by the cluster I don't have to worry about it either. Good
to know.

Regards,
Tobias

On Mon, 2008-11-24 at 09:12 +0100, Dominik Klein wrote:
> OCF_ROOT is set by the cluster.
> 
> If you want to test stuff on the shell, you should:
> 
> export OCF_ROOT=/usr/lib/ocf
> 
> or whatever is your OCF_ROOT. It does _not_ hold the path to the 
> individual script.
> 
> Regards
> Dominik
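To expand on the export: besides OCF_ROOT, agents receive their
parameters as OCF_RESKEY_<name> environment variables, so a by-hand test
has to export those too. A minimal sketch of the mechanism (parameter
value illustrative; the echo stands in for the path that line 32 of
IPaddr builds):

```shell
# What the cluster exports before invoking an agent:
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_ip=192.168.1.10   # the "ip" parameter of IPaddr

# Line 32 of IPaddr expands to this once OCF_ROOT is set:
helpers="${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs"
echo "$helpers"
```

With those variables set, calling the agent directly with an action
argument (start, stop, monitor, meta-data) behaves as it would under the
cluster.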




[Linux-HA] Need some help with environmental variables please ($OCF_ROOT not found)

2008-11-24 Thread Tobias Appel
Morning,

sorry for asking these questions but I'm kinda stuck here.
I installed the rpms from Opensuse for RHEL5 and last week I had
problems getting my OCF resources started. 
So I tried to start one of the OCF scripts manually, just to see its
output:

[EMAIL PROTECTED] heartbeat]# ./IPaddr
./IPaddr: line 32: /resource.d/heartbeat/.ocf-shellfuncs: No such file
or directory

Ok so I go to line 32 and it says:

. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

So the problem is that the OCF_ROOT variable is not set, and the script
can't find .ocf-shellfuncs. If I replace OCF_ROOT with the full path,
the script does not return an error.
I tried setting the OCF_ROOT variable but it does not work if I just
type in:
OCF_ROOT=/usr/lib/ocf
the script still can't find the files. So I had a look at the other
files and tried this from .ocf-shellfuncs:

[EMAIL PROTECTED] heartbeat]# ${OCF_ROOT=/usr/lib/ocf}
-bash: /usr/lib/ocf: is a directory

Now oddly enough, this morning my resource came up without any problems.
I have no idea why, or if it is even related to this environment-variable
issue. Maybe I'm beating a dead horse here since it is not an error, or
should I continue to look into it?

I tried to find any file which specifies OCF_ROOT but the only file I
found would be .ocf-shellfuncs which doesn't return an error when
executed but it will still not work. /etc/ha.d/shellfuncs does not
declare OCF_ROOT.

Well just let me know please what you think. Should I further
investigate this or just ignore it?

Best Regards,
Tobias



Re: [Linux-HA] Re: error on compiling HA on Redhat 5

2008-11-21 Thread Tobias Appel
On Fri, 2008-11-21 at 05:54 -0800, haresh ghoghari wrote:
> Dear Friends,
> I am new to Linux-HA.
> 
> I am trying to install Linux-HA on Red Hat 5 and I am getting the following error 
> while running ./ConfigureMe make
> 
>  Heartbeat-STABLE-2-1-STABLE-2.1.4
> 
> Step  1 : This is working fine  
> ./ConfigureMe 
> configure  --prefix=/naccrraware/ha --sysconfdir=/naccrraware/ha/etc
> --localstatedir=/naccrraware/ha/var --mandir=/usr/share/man
> --disable-rpath
> 
> Step 2 : 
> ./ConfigureMe make
> 
> 
> gmake[2]: Leaving directory 
> `/naccrraware/softwares/Heartbeat-STABLE-2-1-STABLE-2.1.4/telecom/apphbd'
> Making all in recoverymgrd
> gmake[2]: Entering directory 
> `/naccrraware/softwares/Heartbeat-STABLE-2-1-STABLE-2.1.4/telecom/recoverymgrd'
> if
> gcc -DHAVE_CONFIG_H -I. -I. -I../../include -I../../include
> -I../../include -I../../include -I../../libltdl -I../../libltdl
> -I../../linux-ha -I../../linux-ha  -D_BSD_SOURCE -D__BSD_SOURCE
> -D__FAVOR_BSD -DHAVE_NET_ETHERNET_H  -I/usr/include/glib-2.0
> -I/usr/lib/glib-2.0/include   -I/usr/include/libxml2  -g -O2  -Wall
> -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes
> -Wdeclaration-after-statement -Wpointer-arith -Wwrite-strings
> -Wcast-qual -Wcast-align -Wbad-function-cast -Winline
> -Wmissing-format-attribute -Wformat=2 -Wformat-security
> -Wformat-nonliteral -Wno-long-long -Wno-strict-aliasing -Werror  -ggdb3
> -funsigned-char -MT recoverymgrd-conf_yacc.o -MD -MP -MF
> ".deps/recoverymgrd-conf_yacc.Tpo" -c -o recoverymgrd-conf_yacc.o `test
> -f 'conf_yacc.c' || echo './'`conf_yacc.c; \
>
> then mv -f ".deps/recoverymgrd-conf_yacc.Tpo"
> ".deps/recoverymgrd-conf_yacc.Po"; else rm -f
> ".deps/recoverymgrd-conf_yacc.Tpo"; exit 1; fi
> cc1: warnings being treated as errors
> conf_yacc.c:14: warning: function declaration isn't a prototype
> conf_yacc.c:190: warning: function declaration isn't a prototype
> gmake[2]: *** [recoverymgrd-conf_yacc.o] Error 1
> gmake[2]: Leaving directory 
> `/naccrraware/softwares/Heartbeat-STABLE-2-1-STABLE-2.1.4/telecom/recoverymgrd'
> gmake[1]: *** [all-recursive] Error 1
> gmake[1]: Leaving directory 
> `/naccrraware/softwares/Heartbeat-STABLE-2-1-STABLE-2.1.4/telecom'
> make: *** [all-recursive] Error 1
> 
> 
> Thanks
> -Haresh
> 

For RHEL5 you can also use the rpms from here:
http://download.opensuse.org/repositories/server:/ha-clustering/RHEL_5/

These should work for you.

Have a nice weekend.

Tobias 



Re: [Linux-HA] Outdated documentation for manually editing CIB

2008-11-21 Thread Tobias Appel
On Fri, 2008-11-21 at 13:04 +0100, Michael Schwartzkopff wrote:
> Am Freitag, 21. November 2008 12:39 schrieb Tobias Appel:
> > Hi,
> >
> > sorry for asking so many newbie questions but I just started out with
> > Heartbeat V2. After reading through all the documentation and even
> > watching the presentation from Linux.Conf.Au in 2007 I wanted to start
> > adding my first resource to the cluster.
> > But every time I get the following error:
> >
> > Call cib_replace failed (-47): Update does not conform to the configured
> > schema/DTD
> > 
> >
> >
> > I tried to copy & paste stuff from the Novell Documentation here:
> > http://www.novell.com/documentation/sles10/heartbeat/index.html?page=/docum
> >entation/sles10/heartbeat/data/heartbeat.html
> >
> > Or from the slides of the Linux.Conf.Au presentation. I tried simple
> > examples like this:
> >
> > <primitive class="ocf" type="IPaddr" provider="heartbeat">
> >  <instance_attributes>
> >   <attributes>
> >    <nvpair name="ip" value="192.168.224.5"/>
> >   </attributes>
> >  </instance_attributes>
> > </primitive>
> 
> The <nvpair> does not have an id. Add it.
> 

I think I need even more. I played around with the GUI for a bit and
once I finally got it to accept my input I extracted my configuration
to an XML file, and it looks more like this:

<primitive id="..." class="ocf" type="IPaddr" provider="heartbeat">
  <instance_attributes id="...">
    <attributes>
      <nvpair id="..." name="ip" value="192.168.224.5"/>
    </attributes>
  </instance_attributes>
  <meta_attributes id="...">
    <attributes/>
  </meta_attributes>
</primitive>
So we have instance and meta attributes now, and not just an attributes
section. Just adding an ID to nvpair didn't do the trick.
That's why I was asking for updated documentation.
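For comparison, a fragment in the shape the GUI produces, with an id on every element the DTD insists on (the id values below are illustrative, not taken from the original mail), would look like:

```xml
<primitive id="ip_1" class="ocf" type="IPaddr" provider="heartbeat">
  <instance_attributes id="ip_1_instance_attrs">
    <attributes>
      <nvpair id="ip_1_ip" name="ip" value="192.168.224.5"/>
    </attributes>
  </instance_attributes>
  <meta_attributes id="ip_1_meta_attrs">
    <attributes/>
  </meta_attributes>
</primitive>
```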

Regards,
Tobias




[Linux-HA] Outdated documentation for manually editing CIB

2008-11-21 Thread Tobias Appel
Hi,

sorry for asking so many newbie questions but I just started out with
Heartbeat V2. After reading through all the documentation and even
watching the presentation from Linux.Conf.Au in 2007 I wanted to start
adding my first resource to the cluster.
But every time I get the following error:

Call cib_replace failed (-47): Update does not conform to the configured
schema/DTD



I tried to copy & paste stuff from the Novell Documentation here: 
http://www.novell.com/documentation/sles10/heartbeat/index.html?page=/documentation/sles10/heartbeat/data/heartbeat.html

Or from the slides of the Linux.Conf.Au presentation. I tried simple
examples like this:

<primitive class="ocf" type="IPaddr" provider="heartbeat">
 <instance_attributes>
  <attributes>
   <nvpair name="ip" value="192.168.224.5"/>
  </attributes>
 </instance_attributes>
</primitive>

Now am I just being stupid or did the software change in the meantime
and I can scrap all the old documentation?
I don't really manage to add resources with the UI either since the
documentation on the UI is really outdated as well.

So is it just me or do I need a revised documentation?

Best Regards,
Tobias 



RE: [Linux-HA] Python error with hb_gui on RHEL5

2008-11-21 Thread Tobias Appel
On Fri, 2008-11-21 at 10:04 +, David Lee wrote:
> On Fri, 21 Nov 2008, Tobias Appel wrote:
> 
> > On Thu, 2008-11-20 at 21:40 +0900, HIDEO YAMAUCHI wrote:
> >
> >> I built the latest GUI from source.
> >> I got the same error as you.
> >>
> >> However, the error goes away once the libpacemaker-devel package is installed.
> >> Did you install the libpacemaker-devel package?
> >
> > I really did forget to install this package, only had the pygui-devel
> > package. I did so this morning but I still receive an error when trying
> > to compile, but this time it's different:
> >
> > gcc -I/usr/include/heartbeat -I/usr/include/pacemaker -fgnu89-inline
> > -Wall -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes
> > -Wdeclaration-after-statement -Wpointer-arith -Wwrite-strings
> > -Wcast-qual -Wcast-align -Wbad-function-cast -Winline
> > -Wmissing-format-attribute -Wformat=2 -Wformat-security
> > -Wformat-nonliteral -Wno-long-long -Wno-strict-aliasing -Werror -ansi
> > -D_GNU_SOURCE -DANSI_ONLY -ggdb3 -funsigned-char -o .libs/mgmtd
> > mgmtd.o  ../../lib/mgmt/.libs/libhbmgmtclient.so -L/lib
> > -L/usr/lib/openais ../../lib/mgmt/.libs/libhbmgmttls.so 
> > ./.libs/libhbmgmt.so 
> > /root/Pacemaker-Python-GUI-1bcbdb6cc281/lib/mgmt/.libs/libhbmgmttls.so 
> > -lgnutls -lcib -lcrmcommon -lapphb -lpe_status -lhbclient -lccmclient -lclm 
> > -lSaMsg -llrm ../../lib/mgmt/.libs/libhbmgmtcommon.so -lglib-2.0 -lplumb 
> > -lxml2 -lc -lpam -lrt -ldl  -Wl,--rpath -Wl,/usr/lib/openais
> > /usr/lib/gcc/i386-redhat-linux/4.1.2/../../../libpe_status.so: undefined
> > reference to `stdscr'
> > /usr/lib/gcc/i386-redhat-linux/4.1.2/../../../libpe_status.so: undefined
> > reference to `wmove'
> > /usr/lib/gcc/i386-redhat-linux/4.1.2/../../../libpe_status.so: undefined
> > reference to `printw'
> > collect2: ld returned 1 exit status
> 
> Those look like "curses" or "ncurses" routines for handling typing and 
> cursor movement across a screen (as in "vi", etc.).
> 
> My guess is that you'll need some sort of "ncurses" package, and possibly 
> a corresponding "ncurses-devel".
> 
> I'm not a Linux expert.  But I often find that installing an rpm and 
> building (compiling) that same software from source can be different.  If 
> the simple install requires (say) "pkgA" and "pkgB", building it often 
> (usually?) also requires "pkgA-devel" and "pkgB-devel".
> 
> So if you look inside the rpm to see what packages it requires for 
> installation (e.g. "pkgA", "pkgB", etc.) there can be a reasonable chance 
> that for building it you'll also need the corresponding "-devel" rpm.
> 
> Hope that helps.
> 

Hi David,

your guess was absolutely correct. I only needed ncurses-devel package -
it finally compiled :)
I don't know whether this package should have been part of a standard
installation; if so, the GUI should compile on any RHEL5 system once
you have installed all the -devel packages for Heartbeat and Pacemaker
(& GUI).

Thanks for all the help!

Regards,
Tobi



RE: [Linux-HA] Python error with hb_gui on RHEL5

2008-11-21 Thread Tobias Appel
On Thu, 2008-11-20 at 21:40 +0900, HIDEO YAMAUCHI wrote:

> I built the latest GUI from source.
> I got the same error as you.
> 
> However, the error goes away once the libpacemaker-devel package is installed.
> Did you install the libpacemaker-devel package?

I really did forget to install this package, only had the pygui-devel
package. I did so this morning but I still receive an error when trying
to compile, but this time it's different:

gcc -I/usr/include/heartbeat -I/usr/include/pacemaker -fgnu89-inline
-Wall -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes
-Wdeclaration-after-statement -Wpointer-arith -Wwrite-strings
-Wcast-qual -Wcast-align -Wbad-function-cast -Winline
-Wmissing-format-attribute -Wformat=2 -Wformat-security
-Wformat-nonliteral -Wno-long-long -Wno-strict-aliasing -Werror -ansi
-D_GNU_SOURCE -DANSI_ONLY -ggdb3 -funsigned-char -o .libs/mgmtd
mgmtd.o  ../../lib/mgmt/.libs/libhbmgmtclient.so -L/lib
-L/usr/lib/openais ../../lib/mgmt/.libs/libhbmgmttls.so ./.libs/libhbmgmt.so 
/root/Pacemaker-Python-GUI-1bcbdb6cc281/lib/mgmt/.libs/libhbmgmttls.so -lgnutls 
-lcib -lcrmcommon -lapphb -lpe_status -lhbclient -lccmclient -lclm -lSaMsg 
-llrm ../../lib/mgmt/.libs/libhbmgmtcommon.so -lglib-2.0 -lplumb -lxml2 -lc 
-lpam -lrt -ldl  -Wl,--rpath -Wl,/usr/lib/openais
/usr/lib/gcc/i386-redhat-linux/4.1.2/../../../libpe_status.so: undefined
reference to `stdscr'
/usr/lib/gcc/i386-redhat-linux/4.1.2/../../../libpe_status.so: undefined
reference to `wmove'
/usr/lib/gcc/i386-redhat-linux/4.1.2/../../../libpe_status.so: undefined
reference to `printw'
collect2: ld returned 1 exit status
gmake[2]: *** [mgmtd] Error 1
gmake[2]: Leaving directory
`/root/Pacemaker-Python-GUI-1bcbdb6cc281/mgmt/daemon'
gmake[1]: *** [all-recursive] Error 1
gmake[1]: Leaving directory
`/root/Pacemaker-Python-GUI-1bcbdb6cc281/mgmt'
gmake: *** [all-recursive] Error 1

I now have the following packages installed:

libpacemaker-devel.i386  1.0.1-2.1  
libpacemaker3.i386   1.0.1-2.1  
pacemaker.i386   1.0.1-2.1  
pacemaker-pygui.i386 1.4-11.8   
pacemaker-pygui-devel.i386   1.4-11.8   
heartbeat.i386   2.99.2-6.1 
heartbeat-common.i3862.99.2-6.1 
heartbeat-devel.i386 2.99.2-6.1 
heartbeat-resources.i386 2.99.2-6.1 
libheartbeat-devel.i386  2.99.2-6.1 
libheartbeat2.i386   2.99.2-6.1 
libopenais-devel.i3860.80.3-12.2
libopenais2.i386 0.80.3-12.2
openais.i386 0.80.3-12.2


In the meantime I got it compiled on Fedora 9, so it's not that
pressing an issue for me right away, but it would still be nice if the
GUI ran on RHEL5.

Best Regards,
Tobias Appel



Re: [Linux-HA] Python error with hb_gui on RHEL5

2008-11-20 Thread Tobias Appel
On Thu, 2008-11-20 at 18:55 +0800, Yan Gao wrote:

> 
> > 
> > Should I try to install the GUI on a separate machine which is not in
> > the cluster, and then configure the cluster from 'the outside'?
> It's up to you. The GUI can run either locally or remotely (TCP/IP,
> huh?:-)). But you need at least a backend running on one of the cluster
> nodes.

> Did you install the pacemaker-devel package?

Yes I did. Still got the same error when trying 'make'.

But I got it fully working now on a different machine running Fedora 9.
Without any Python errors. So far so good. Thanks for your help!

Regards,
Tobias Appel



Re: [Linux-HA] Python error with hb_gui on RHEL5

2008-11-20 Thread Tobias Appel
On Thu, 2008-11-20 at 18:19 +0900, [EMAIL PROTECTED] wrote:
> Hi,
> 
> > So did anyone get it to work on RHEL5? Right now hb_gui is not usable
> > for me in any way.
> 
> The GUI works on my RHEL5.2.

I am using RHEL 5.2 as well, but I only installed the rpms from
openSUSE. And I think I ran into this error:

http://hg.clusterlabs.org/pacemaker/pygui/rev/460feb8039c1

But even with the newest builds from openSUSE from Nov. 19th I still get
the same error when using the GUI.

Should I try to install the GUI on a separate machine which is not in
the cluster, and then configure the cluster from 'the outside'?

> 
> Which version of the GUI are you using? 
> Please compile the latest GUI, and please use it.
> 
> You can get the latest version from the following link.
>  - http://hg.clusterlabs.org/pacemaker/pygui/
> 
> Best Regards,
> Hideo Yamauchi.
> 


Domo arigatou for the link. I downloaded the latest release and tried to
compile it manually. But when I try 'make' I get the following error:

mgmt_lib.c:44:21: error: crm/crm.h: No such file or directory
mgmt_lib.c: In function 'init_mgmt_lib':
mgmt_lib.c:82: error: 'free' undeclared (first use in this function)
mgmt_lib.c:82: error: (Each undeclared identifier is reported only once
mgmt_lib.c:82: error: for each function it appears in.)
mgmt_lib.c:86: error: 'malloc' undeclared (first use in this function)
mgmt_lib.c:86: error: 'realloc' undeclared (first use in this function)
cc1: warnings being treated as errors
mgmt_lib.c:90: warning: implicit declaration of function
'is_heartbeat_cluster'
mgmt_lib.c: In function 'reg_msg':
mgmt_lib.c:138: warning: implicit declaration of function 'strdup'
mgmt_lib.c:138: warning: passing argument 2 of 'g_hash_table_insert'
makes pointer from integer without a cast
mgmt_lib.c: In function 'reg_event':
mgmt_lib.c:182: warning: passing argument 2 of 'g_hash_table_replace'
makes pointer from integer without a cast
gmake[2]: *** [libhbmgmt_la-mgmt_lib.lo] Error 1
gmake[2]: Leaving directory
`/root/Pacemaker-Python-GUI-1bcbdb6cc281/mgmt/daemon'
gmake[1]: *** [all-recursive] Error 1
gmake[1]: Leaving directory
`/root/Pacemaker-Python-GUI-1bcbdb6cc281/mgmt'
make: *** [all-recursive] Error 1

Sayonara,

Tobias Appel



[Linux-HA] Python error with hb_gui on RHEL5

2008-11-20 Thread Tobias Appel
Hi,

I get the following error when trying to open any sub-menu within hb_gui
(like edit for a resource).
I installed all packages from openSUSE and just updated them to
yesterday's build, but the error is still the same.

I think it might be due to my pygobject version which is too low
(2.12.1-5.el5) - but there is no later package available for RHEL5. Also
manually trying to compile the GUI fails with a lot of errors - I
haven't managed to get it to work. :(
I tried to replace just haclient.py.in from the official site, but this
by itself didn't work for me either.

So did anyone get it to work on RHEL5? Right now hb_gui is not usable
for me in any way.

Error Message below:

Traceback (most recent call last):
  File "/usr/bin/hb_gui", line 2822, in on_add
objdlg = ObjectViewDlg(new_elem, True)
  File "/usr/bin/hb_gui", line 3900, in __init__
obj_view = ObjectView(self.xml_node, is_newobj, self.on_changed)
  File "/usr/bin/hb_gui", line 1994, in __init__
self.update(xml_node)
  File "/usr/bin/hb_gui", line 2130, in update
self.on_after_show()
  File "/usr/bin/hb_gui", line 2138, in on_after_show
self.obj_attrs.on_after_show()
  File "/usr/bin/hb_gui", line 6471, in on_after_show
for widget in self.widgets[widget_type].values() :
SystemError: Objects/funcobject.c:128: bad argument to internal function


Best Regards,
Tobias Appel




Re: Re : [Linux-HA] hb_gui cannot connect to server (on RHEL5)

2008-11-14 Thread Tobias Appel

Hi,

thanks for the link to the guide - I didn't find it before on the 
website. I was really just missing those 2 lines in the configuration.

Now I can connect with the GUI to the cluster.

Thanks again and also thanks to Dr. Schwartzkopff for his help on this 
matter. Greetings from Haar to Grasbrunn :)


Regards,
Tobias Appel

Mohamed SABIR wrote:


Hi,

Did you add these lines to the ha.cf file?

apiauth mgmtd   uid=root
respawn root /usr/lib/heartbeat/mgmtd -v

You must restart the heartbeat service afterwards. And verify (with ps -ef | 
grep mgmtd) that the mgmtd process is started.


See "http://www.linux-ha.org/GuiGuide" for more information.

Regards

----
*From:* Tobias Appel <[EMAIL PROTECTED]>
*To:* linux-ha@lists.linux-ha.org
*Sent:* Monday, 10 November 2008, 11:44:03
*Subject:* [Linux-HA] hb_gui cannot connect to server (on RHEL5)

Hi,

I've installed Heartbeat on RHEL5 using the opensuse repository. After
installing heartbeat and pacemaker and pacemaker-pygui I did the initial
configuration of ha.cf and started the service.
crm_mon shows something like this:


Last updated: Mon Nov 10 12:40:41 2008
Current DC: node02 (89cd12ec-240b-472e-acac-514125719494)
2 Nodes configured.
0 Resources configured.


Node: node01 (64a97ed3-da5a-481d-b7d8-fe6cf26544f9): online
Node: node02 (89cd12ec-240b-472e-acac-514125719494): online

However if I start hb_gui and try to connect any of those servers I
always get: "Can not connect to server XYZ".
I have verified that the user hacluster is there and set a password,
also the group haclient is there.
I can not find anything in the ha-log or in the messages log.

Please tell me what else can I check to get the UI working?

Thanks in advance,

Tobias






[Linux-HA] hb_gui cannot connect to server (on RHEL5)

2008-11-10 Thread Tobias Appel
Hi,

I've installed Heartbeat on RHEL5 using the opensuse repository. After
installing heartbeat and pacemaker and pacemaker-pygui I did the initial
configuration of ha.cf and started the service.
crm_mon shows something like this:


Last updated: Mon Nov 10 12:40:41 2008
Current DC: node02 (89cd12ec-240b-472e-acac-514125719494)
2 Nodes configured.
0 Resources configured.


Node: node01 (64a97ed3-da5a-481d-b7d8-fe6cf26544f9): online
Node: node02 (89cd12ec-240b-472e-acac-514125719494): online

However if I start hb_gui and try to connect any of those servers I
always get: "Can not connect to server XYZ".
I have verified that the user hacluster is there and set a password,
also the group haclient is there.
I can not find anything in the ha-log or in the messages log.

Please tell me what else can I check to get the UI working?

Thanks in advance,

Tobias
