Re: [Linux-HA] NFS server not started by heartbeat

2007-05-08 Thread Martijn Grendelman
Hi Yan,

 I am trying to build a 2-node cluster serving DRBD+NFS, among other
 things. It has been operational on Debian Sarge, with Heartbeat 1.2, but
 recently, both machines were upgraded to Debian Etch, and today I
 upgraded Heartbeat to 2.0.7. I maintained the R1 style configuration.
 Heartbeat is running in an active/passive fashion.
>> [snip]
>>
>>> We run /etc/init.d/nfs-kernel-server status before starting it.  If it
>>> says OK or running, then we don't start it because it's already running.
>>>
>>> See  http://linux-ha.org/HeartbeatResourceAgent
>> Thank you for the information.
>>
>> There is one other problem that I haven't been able to solve, and I hope
>> someone can help me with that too.
>>
>> Sometimes it happens that Heartbeat tries to take over a resource group
>> that it's already running:
>>
>> [EMAIL PROTECTED]:~> cl_status rscstatus
>> all
>>
>> [EMAIL PROTECTED]:~> cl_status rscstatus
>> none
>>
>> Now, when I shutdown or reboot Vodka, I would expect nothing much to
>> happen in the cluster, but instead, Heartbeat on Whisky, the node that's
>> already running things, says:
>>
>> May  7 17:21:34 whisky mach_down[11872]: [11888]: info: Taking over
>> resource group 213.207.104.20
>> May  7 17:21:34 whisky ResourceManager[11889]: [11897]: info: Acquiring
>> resource group: vodka 213.207.104.20 ipvsadm mon drbddisk::all
>> Filesystem::/dev/drbd0::/extra1::ext3 nfs-kernel-server Delay::3::0
>> IPaddr::10.50.1.20/32/eth0 mysql
>>
>> and it starts running init scripts with the 'start' argument. This is
>> bound to fail, so:
>>
>> May  7 17:21:34 whisky ResourceManager[11889]: [12047]: debug: Starting
>> /etc/init.d/mon  start
>> May  7 17:21:34 whisky ResourceManager[11889]: [12052]: debug:
>> /etc/init.d/mon  start done. RC=1
>> May  7 17:21:34 whisky ResourceManager[11889]: [12053]: ERROR: Return
>> code 1 from /etc/init.d/mon
>> May  7 17:21:34 whisky ResourceManager[11889]: [12054]: CRIT: Giving up
>> resources due to failure of mon
>> May  7 17:21:34 whisky ResourceManager[11889]: [12055]: info: Releasing
>> resource group: vodka 213.207.104.20 ipvsadm mon drbddisk::all
>> Filesystem::/dev/drbd0::/extra1::ext3
>> nfs-kernel-server Delay::3::0 IPaddr::10.50.1.20/32/eth0 mysql
>>
>> ... and down goes my entire cluster!!!
>>
>> Why does Heartbeat want to start a resource group that it already runs?
> 
> because mon (whatever that init script is) returned 1 on the start
> action. a return value of 1 indicates to heartbeat that the operation
> failed, and heartbeat can't safely do anything else with that resource.
> 
> Basically, there is a problem with the Resource Agent (RA) "mon". See:
> 
> http://www.linux-ha.org/LSBResourceAgent

No, you are missing the point.

'mon start' returns 1, because Mon is already running, as it should be,
since this is the active node. The question is: why is Heartbeat trying
to start Mon and all other resources, while it already runs all of them?

Thank you.

Best regards,

Martijn Grendelman
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Cannot create group containing drbd using HB GUI

2007-05-08 Thread Lars Marowsky-Bree
On 2007-05-07T10:47:07, Doug Knight <[EMAIL PROTECTED]> wrote:

> Is bugzilla available today? When I try to access the site, I've gotten
> "page not found" and also a message that it is being merged with another site?

Just follow the link from that redirect page.


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] NFS server not started by heartbeat

2007-05-08 Thread Lars Marowsky-Bree
On 2007-05-08T09:15:25, Martijn Grendelman <[EMAIL PROTECTED]> wrote:

> No, you are missing the point.
> 
> 'mon start' returns 1, because Mon is already running, as it should be,
> since this is the active node. The question is: why is Heartbeat trying
> to start Mon and all other resources, while it already runs all of them?

No. "start" must succeed if it is already running. (idempotent) Said
script is broken.
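
A quick way to check an init script against this requirement, as a sketch
(using the mon script from this thread; the exit-code expectations follow the
LSBResourceAgent page):

/etc/init.d/mon start;  echo "start  -> $?"   # expect 0
/etc/init.d/mon start;  echo "start  -> $?"   # expect 0 again, not 1
/etc/init.d/mon status; echo "status -> $?"   # expect 0 while running
/etc/init.d/mon stop;   echo "stop   -> $?"   # expect 0
/etc/init.d/mon stop;   echo "stop   -> $?"   # expect 0 again (stop is idempotent too)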


Sincerely,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] NFS server not started by heartbeat

2007-05-08 Thread Lars Marowsky-Bree
On 2007-05-07T11:28:44, Dave Dykstra <[EMAIL PROTECTED]> wrote:

> It's probably just my ignorance as to why this isn't good enough,
> but I would think that the cluster software could assume that services
> it started were running and that services it stopped were stopped.

heartbeat v1 doesn't maintain enough state for this. And "start" must
succeed if already started - this remains true with v2.


Sincerely,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Advice with heartbeat on RHAS4

2007-05-08 Thread Peter Sørensen
Hi,

I have been playing around with heartbeat version 2.04-1 and drbd-8.0.0
to set up a highly available MySQL server. In my test setup I use 2 VMware
servers running RHAS4 (2.6.9-42).

MySQL is only started on one of the nodes and uses a shared IP address.
Shutting down the active server, or just heartbeat => the second node takes
over and starts MySQL. This works OK.

Now I want to do some monitoring of the MySQL application and started
reading about mon, but at the same time it came to my attention that the
2.0.8 version of heartbeat has some features to do that, along with a GUI
interface.

I started to compile the 2.0.8 version and ran into a lot of problems with
the installed Python version (2.3.4). I tried to upgrade it and now
everything is messed up.

So now I am going back to scratch with a clean RHAS install (2.6.9-42) and
Python 2.3.4.

I would like to know if it is possible to get RPMs to install all the
necessary packages, and if so, where do I find them?

If not, what do I need to compile and install to make it work?

The heartbeat-2.0.8 package, of course, but this seems to require a newer
version of Python (for the GUI interface) than the one I have installed.

Where can I find all the dependencies to make it work?

Regards


Peter Sorensen/University of Southern Denmark/Email: [EMAIL PROTECTED]
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Best effort HA

2007-05-08 Thread Kai Bjørnstad
Unfortunately, the environment the HA server/cluster I am trying to
configure does not really fit with a grouping of IP/filesystem/LSB.
In short: all the LSB services should be available on the same IP, and there is
not necessarily a mapping between the filesystems and the LSB scripts (so I
just have to play it safe).


The ruleset is not that complicated really, it's just a lot of them :-)

- The IP group has co-location and order on
- The Filesystem and LSB groups have co-location on and order off

- Colocation between all Filesystems and the IP-group
- Colocation between all LSB scripts and the IP-group
- Colocation between all LSB scripts and all Filesystems

- Start order from all LSB scripts to all Filesystems (this to enable restart)
- Start order between the groups: IP-group before Filesystems-group before
LSB-group

What I still do not understand is why a failed LSB script does not trigger a
failover. Neither does a failed Filesystem (it only stops all LSB scripts).
Only a failed IP triggers a failover.

Does this have anything to do with the "stickiness stuff"?? I  have
default-resource-stickiness = "100"
default-resource-failure-stickiness = "-INFINITY"



On Monday 07 May 2007 18:09:48 Yan Fitterer wrote:
> Haven't looked at too much detail (lots of resources / constraints in
> your cib...), but I would approach the problem differently:
>
> Make groups out of related IP / filesystem / service stacks.
>
> Then use the colocation constraints between services (across groups) to
> force things to move together (if it is indeed what you are trying to
> achieve).
>
> As well, I would start with maybe less resources, to make
> experimentation and troubleshooting easier...
>
> What you describe below would seem broadly possible to me.
>
> My 2c
>
> Yan
>
> Kai Bjørnstad wrote:
> > Hi,
> >
> > I am trying to set up an Active-Passive HA cluster doing "best effort" with
> > little success.
> > I am using Heartbeat 2.0.8
> >
> > I have a set of IP resources, a set of external (iSCSI) mount resources
> > and a set of LSB script resources.
> >
> > The goal of the configuration is to make Heartbeat do the following:
> > - All resources should run on the same node at all times
> > - If one or more of the IPs go down on, move all resources to the backup
> > node. If no backup node is available, shut everything down.
> > - If one or more of the mounts go down, move all resources (including
> > IPs) to the backup node. If no backup node is available shut down all the
> > LSB scripts and the failed mounts. Keep the mounts and IPs that did not
> > fail up. - If one or more of the LSB scripts fail, move all resources to
> > the backup node (including mounts and IPs). If the no backup node is
> > available shut down the failed LSB script(s) but keep all other resoruces
> > running (best effort) - Of course local restart should be attempted
> > before moving to backup node. - Start IPs and Mounts before the LSB
> > scripts
> > - Start/restart order of IPs should not be enforced
> > - Start/restart order of Mounts should not be enforced
> > - Start/restart order of LSBs should not be enforced
> >
> > My question is basically: Is this at all possible???
> >


-- 
Kai R. Bjørnstad
Senior Software Engineer
dir. +47 22 62 89 43
mob. +47 99 57 79 11
tel. +47 22 62 89 50
fax. +47 22 62 89 51
[EMAIL PROTECTED]

Olaf Helsets vei 6
N0621 Oslo, Norway

Scali - www.scali.com
Scaling the Linux Datacenter
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] setting is_managed to true triggers restart

2007-05-08 Thread Andrew Beekhof

On 5/8/07, Peter Kruse <[EMAIL PROTECTED]> wrote:

Hello all,

with Heartbeat v2.0.8 I have a configuration with the cib.xml as
attached.  After I started the resource groups I did:

crm_resource -p is_managed -r IPaddr1 -t primitive -v false

crm_mon shows:

IPaddr1 (q-leap::ocf:IP_address):   Started ql-xen-1 (unmanaged)

but nothing else happens.  But when I then do:

crm_resource -p is_managed -r IPaddr1 -t primitive -v true

resources are restarted.  Why?  Is that expected?  Is it a bug?
syslog attached.


because you've changed its parameters :-)

for this reason, and to avoid polluting the parameter namespace with
CRM options, we created meta attributes at some point.

you can operate on these by simply adding the --meta option to your
current command line.
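
For example, applied to the commands from the original post (a sketch; it
assumes a crm_resource new enough to understand --meta):

crm_resource --meta -p is_managed -r IPaddr1 -t primitive -v false
crm_resource --meta -p is_managed -r IPaddr1 -t primitive -v true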


however, there is a slight problem in that if is_managed was present
as a regular attribute, then instead of creating/modifying a "meta
attribute" when --meta was supplied, it would find and modify the
"regular attribute" and still cause the resource to be restarted.

we found that out yesterday and i'm working on getting that fixed...
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] setting is_managed to true triggers restart

2007-05-08 Thread Lars Marowsky-Bree
On 2007-05-08T10:03:13, Peter Kruse <[EMAIL PROTECTED]> wrote:

> Hello all,
> 
> with Heartbeat v2.0.8 I have a configuration with the cib.xml as
> attached.  After I started the resource groups I did:
> 
> crm_resource -p is_managed -r IPaddr1 -t primitive -v false
> 
> crm_mon shows:
> 
> IPaddr1 (q-leap::ocf:IP_address):   Started ql-xen-1 (unmanaged)
> 
> but nothing else happens.

Of course not; you set it to unmanaged, so it can't be managed ;-)

> But when I then do:
> 
> crm_resource -p is_managed -r IPaddr1 -t primitive -v true
> 
> resources are restarted.  Why?  Is that expected?  Is it a bug?
> syslog attached.

Use the --meta option to crm_resource; right now you're changing an
instance attribute which will trigger a restart.


Sincerely,
Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Xen-HA - SLES X86_64

2007-05-08 Thread Andrew Beekhof

grep ERROR logfile

try this for starters:

May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
(resource_qclvmsles02:stop:stderr) Error: the domain 'resource_qclvmsles02'
does not exist.
May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
(resource_qclvmsles02:stop:stdout) Domain resource_qclvmsles02 terminated
May  7 16:31:41 qclsles01 crmd: [22028]: WARN: process_lrm_event:lrm.c LRM
operation (35) stop_0 on resource_qclvmsles02 Error: (4) insufficient
privileges



On 5/7/07, Rene Purcell <[EMAIL PROTECTED]> wrote:

I would like to know if someone has tried the Novell setup described in
"http://www.novell.com/linux/technical_library/has.pdf" with an x86_64 arch?

I've tested this setup with a classic x86 arch and everything was ok... but
I double-checked my config and everything looks good, but my VM never starts on
its original node when it comes back online... and I can't find why!


here's the log from when my node1 comes back.. we can see the VM shutting down
and after that nothing happens on the other node..

May  7 16:31:25 qclsles01 cib: [22024]: info:
cib_diff_notify:notify.cUpdate (client: 6403, call:13):
0.65.1020 -> 0.65.1021 (ok)
May  7 16:31:25 qclsles01 tengine: [22591]: info:
te_update_diff:callbacks.cProcessing diff (cib_update):
0.65.1020 -> 0.65.1021
May  7 16:31:25 qclsles01 tengine: [22591]: info:
extract_event:events.cAborting on transient_attributes changes
May  7 16:31:25 qclsles01 tengine: [22591]: info: update_abort_priority:
utils.c Abort priority upgraded to 100
May  7 16:31:25 qclsles01 tengine: [22591]: info: update_abort_priority:
utils.c Abort action 0 superceeded by 2
May  7 16:31:26 qclsles01 cib: [22024]: info: activateCibXml:io.c CIB size
is 161648 bytes (was 158548)
May  7 16:31:26 qclsles01 cib: [22024]: info:
cib_diff_notify:notify.cUpdate (client: 6403, call:14):
0.65.1021 -> 0.65.1022 (ok)
May  7 16:31:26 qclsles01 haclient: on_event:evt:cib_changed
May  7 16:31:26 qclsles01 tengine: [22591]: info:
te_update_diff:callbacks.cProcessing diff (cib_update):
0.65.1021 -> 0.65.1022
May  7 16:31:26 qclsles01 tengine: [22591]: info:
match_graph_event:events.cAction resource_qclvmsles02_stop_0 (9)
confirmed
May  7 16:31:26 qclsles01 cib: [25889]: info: write_cib_contents:io.c Wrote
version 0.65.1022 of the CIB to disk (digest:
e71c271759371d44c4bad24d50b2421d)
May  7 16:31:39 qclsles01 kernel: xenbr0: port 3(vif12.0) entering disabled
state
May  7 16:31:39 qclsles01 kernel: device vif12.0 left promiscuous mode
May  7 16:31:39 qclsles01 kernel: xenbr0: port 3(vif12.0) entering disabled
state
May  7 16:31:39 qclsles01 logger: /etc/xen/scripts/vif-bridge: offline
XENBUS_PATH=backend/vif/12/0
May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove
XENBUS_PATH=backend/vbd/12/768
May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove
XENBUS_PATH=backend/vbd/12/832
May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove
XENBUS_PATH=backend/vbd/12/5632
May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/vif-bridge: brctl delif
xenbr0 vif12.0 failed
May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/vif-bridge: ifconfig
vif12.0 down failed
May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/vif-bridge: Successful
vif-bridge offline for vif12.0, bridge xenbr0.
May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/xen-hotplug-cleanup:
XENBUS_PATH=backend/vbd/12/5632
May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/xen-hotplug-cleanup:
XENBUS_PATH=backend/vbd/12/768
May  7 16:31:40 qclsles01 ifdown: vif12.0
May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/xen-hotplug-cleanup:
XENBUS_PATH=backend/vif/12/0
May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/xen-hotplug-cleanup:
XENBUS_PATH=backend/vbd/12/832
May  7 16:31:40 qclsles01 ifdown: Interface not available and no
configuration found.
May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
(resource_qclvmsles02:stop:stderr) Error: the domain 'resource_qclvmsles02'
does not exist.
May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
(resource_qclvmsles02:stop:stdout) Domain resource_qclvmsles02 terminated
May  7 16:31:41 qclsles01 crmd: [22028]: WARN: process_lrm_event:lrm.c LRM
operation (35) stop_0 on resource_qclvmsles02 Error: (4) insufficient
privileges
May  7 16:31:41 qclsles01 cib: [22024]: info: activateCibXml:io.c CIB size
is 164748 bytes (was 161648)
May  7 16:31:41 qclsles01 crmd: [22028]: info:
do_state_transition:fsa.cqclsles01: State transition
S_TRANSITION_ENGINE -> S_POLICY_ENGINE [
input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
May  7 16:31:41 qclsles01 tengine: [22591]: info:
te_update_diff:callbacks.cProcessing diff (cib_update):
0.65.1022 -> 0.65.1023
May  7 16:31:41 qclsles01 cib: [22024]: info:
cib_diff_notify:notify.cUpdate (client: 22028, call:100):
0.65.1022 -> 0.65.1023 (ok)
May  7 16:31:41 qclsles01 crmd: [22028]: info: do_state_transition:fsa.c All
2 cluster nodes are eligable to run resources.
May  7 16:31:41 qclsles01 tengine: [22591]: ERROR: match

Re: [Linux-HA] setting is_managed to true triggers restart

2007-05-08 Thread Peter Kruse

Hi,

thanks for your replies.

Andrew Beekhof wrote:

On 5/8/07, Peter Kruse <[EMAIL PROTECTED]> wrote:

for this reason, and to avoid polluting the parameter namespace with
CRM options, we created meta attributes at some point.

you can operate on these by simply adding the --meta option to your
current command line.


my crm_resource doesn't seem to have this option.




however, there is a slight problem in that if is_managed was present
as a regular attribute, then instead of creating/modifying a "meta
attribute" when --meta was supplied, it would find and modify the
"regular attribute" and still cause the resource to be restarted.

>

we found that out yesterday and i'm working on getting that fixed...


so until then I have to avoid having is_managed as a regular attribute
in my cib?

Peter

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Advice with heartbeat on RHAS4

2007-05-08 Thread Andrew Beekhof

you could try one of the fedora rpms at:
  http://software.opensuse.org/download/server:/ha-clustering

even just having a peek at the spec file might help.

On 5/8/07, Peter Sørensen <[EMAIL PROTECTED]> wrote:

Hi,

I have been playing around with heartbeat version 2.04-1 and drbd-8.0.0
to setup a High Available MySql server. In my testsetup I use 2 VmWare servers
running RHAS4 (2.6.9-42).

The Mysql is only started on one of the nodes and use a shared IP-address.
Shutting down the active server or just heartbeat => the second node takes
over and start mysql. This works OK.

Now I want to make some monitoring on the mysql application and started reading 
about mon
but at the same time it came to my attention that the 2.0.8 version of heartbeat
had some features to do that along with a GUI interface.

I started to do a compile of the 2.0.8 version and ran into a lot of problems 
with
the installed python version (2.3.4). Tried to upgrade this and now everything 
is
messed up.

So now I am going back to scratch with a clean RHAS install (2.6.9-42) and 
python 2.3.4.

I would like to know if it is possible to get RPM's to install all the 
nessacary packages
and if where do I find them?

If not what do I need to compile and install to make it work.

The heartbeat-2.0.8 package of course but this seems to require a newer version 
of
python ( the GUI interface) than the one I have installed

Where can I find all the dependencies to make it work?

Regards


Peter Sorensen/University og Southern Denmark/Email: [EMAIL PROTECTED]
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] setting is_managed to true triggers restart

2007-05-08 Thread Andrew Beekhof

On 5/8/07, Peter Kruse <[EMAIL PROTECTED]> wrote:

Hi,

thanks for your replies.

Andrew Beekhof wrote:
> On 5/8/07, Peter Kruse <[EMAIL PROTECTED]> wrote:
>
> for this reason, and to avoid polluting the parameter namespace with
> CRM options, we created meta attributes at some point.
>
> you can operate on these by simply adding the --meta option to your
> current command line.

my crm_resource doesn't seem to have this option.


doh :-(


> however, there is a slight problem in that if is_managed was present
> as a regular attribute, then instead of creating/modifying a "meta
> attribute" when --meta was supplied, it would find and modify the
> "regular attribute" and still cause the resource to be restarted.
 >
> we found that out yesterday and i'm working on getting that fixed...

so until then I have to avoid to have is_managed as a regular attribute
in my cib?


right

the blocks look the same, just use meta_attributes instead of
instance_attributes and put the options in there (and use cibadmin to
update them).
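
A minimal sketch of what that could look like (the ids are invented, and the
exact cibadmin flags should be checked against cibadmin --help for your
version):

cat > ipaddr1-meta.xml <<'EOF'
<primitive id="IPaddr1">
  <!-- the existing class/provider/type attributes of IPaddr1 stay as they are;
       only this meta_attributes block is new -->
  <meta_attributes id="IPaddr1_meta_attrs">
    <attributes>
      <nvpair id="IPaddr1_meta_is_managed" name="is_managed" value="true"/>
    </attributes>
  </meta_attributes>
</primitive>
EOF
cibadmin -o resources -M -x ipaddr1-meta.xml   # -M = modify, -x = XML file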
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] setting is_managed to true triggers restart

2007-05-08 Thread Peter Kruse

Hi,

Andrew Beekhof wrote:

the blocks look the same, just use meta_attributes instead of
instance_attributes and put the options in there (and use cibadmin to
update them).


That works, thanks!

Peter
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Xen-HA - SLES X86_64

2007-05-08 Thread Rene Purcell

On 5/8/07, Andrew Beekhof <[EMAIL PROTECTED]> wrote:


grep ERROR logfile

try this for starters:

May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
(resource_qclvmsles02:stop:stderr) Error: the domain
'resource_qclvmsles02'
does not exist.
May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
(resource_qclvmsles02:stop:stdout) Domain resource_qclvmsles02 terminated
May  7 16:31:41 qclsles01 crmd: [22028]: WARN: process_lrm_event:lrm.c LRM
operation (35) stop_0 on resource_qclvmsles02 Error: (4) insufficient
privileges



yup I saw that.. it's weird. Heartbeat shuts down the VM, then logs these
errors.. and if I clean up the resource it restarts on the correct node..
There must be something I missed lol


On 5/7/07, Rene Purcell <[EMAIL PROTECTED]> wrote:

> I would like to know if someone had tried the Novell setup described in
"
> http://www.novell.com/linux/technical_library/has.pdf"; with a x86_64
arch ?
>
> I've tested this setup with a classic x86 arch and everything was ok...
but
> I doublechecked my config and everything look good but my VM never start
on
> his original node when it come back online... and I can't find why!
>
>
> here's the log when my node1 come back.. we can see the VM shutting down
and
> after that nothing happend in the other node..
>
> May  7 16:31:25 qclsles01 cib: [22024]: info:
> cib_diff_notify:notify.cUpdate (client: 6403, call:13):
> 0.65.1020 -> 0.65.1021 (ok)
> May  7 16:31:25 qclsles01 tengine: [22591]: info:
> te_update_diff:callbacks.cProcessing diff (cib_update):
> 0.65.1020 -> 0.65.1021
> May  7 16:31:25 qclsles01 tengine: [22591]: info:
> extract_event:events.cAborting on transient_attributes changes
> May  7 16:31:25 qclsles01 tengine: [22591]: info: update_abort_priority:
> utils.c Abort priority upgraded to 100
> May  7 16:31:25 qclsles01 tengine: [22591]: info: update_abort_priority:
> utils.c Abort action 0 superceeded by 2
> May  7 16:31:26 qclsles01 cib: [22024]: info: activateCibXml:io.c CIB
size
> is 161648 bytes (was 158548)
> May  7 16:31:26 qclsles01 cib: [22024]: info:
> cib_diff_notify:notify.cUpdate (client: 6403, call:14):
> 0.65.1021 -> 0.65.1022 (ok)
> May  7 16:31:26 qclsles01 haclient: on_event:evt:cib_changed
> May  7 16:31:26 qclsles01 tengine: [22591]: info:
> te_update_diff:callbacks.cProcessing diff (cib_update):
> 0.65.1021 -> 0.65.1022
> May  7 16:31:26 qclsles01 tengine: [22591]: info:
> match_graph_event:events.cAction resource_qclvmsles02_stop_0 (9)
> confirmed
> May  7 16:31:26 qclsles01 cib: [25889]: info: write_cib_contents:io.cWrote
> version 0.65.1022 of the CIB to disk (digest:
> e71c271759371d44c4bad24d50b2421d)
> May  7 16:31:39 qclsles01 kernel: xenbr0: port 3(vif12.0) entering
disabled
> state
> May  7 16:31:39 qclsles01 kernel: device vif12.0 left promiscuous mode
> May  7 16:31:39 qclsles01 kernel: xenbr0: port 3(vif12.0) entering
disabled
> state
> May  7 16:31:39 qclsles01 logger: /etc/xen/scripts/vif-bridge: offline
> XENBUS_PATH=backend/vif/12/0
> May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove
> XENBUS_PATH=backend/vbd/12/768
> May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove
> XENBUS_PATH=backend/vbd/12/832
> May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove
> XENBUS_PATH=backend/vbd/12/5632
> May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/vif-bridge: brctl
delif
> xenbr0 vif12.0 failed
> May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/vif-bridge: ifconfig
> vif12.0 down failed
> May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/vif-bridge:
Successful
> vif-bridge offline for vif12.0, bridge xenbr0.
> May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/xen-hotplug-cleanup:
> XENBUS_PATH=backend/vbd/12/5632
> May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/xen-hotplug-cleanup:
> XENBUS_PATH=backend/vbd/12/768
> May  7 16:31:40 qclsles01 ifdown: vif12.0
> May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/xen-hotplug-cleanup:
> XENBUS_PATH=backend/vif/12/0
> May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/xen-hotplug-cleanup:
> XENBUS_PATH=backend/vbd/12/832
> May  7 16:31:40 qclsles01 ifdown: Interface not available and no
> configuration found.
> May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
> (resource_qclvmsles02:stop:stderr) Error: the domain
'resource_qclvmsles02'
> does not exist.
> May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
> (resource_qclvmsles02:stop:stdout) Domain resource_qclvmsles02
terminated
> May  7 16:31:41 qclsles01 crmd: [22028]: WARN: process_lrm_event:lrm.cLRM
> operation (35) stop_0 on resource_qclvmsles02 Error: (4) insufficient
> privileges
> May  7 16:31:41 qclsles01 cib: [22024]: info: activateCibXml:io.c CIB
size
> is 164748 bytes (was 161648)
> May  7 16:31:41 qclsles01 crmd: [22028]: info:
> do_state_transition:fsa.cqclsles01: State transition
> S_TRANSITION_ENGINE -> S_POLICY_ENGINE [
> input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
> May  7 16:31:41 qclsles01 tengine: [22591

Re: [Linux-HA] Best effort HA

2007-05-08 Thread Yan Fitterer
Comments in-line

>>> On Tue, May 8, 2007 at 10:13 AM, in message <[EMAIL PROTECTED]>,
Kai Bjørnstad <[EMAIL PROTECTED]> wrote: 
> Unfortunately the the environment the ha server/cluster I am trying to 
> configure does not really fit with a grouping of IP/filsystem/lsb.
> In short: All the LSB services should be available on the same IP and there 
> is 
> not necessarily a mapping between the filesystems and the LSB script (so I 
> just have to play it safe) 
> 
> 
> The ruleset is not that complicated really, it's just a lot of them :-)
> 
> - The IP group has co-location and order on
> - The Filesystem and LSB group has co-location on and order off
> 
> - Colocation between all Filesystems to the IP-group
> - Colocation between all LSB scripts to the IP-group
> - Colocation between all LSB scrpts and all Filesystems
> 
> - Startorder from all LSB scripts to all Filesystems (this to enable 
> restart)
> - Startorder between the groups: IP-group before Filesystems-group before 
> LSB-group

To me, this looks like you need _everything_ in a single group...

> 
> What I still do not understand is that a failed LSB-script does not trigger 
> a 
> failover?? Neither does a failed Filesystem (it only stops all LSB-scripts).
> Only a failed IP trigger a failover.

Don't forget: In a large dependency "group", when one resource wants to move, 
many others are trying to stay, depending
on your default-resource-stickiness and default-resource-failure-stickiness.

To understand what's going on, you need to get some insight into the scores
allocated to each node / resource by the Policy Engine.

On a running configuration (i.e. the currently running configuration), you can
get that (or at least some of it, not sure...) with crm_verify -L -VV
(possibly more Vs required).

To analyze retrospectively how decisions were made on a given transition, use 
ptest on the relevant file from
/var/lib/heartbeat/pengine. Again, you'll need to crank up the verbosity.
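
For example (a sketch; the pe-input file name is invented, and ptest's option
names vary a little between releases):

# placement scores for the live CIB; each extra -V adds detail
crm_verify -L -VVVV
# replay a saved transition input from the pengine directory
ptest -VVVV -x /var/lib/heartbeat/pengine/pe-input-123.xml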

> 
> Does this have anything to do with the "stickiness stuff"?? I  have
> default-resource-stickiness = "100"
> default-resource-failure-stickiness = "-INFINITY"

Both of those will play a role, in particular the -failure- one: in this case,
it means failed services should move at the first failure.

> 
> 
> 
> On Monday 07 May 2007 18:09:48 Yan Fitterer wrote:
>> Haven't looked at too much detail (lots of resources / constraints in
>> your cib...), but I would approach the problem differently:
>>
>> Make groups out of related IP / filesystem / service stacks.
>>
>> Then use the colocation constraints between services (across groups) to
>> force things to move together (if it is indeed what you are trying to
>> achieve).
>>
>> As well, I would start with maybe less resources, to make
>> experimentation and troubleshooting easier...
>>
>> What you describe below would seem broadly possible to me.
>>
>> My 2c
>>
>> Yan
>>
>> Kai Bjørnstad wrote:
>> > Hi,
>> >
>> > I am trying to set up an Active-Passive HA cluster doing "best effort" with
>> > little success.
>> > I am using Heartbeat 2.0.8
>> >
>> > I have a set of IP resources, a set of external (iSCSI) mount resources
>> > and a set of LSB script resources.
>> >
>> > The goal of the configuration is to make Heartbeat do the following:
>> > - All resources should run on the same node at all times
>> > - If one or more of the IPs go down on, move all resources to the backup
>> > node. If no backup node is available, shut everything down.
>> > - If one or more of the mounts go down, move all resources (including
>> > IPs) to the backup node. If no backup node is available shut down all the
>> > LSB scripts and the failed mounts. Keep the mounts and IPs that did not
>> > fail up. - If one or more of the LSB scripts fail, move all resources to
>> > the backup node (including mounts and IPs). If the no backup node is
>> > available shut down the failed LSB script(s) but keep all other resoruces
>> > running (best effort) - Of course local restart should be attempted
>> > before moving to backup node. - Start IPs and Mounts before the LSB
>> > scripts
>> > - Start/restart order of IPs should not be enforced
>> > - Start/restart order of Mounts should not be enforced
>> > - Start/restart order of LSBs should not be enforced
>> >
>> > My question is basically: Is this at all possible???
>> >
> 
> 
> -- 
> Kai R. Bjørnstad
> Senior Software Engineer
> dir. +47 22 62 89 43
> mob. +47 99 57 79 11
> tel. +47 22 62 89 50
> fax. +47 22 62 89 51
> [EMAIL PROTECTED]
> 
> Olaf Helsets vei 6
> N0621 Oslo, Norway
> 
> Scali - www.scali.com
> Scaling the Linux Datacenter
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Cannot create group containing drbd using HB GUI

2007-05-08 Thread Doug Knight
Hi Andrew,
I opened Bugzilla 1572, and included as much as I could find that I felt
was relevant. It has the complete logs, cibadmin -Q output, diffs of two
C files that had to be changed to build, and the two OCF scripts I'm
using that have had minor modifications made to them (mostly for
additional debug output). If there is anything else that would be
useful, let me know. In the meantime I'm going to rebuild heartbeat from
scratch on node2, since that seems to be where the problem starts.

Doug

On Mon, 2007-05-07 at 10:13 +0200, Andrew Beekhof wrote:

> can you open a bug for this and include the _complete_ logs as well as
> which version you're running (as I no longer recall)
> 
> On 5/4/07, Doug Knight <[EMAIL PROTECTED]> wrote:
> > It seems the two nodes in my cluster are behaving differently from each
> > other. First, some simplification/mapping for node names to compare to
> > the attached logs:
> >
> > node1 - arc-tkincaidlx
> > node2 - arc-dknightlx
> >
> > And references to the resource group include Filesystem, pgsql, and
> > IPaddr colocated and ordered resources
> >
> > Heartbeat shutdowns and restarts on node1, regardless of whether it is
> > DC, has active resources, etc, all perform as expected. If the resources
> > are on node1, they migrate successfully to node2. If the location
> > constraint sets the resources to node1, and node1 re-enters the cluster,
> > all resources migrate back. Its when ANY heartbeat stop, start, restart,
> > occurs on node2 that things break. For instance:
> >
> > node1 is DC, master rsc_drbd_7788:1, group active
> > node2 is slave rsc_drbd_7788:0 ONLY
> > /etc/init.d/heartbeat stop is executed on node2
> > node1 tries to execute a demote on rsc_drbd_7788:1
> > demote fails because group is active on node1, Filesystem is holding the
> > drbd device open via mount point
> > heartbeat continues to loop trying to demote on node1, about 9 times a
> > second
> > heartbeat on node2, where stop was executed, loops calling
> > notify/pre/demote on rsc_drbd_7788:0, about once a second
> >
> > It takes a manual kill of heartbeat to get things back in order, and in
> > the mean time drbd goes split brain, or so it seems by what I have to do
> > to manually get drbd connected again. So, the problem is that heartbeat
> > thinks it needs to demote the master rsc_drbd_7788:1 resource, and even
> > if this was correct, it doesn't handle the group resources that are
> > dependent on it and ordered/colocated with it. The attached logs cover
> > the entire sequence of events during the shutdown of heartbeat on node2.
> > Times of significance to help in looking at the logs are:
> >
> > Node2 HB shutdown started at 14:03:31
> > Manually started killing HB on node2 at 14:05:33
> > Node2 completed HB shutdown at 14:06:03
> > Node2 Timer pop at 14:06:33
> > Node1 HB shutdown to try to alleviate looping at 14:07:51
> >
> >  The logs are kind of large due to the looping (I deleted most of the
> > looping, so if more info is needed I can provide the complete logs), and
> > I've zipped them up, so if this email exceeds the list's size limits I
> > respectfully ask the moderator to allow it to go through.
> >
> > Doug Knight
> > WSI, Inc.
> >
> >
> > > > > > digging into that now. If I shutdown the node that does not have the
> > > > > > active resources, the following happens:
> > > > > >
> > > > > > (State: DC on active node1, running drbd master and group resources)
> > > > > > shutdown node2
> > > > > > demote attempted on node1 for drbd master,
> > > > >
> > > > > Why demote? It's master running on a good node.
> > > > >
> > > >
> > > > Don't know, this is what I observed. I wondered why it would do a demote
> > > > when this node is already OK.
> > > >
> > > > > > no attempt at halting groups
> > > > > > resources that depend on drbd
> > > > >
> > > > > Why should the resources be stopped? You shutdown a node which
> > > > > doesn't have any resources.
> > > > >
> > > >
> >
> > truncated...
> >
> > ___
> > Linux-HA mailing list
> > Linux-HA@lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> >
> >
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
> 
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Xen-HA - SLES X86_64

2007-05-08 Thread Rene Purcell

On 5/8/07, Rene Purcell <[EMAIL PROTECTED]> wrote:




On 5/8/07, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
>
> grep ERROR logfile
>
> try this for starters:
>
> May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
> (resource_qclvmsles02:stop:stderr) Error: the domain
> 'resource_qclvmsles02'
> does not exist.
> May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
> (resource_qclvmsles02:stop:stdout) Domain resource_qclvmsles02
> terminated
> May  7 16:31:41 qclsles01 crmd: [22028]: WARN: process_lrm_event:lrm.cLRM
> operation (35) stop_0 on resource_qclvmsles02 Error: (4) insufficient
> privileges


yup I saw that.. it's weird. Heartbeat shutdown the vm, then say these
errors.. and if I cleanup the ressource he restart on the correct node..
There should be something I missed lol


On 5/7/07, Rene Purcell <[EMAIL PROTECTED] > wrote:
> > I would like to know if someone had tried the Novell setup described
> in "
> > http://www.novell.com/linux/technical_library/has.pdf " with a x86_64
> arch ?
> >
> > I've tested this setup with a classic x86 arch and everything was
> ok... but
> > I doublechecked my config and everything look good but my VM never
> start on
> > his original node when it come back online... and I can't find why!
> >
> >
> > here's the log when my node1 come back.. we can see the VM shutting
> down and
> > after that nothing happend in the other node..
> >
> > May  7 16:31:25 qclsles01 cib: [22024]: info:
> > cib_diff_notify:notify.cUpdate (client: 6403, call:13):
> > 0.65.1020 -> 0.65.1021 (ok)
> > May  7 16:31:25 qclsles01 tengine: [22591]: info:
> > te_update_diff:callbacks.cProcessing diff (cib_update):
> > 0.65.1020 -> 0.65.1021
> > May  7 16:31:25 qclsles01 tengine: [22591]: info:
> > extract_event:events.cAborting on transient_attributes changes
> > May  7 16:31:25 qclsles01 tengine: [22591]: info:
> update_abort_priority:
> > utils.c Abort priority upgraded to 100
> > May  7 16:31:25 qclsles01 tengine: [22591]: info:
> update_abort_priority:
> > utils.c Abort action 0 superceeded by 2
> > May  7 16:31:26 qclsles01 cib: [22024]: info: activateCibXml: io.c CIB
> size
> > is 161648 bytes (was 158548)
> > May  7 16:31:26 qclsles01 cib: [22024]: info:
> > cib_diff_notify:notify.cUpdate (client: 6403, call:14):
> > 0.65.1021 -> 0.65.1022 (ok)
> > May  7 16:31:26 qclsles01 haclient: on_event:evt:cib_changed
> > May  7 16:31:26 qclsles01 tengine: [22591]: info:
> > te_update_diff:callbacks.cProcessing diff (cib_update):
> > 0.65.1021 -> 0.65.1022
> > May  7 16:31:26 qclsles01 tengine: [22591]: info:
> > match_graph_event: events.cAction resource_qclvmsles02_stop_0 (9)
> > confirmed
> > May  7 16:31:26 qclsles01 cib: [25889]: info: write_cib_contents:io.cWrote
> > version 0.65.1022 of the CIB to disk (digest:
> > e71c271759371d44c4bad24d50b2421d)
> > May  7 16:31:39 qclsles01 kernel: xenbr0: port 3(vif12.0) entering
> disabled
> > state
> > May  7 16:31:39 qclsles01 kernel: device vif12.0 left promiscuous mode
> > May  7 16:31:39 qclsles01 kernel: xenbr0: port 3( vif12.0) entering
> disabled
> > state
> > May  7 16:31:39 qclsles01 logger: /etc/xen/scripts/vif-bridge: offline
> > XENBUS_PATH=backend/vif/12/0
> > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove
> > XENBUS_PATH=backend/vbd/12/768
> > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove
> > XENBUS_PATH=backend/vbd/12/832
> > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove
> > XENBUS_PATH=backend/vbd/12/5632
> > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/vif-bridge: brctl
> delif
> > xenbr0 vif12.0 failed
> > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/vif-bridge:
> ifconfig
> > vif12.0 down failed
> > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/vif-bridge:
> Successful
> > vif-bridge offline for vif12.0, bridge xenbr0.
> > May  7 16:31:40 qclsles01 logger:
> /etc/xen/scripts/xen-hotplug-cleanup:
> > XENBUS_PATH=backend/vbd/12/5632
> > May  7 16:31:40 qclsles01 logger:
> /etc/xen/scripts/xen-hotplug-cleanup:
> > XENBUS_PATH=backend/vbd/12/768
> > May  7 16:31:40 qclsles01 ifdown: vif12.0
> > May  7 16:31:40 qclsles01 logger:
> /etc/xen/scripts/xen-hotplug-cleanup:
> > XENBUS_PATH=backend/vif/12/0
> > May  7 16:31:40 qclsles01 logger:
> /etc/xen/scripts/xen-hotplug-cleanup:
> > XENBUS_PATH=backend/vbd/12/832
> > May  7 16:31:40 qclsles01 ifdown: Interface not available and no
> > configuration found.
> > May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
> > (resource_qclvmsles02:stop:stderr) Error: the domain
> 'resource_qclvmsles02'
> > does not exist.
> > May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
> > (resource_qclvmsles02:stop:stdout) Domain resource_qclvmsles02
> terminated
> > May  7 16:31:41 qclsles01 crmd: [22028]: WARN: process_lrm_event:lrm.cLRM
> > operation (35) stop_0 on resource_qclvmsles02 Error: (4) insufficient
> > privileges
> > May  7 16:31:41 qclsles01 cib: [22024]: info: activateCibXml:io.c CIB
> size
> > is

SV: [Linux-HA] Advice with heartbeat on RHAS4

2007-05-08 Thread Peter Sørensen
Hi,

I tried to download the fedora rpms but get the following:

error: Failed dependencies:
fedora-usermgmt is needed by heartbeat-2.0.9-43.1.i386
libc.so.6(GLIBC_2.4) is needed by heartbeat-2.0.9-43.1.i386
libcrypto.so.6 is needed by heartbeat-2.0.9-43.1.i386
libgnutls.so.13 is needed by heartbeat-2.0.9-43.1.i386
libgnutls.so.13(GNUTLS_1_3) is needed by heartbeat-2.0.9-43.1.i386
libpam.so.0(LIBPAM_1.0) is needed by heartbeat-2.0.9-43.1.i386
libpils.so.1 is needed by heartbeat-2.0.9-43.1.i386
libstonith.so.1 is needed by heartbeat-2.0.9-43.1.i386
rtld(GNU_HASH) is needed by heartbeat-2.0.9-43.1.i386

Most of them can be solved, but I feel a little uncertain about fedora-usermgmt.
Has anyone tried this on RHEL4?

Regards

Peter Sorensen/University of Southern Denmark/email:[EMAIL PROTECTED]   


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of Andrew Beekhof
Sent: 8 May 2007 11:44
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Advice with heartbeat on RHAS4

you could try one of the fedora rpms at:
   http://software.opensuse.org/download/server:/ha-clustering

even just having a peak at the spec file might help.

On 5/8/07, Peter Sørensen <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have been playing around with heartbeat version 2.04-1 and 
> drbd-8.0.0 to setup a High Available MySql server. In my testsetup I 
> use 2 VmWare servers running RHAS4 (2.6.9-42).
>
> The Mysql is only started on one of the nodes and use a shared IP-address.
> Shutting down the active server or just heartbeat => the second node 
> takes over and start mysql. This works OK.
>
> Now I want to make some monitoring on the mysql application and 
> started reading about mon but at the same time it came to my attention 
> that the 2.0.8 version of heartbeat had some features to do that along with a 
> GUI interface.
>
> I started to do a compile of the 2.0.8 version and ran into a lot of 
> problems with the installed python version (2.3.4). Tried to upgrade 
> this and now everything is messed up.
>
> So now I am going back to scratch with a clean RHAS install (2.6.9-42) and 
> python 2.3.4.
>
> I would like to know if it is possible to get RPM's to install all the 
> nessacary packages and if where do I find them?
>
> If not what do I need to compile and install to make it work.
>
> The heartbeat-2.0.8 package of course but this seems to require a 
> newer version of python ( the GUI interface) than the one I have 
> installed
>
> Where can I find all the dependencies to make it work?
>
> Regards
>
>
> Peter Sorensen/University og Southern Denmark/Email: [EMAIL PROTECTED] 
> ___
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Xen-HA - SLES X86_64

2007-05-08 Thread Andrew Beekhof

On 5/8/07, Rene Purcell <[EMAIL PROTECTED]> wrote:

On 5/8/07, Rene Purcell <[EMAIL PROTECTED]> wrote:
>
>
>
> On 5/8/07, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >
> > grep ERROR logfile
> >
> > try this for starters:
> >
> > May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
> > (resource_qclvmsles02:stop:stderr) Error: the domain
> > 'resource_qclvmsles02'
> > does not exist.
> > May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
> > (resource_qclvmsles02:stop:stdout) Domain resource_qclvmsles02
> > terminated
> > May  7 16:31:41 qclsles01 crmd: [22028]: WARN: process_lrm_event:lrm.cLRM
> > operation (35) stop_0 on resource_qclvmsles02 Error: (4) insufficient
> > privileges
>
>
> yup I saw that.. it's weird. Heartbeat shutdown the vm, then say these
> errors.. and if I cleanup the ressource he restart on the correct node..
> There should be something I missed lol
>
>
> On 5/7/07, Rene Purcell <[EMAIL PROTECTED] > wrote:
> > > I would like to know if someone had tried the Novell setup described
> > in "
> > > http://www.novell.com/linux/technical_library/has.pdf " with a x86_64
> > arch ?
> > >
> > > I've tested this setup with a classic x86 arch and everything was
> > ok... but
> > > I doublechecked my config and everything look good but my VM never
> > start on
> > > his original node when it come back online... and I can't find why!
> > >
> > >
> > > here's the log when my node1 come back.. we can see the VM shutting
> > down and
> > > after that nothing happend in the other node..
> > >
> > > May  7 16:31:25 qclsles01 cib: [22024]: info:
> > > cib_diff_notify:notify.cUpdate (client: 6403, call:13):
> > > 0.65.1020 -> 0.65.1021 (ok)
> > > May  7 16:31:25 qclsles01 tengine: [22591]: info:
> > > te_update_diff:callbacks.cProcessing diff (cib_update):
> > > 0.65.1020 -> 0.65.1021
> > > May  7 16:31:25 qclsles01 tengine: [22591]: info:
> > > extract_event:events.cAborting on transient_attributes changes
> > > May  7 16:31:25 qclsles01 tengine: [22591]: info:
> > update_abort_priority:
> > > utils.c Abort priority upgraded to 100
> > > May  7 16:31:25 qclsles01 tengine: [22591]: info:
> > update_abort_priority:
> > > utils.c Abort action 0 superceeded by 2
> > > May  7 16:31:26 qclsles01 cib: [22024]: info: activateCibXml: io.c CIB
> > size
> > > is 161648 bytes (was 158548)
> > > May  7 16:31:26 qclsles01 cib: [22024]: info:
> > > cib_diff_notify:notify.cUpdate (client: 6403, call:14):
> > > 0.65.1021 -> 0.65.1022 (ok)
> > > May  7 16:31:26 qclsles01 haclient: on_event:evt:cib_changed
> > > May  7 16:31:26 qclsles01 tengine: [22591]: info:
> > > te_update_diff:callbacks.cProcessing diff (cib_update):
> > > 0.65.1021 -> 0.65.1022
> > > May  7 16:31:26 qclsles01 tengine: [22591]: info:
> > > match_graph_event: events.cAction resource_qclvmsles02_stop_0 (9)
> > > confirmed
> > > May  7 16:31:26 qclsles01 cib: [25889]: info: write_cib_contents:io.cWrote
> > > version 0.65.1022 of the CIB to disk (digest:
> > > e71c271759371d44c4bad24d50b2421d)
> > > May  7 16:31:39 qclsles01 kernel: xenbr0: port 3(vif12.0) entering
> > disabled
> > > state
> > > May  7 16:31:39 qclsles01 kernel: device vif12.0 left promiscuous mode
> > > May  7 16:31:39 qclsles01 kernel: xenbr0: port 3( vif12.0) entering
> > disabled
> > > state
> > > May  7 16:31:39 qclsles01 logger: /etc/xen/scripts/vif-bridge: offline
> > > XENBUS_PATH=backend/vif/12/0
> > > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove
> > > XENBUS_PATH=backend/vbd/12/768
> > > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove
> > > XENBUS_PATH=backend/vbd/12/832
> > > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove
> > > XENBUS_PATH=backend/vbd/12/5632
> > > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/vif-bridge: brctl
> > delif
> > > xenbr0 vif12.0 failed
> > > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/vif-bridge:
> > ifconfig
> > > vif12.0 down failed
> > > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/vif-bridge:
> > Successful
> > > vif-bridge offline for vif12.0, bridge xenbr0.
> > > May  7 16:31:40 qclsles01 logger:
> > /etc/xen/scripts/xen-hotplug-cleanup:
> > > XENBUS_PATH=backend/vbd/12/5632
> > > May  7 16:31:40 qclsles01 logger:
> > /etc/xen/scripts/xen-hotplug-cleanup:
> > > XENBUS_PATH=backend/vbd/12/768
> > > May  7 16:31:40 qclsles01 ifdown: vif12.0
> > > May  7 16:31:40 qclsles01 logger:
> > /etc/xen/scripts/xen-hotplug-cleanup:
> > > XENBUS_PATH=backend/vif/12/0
> > > May  7 16:31:40 qclsles01 logger:
> > /etc/xen/scripts/xen-hotplug-cleanup:
> > > XENBUS_PATH=backend/vbd/12/832
> > > May  7 16:31:40 qclsles01 ifdown: Interface not available and no
> > > configuration found.
> > > May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
> > > (resource_qclvmsles02:stop:stderr) Error: the domain
> > 'resource_qclvmsles02'
> > > does not exist.
> > > May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
> > > (resource_qclvmsles02:stop:stdout) Domain resource

Re: [Linux-HA] NFS server not started by heartbeat

2007-05-08 Thread Martijn Grendelman
Hi all,

>> No, you are missing the point.
>>
>> 'mon start' returns 1, because Mon is already running, as it should be,
>> since this is the active node. The question is: why is Heartbeat trying
>> to start Mon and all other resources, while it already runs all of them?
> 
> No. "start" must succeed if it is already running. (idempotent) Said
> script is broken.

Thanks for the pointers. I have written several 'intermediate' resource
scripts, to catch broken init-scripts, and Heartbeat's behaviour is back
to what it should be.
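
A minimal sketch of such a wrapper (hypothetical; /etc/init.d/mon is only an
example of a wrapped script, the actual scripts were not posted):

#!/bin/sh
# Hypothetical wrapper that makes a non-idempotent init script safe for
# heartbeat: start/stop report success when the service is already in the
# requested state, so takeover of already-running resources does not fail.
REAL=/etc/init.d/mon    # example path of the wrapped init script

case "$1" in
  start)
    "$REAL" status >/dev/null 2>&1 && exit 0   # already running -> report OK
    exec "$REAL" start
    ;;
  stop)
    "$REAL" status >/dev/null 2>&1 || exit 0   # already stopped -> report OK
    exec "$REAL" stop
    ;;
  status|restart)
    exec "$REAL" "$1"
    ;;
  *)
    echo "Usage: $0 {start|stop|restart|status}" >&2
    exit 1
    ;;
esac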

Best regards,

Martijn Grendelman
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Advice with heartbeat on RHAS4

2007-05-08 Thread Andrew Beekhof

On 5/8/07, Peter Sørensen <[EMAIL PROTECTED]> wrote:

Hi,

I tried to download the fedora rpms but get the following:

error: Failed dependencies:
fedora-usermgmt is needed by heartbeat-2.0.9-43.1.i386
libc.so.6(GLIBC_2.4) is needed by heartbeat-2.0.9-43.1.i386
libcrypto.so.6 is needed by heartbeat-2.0.9-43.1.i386
libgnutls.so.13 is needed by heartbeat-2.0.9-43.1.i386
libgnutls.so.13(GNUTLS_1_3) is needed by heartbeat-2.0.9-43.1.i386
libpam.so.0(LIBPAM_1.0) is needed by heartbeat-2.0.9-43.1.i386
libpils.so.1 is needed by heartbeat-2.0.9-43.1.i386
libstonith.so.1 is needed by heartbeat-2.0.9-43.1.i386
rtld(GNU_HASH) is needed by heartbeat-2.0.9-43.1.i386

Most of them can be solved but I feel a little uncertain on the fedora-usermgtm
Anyone tried this on RHEL4?


i think thats just for the pre and post (un)install scripts.  i'd
solve the rest and use --force
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] stonith setup

2007-05-08 Thread Peter Clapham

John Hearns wrote:

Jure Pečar wrote:

Hi all,

I'm trying to configure iLO stonith with heartbeat 2.0.8. Calling
external/riloe on the command line with all the RI_* variables in the
environment works fine. Calling stonith -t external/riloe with RI_*
variables and -S also reports OK. However, I'm having trouble
figuring out how to put these variables in a file that I can call
with -F. I haven't found any examples or documentation describing the
syntax ...
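
(In other words, something along these lines reportedly works, while the -F
file variant does not; a sketch based purely on the description above:)

RI_HOST=host RI_HOSTRI=host-ilo RI_LOGIN=Administrator RI_PASSWORD=password \
  stonith -t external/riloe -S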


I figured that if I put all the variables in a single line like this:

RI_HOST=host RI_HOSTRI=host-ilo RI_LOGIN=Administrator 
RI_PASSWORD=password


stonith doesn't complain about syntax but says "device not 
accessible". Why?




Jure, can you access the ILO device using ipmitool?
Install ipmitool and try something like:

ipmitool -I lan -H host-ilo -U Administrator -P password sdr


(or 'chassis power status' instead of sdr)
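
That is, with the same assumed credentials as above:

ipmitool -I lan -H host-ilo -U Administrator -P password chassis power status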



John is spot on. If the ipmi output works as expected then there's a 
simple X4200 stonith script previously posted here that should help.


Pete
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] IPMI STONITH

2007-05-08 Thread Peter Clapham

Hannes Dorbath wrote:
http://lists.community.tummy.com/pipermail/linux-ha-dev/2003-August/006425.html 



Is there anything newer/better available, or this is everything that 
is out there?



Thanks.



Hello Hannes

There have been a few posted here in the list (X4200 stonith, which I 
can mail directly if required) and I believe that an openhpi version is 
in the development process.


Pete
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Xen-HA - SLES X86_64

2007-05-08 Thread Rene Purcell

Ok.. so there is something I probably don't understand.. each node should have
the right privileges (if we are talking about filesystem permissions), because
when one of my two nodes fails (node1), the resource (vm1) can start on the
available node (node2). The problem happens when the failed node comes back
online (node1).. the resource (vm1) is supposed to shut down on node2 and
restart on node1, isn't it?

We already tried this setup with 32-bit SLES and everything was working.. I
just want to know where the problem can be.. is it my configuration? It's
supposed to be exactly the same as my old setup.. is it the 64-bit version
of SLES?


When I set, in the default config:
symmetric cluster = yes
default resource stickiness = INFINITY

and I add a place constraint with score INFINITY, expression #uname eq node1,

isn't my resource supposed to go back to its original node? Like if I
set the auto_failback option in heartbeat v1?
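
(As a sketch, the constraint described above would look roughly like this in
the CIB; the ids are invented and the resource id is taken from the logs
earlier in the thread.)

cat > place_vm1.xml <<'EOF'
<rsc_location id="place_qclvmsles02_on_node1" rsc="resource_qclvmsles02">
  <rule id="place_qclvmsles02_rule" score="INFINITY">
    <expression id="place_qclvmsles02_expr"
                attribute="#uname" operation="eq" value="node1"/>
  </rule>
</rsc_location>
EOF
cibadmin -o constraints -C -x place_vm1.xml   # -C = create; double-check flags with cibadmin --help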

I'm sorry if my previous post was not clear..


On 5/8/07, Andrew Beekhof <[EMAIL PROTECTED]> wrote:


On 5/8/07, Rene Purcell <[EMAIL PROTECTED]> wrote:
> On 5/8/07, Rene Purcell <[EMAIL PROTECTED]> wrote:
> >
> >
> >
> > On 5/8/07, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> > >
> > > grep ERROR logfile
> > >
> > > try this for starters:
> > >
> > > May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
> > > (resource_qclvmsles02:stop:stderr) Error: the domain
> > > 'resource_qclvmsles02'
> > > does not exist.
> > > May  7 16:31:41 qclsles01 lrmd: [5020]: info: RA output:
> > > (resource_qclvmsles02:stop:stdout) Domain resource_qclvmsles02
> > > terminated
> > > May  7 16:31:41 qclsles01 crmd: [22028]: WARN: process_lrm_event:
lrm.cLRM
> > > operation (35) stop_0 on resource_qclvmsles02 Error: (4)
insufficient
> > > privileges
> >
> >
> > yup I saw that.. it's weird. Heartbeat shutdown the vm, then say these
> > errors.. and if I cleanup the ressource he restart on the correct
node..
> > There should be something I missed lol
> >
> >
> > On 5/7/07, Rene Purcell <[EMAIL PROTECTED]> wrote:
> > > > I would like to know if someone has tried the Novell setup described in
> > > > "http://www.novell.com/linux/technical_library/has.pdf" with an x86_64
> > > > arch?
> > > >
> > > > I've tested this setup with a classic x86 arch and everything was ok...
> > > > but I double-checked my config and everything looks good, yet my VM
> > > > never starts on its original node when it comes back online... and I
> > > > can't find out why!
> > > >
> > > >
> > > > Here's the log from when node1 comes back.. we can see the VM shutting
> > > > down, and after that nothing happens on the other node..
> > > >
> > > > May  7 16:31:25 qclsles01 cib: [22024]: info: cib_diff_notify: notify.c Update (client: 6403, call:13): 0.65.1020 -> 0.65.1021 (ok)
> > > > May  7 16:31:25 qclsles01 tengine: [22591]: info: te_update_diff: callbacks.c Processing diff (cib_update): 0.65.1020 -> 0.65.1021
> > > > May  7 16:31:25 qclsles01 tengine: [22591]: info: extract_event: events.c Aborting on transient_attributes changes
> > > > May  7 16:31:25 qclsles01 tengine: [22591]: info: update_abort_priority: utils.c Abort priority upgraded to 100
> > > > May  7 16:31:25 qclsles01 tengine: [22591]: info: update_abort_priority: utils.c Abort action 0 superceeded by 2
> > > > May  7 16:31:26 qclsles01 cib: [22024]: info: activateCibXml: io.c CIB size is 161648 bytes (was 158548)
> > > > May  7 16:31:26 qclsles01 cib: [22024]: info: cib_diff_notify: notify.c Update (client: 6403, call:14): 0.65.1021 -> 0.65.1022 (ok)
> > > > May  7 16:31:26 qclsles01 haclient: on_event:evt:cib_changed
> > > > May  7 16:31:26 qclsles01 tengine: [22591]: info: te_update_diff: callbacks.c Processing diff (cib_update): 0.65.1021 -> 0.65.1022
> > > > May  7 16:31:26 qclsles01 tengine: [22591]: info: match_graph_event: events.c Action resource_qclvmsles02_stop_0 (9) confirmed
> > > > May  7 16:31:26 qclsles01 cib: [25889]: info: write_cib_contents: io.c Wrote version 0.65.1022 of the CIB to disk (digest: e71c271759371d44c4bad24d50b2421d)
> > > > May  7 16:31:39 qclsles01 kernel: xenbr0: port 3(vif12.0) entering disabled state
> > > > May  7 16:31:39 qclsles01 kernel: device vif12.0 left promiscuous mode
> > > > May  7 16:31:39 qclsles01 kernel: xenbr0: port 3(vif12.0) entering disabled state
> > > > May  7 16:31:39 qclsles01 logger: /etc/xen/scripts/vif-bridge: offline XENBUS_PATH=backend/vif/12/0
> > > > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/12/768
> > > > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/12/832
> > > > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/12/5632
> > > > May  7 16:31:40 qclsles01 logger: /etc/xen/scripts/vif-b

[Linux-HA] NewToHA2

2007-05-08 Thread Eric Marcus

Hello, I am new to HA2 and am having some configuration issues.  I installed 
HA2 (2.0.8-1) on two SUSE 10 (SLES 10) machines using Alan's Education Project 
screencast (http://www.linux-ha.org/Education/Newbie/InstallHeartbeatScreencast).

I think I have a node configuration issue, even though both nodes are listed in 
ha.cf.  I am very familiar with Novell Cluster Services.  The problem I outline 
below makes me think that both of the nodes are trying to be the "Master", but 
I don't know how to fix this.  I've spent a week on this and am feeling very 
stupid!  Here goes.

My ha.cf file for the 2 servers shows

use_logd yes
bcast eth1
node it-mgatedom it-mgatedomc
crm on
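
(One thing worth double-checking: heartbeat matches the names in the node
directive against what uname -n reports, so each box should print exactly one
of those two names.)

uname -n    # expected: it-mgatedom on one machine, it-mgatedomc on the other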


The logd.cf shows

logfacility daemon


The authkeys show

auth 1
1 sha1 cluster1


Now, when I start it up on IT-MGATEDOM, it shows "done".

crm_mon shows only 1 node configured, and after a couple of minutes "Current 
DC: NONE" becomes "Current DC: it-mgatedom" with 0 resources configured.  It 
still shows 1 node, not 2.

Then I go to IT-MGATEDOMC to start it up.  It says "done", and when I do a 
tail /var/log/messages I see this:



it-mgatedomc:~ # /etc/init.d/heartbeat start
Starting High-Availability services:
 done

it-mgatedomc:~ # tail /var/log/messages
May  8 12:06:16 it-mgatedomc heartbeat: [4514]: info: G_main_add_TriggerHandler: Added signal manual handler
May  8 12:06:16 it-mgatedomc heartbeat: [4514]: info: G_main_add_TriggerHandler: Added signal manual handler
May  8 12:06:16 it-mgatedomc heartbeat: [4514]: info: Removing /var/run/heartbeat/rsctmp failed, recreating.
May  8 12:06:16 it-mgatedomc heartbeat: [4514]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth1
May  8 12:06:16 it-mgatedomc heartbeat: [4514]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth1 - Status: 1
May  8 12:06:16 it-mgatedomc heartbeat: [4514]: info: G_main_add_SignalHandler: Added signal handler for signal 17
May  8 12:06:16 it-mgatedomc heartbeat: [4514]: info: Local status now set to: 'up'
May  8 12:06:17 it-mgatedomc heartbeat: [4514]: info: Link it-mgatedom:eth1 up.
May  8 12:06:17 it-mgatedomc heartbeat: [4514]: info: Status update for node it-mgatedom: status active
May  8 12:06:17 it-mgatedomc heartbeat: [4514]: info: Link it-mgatedomc:eth1 up.

[Linux-HA] heartbeat and drbd problems

2007-05-08 Thread Dan Gahlinger

I've gotten DRBD working really well on its own now, and I understand it better.

Now I want to add heartbeat (v2) to the equation.

I'm running Suse Linux 10.2 if it matters.

My problem is I can't seem to get heartbeat to mount the drive.

I'm doing a single drive, single partition test for now.

I am only running one server for the heartbeat part (I just want to get it
going locally first).

I can get it to bring up the virtual IP portion, if I only put that in the
haresources, and that works ok.
But if I try DRBD by itself, or with the filesystem configured, it doesn't
work.

Part of the problem: it complains about OCF_KEY_ip being missing.
If I set this to the virtual IP, it fails;
if I set it to something else, it claims the filesystem mount succeeded, but
it hasn't actually done it.

I don't want to use the XML or OCF or clustering features for now.

my haresources looks like

amd 192.168.10.3 drbddisk::r0 Filesystem::/dev/drbd0::/mysql::ext3


manually after booting the system, for DRBD alone I would run:

drbdsetup primary --do-what-I-say
mount /dev/drbd0 /mysql

and that works perfectly.
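
To see what state DRBD itself is in before heartbeat touches it, a couple of
read-only checks (assuming the resource is named r0 as in the haresources
line, and a DRBD 0.7/8-era drbdadm):

cat /proc/drbd     # connection state and roles, e.g. st:Primary/Secondary
drbdadm state r0   # this node's role for resource r0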

But heartbeat, with or without the drbddisk resource, or with just the
Filesystem resource, doesn't work in any combination.


The one time it looked like it got it right (but it never mounted the
filesystem), I checked the DRBD status and it had changed to
Secondary/Unknown.

note DRBD is running across a secondary ethernet interface (cross-over
cable) using a different set of IP addresses.


The reasoning is this:

I want heartbeat to monitor the PUBLIC (real) interface to detect that the
system is down, not the cross-over cable.
Using heartbeat to monitor the crossover cable doesn't make sense to
me: the cable could be bad while the real connection is up,
causing all sorts of issues.
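
As a sketch only (the interface name and ping target are assumptions), keeping
the heartbeat traffic on the public NIC with a ping node as a tie-breaker
would look roughly like this in ha.cf:

# heartbeat over the public interface, not the DRBD cross-over link
bcast eth0
# an always-reachable address (e.g. the default gateway) used to judge connectivity
ping 192.168.10.1
respawn hacluster /usr/lib/heartbeat/ipfail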

I've tried reading the heartbeat/HA documentation on the DRBD website and the
heartbeat/HA websites, and it just makes my head hurt.

This is as far as I can get (from the ha-log):
ResourceManager[11734]: 2007/05/08_14:07:53 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /mysql ext3 start
ResourceManager[11734]: 2007/05/08_14:07:53 ERROR: Return code 2 from
/etc/ha.d/resource.d/Filesystem
ResourceManager[11734]: 2007/05/08_14:07:53 CRIT: Giving up resources
due to failure of Filesystem::/dev/drbd0::/mysql::ext3
ResourceManager[11734]: 2007/05/08_14:07:53 info: Releasing resource
group: amd 192.168.10.3 Filesystem::/dev/drbd0::/mysql::ext3
ResourceManager[11734]: 2007/05/08_14:07:53 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /mysql ext3 stop
ResourceManager[11734]: 2007/05/08_14:07:53 ERROR: Return code 2 from
/etc/ha.d/resource.d/Filesystem

I'm not sure what I'm doing wrong.

Dan.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] IBMRSA Plugin - STONITH

2007-05-08 Thread fabiomm
Hi Everyone!

I'd like to know whether any of you have already implemented Linux-HA on 
IBM xSeries servers using the ibmrsa STONITH plugin. If so, do you have 
any documentation about it? Details on the configuration I need to 
prepare? An example cib.xml?

Best Regards,
Fabio Martins
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Got it up and running, now what

2007-05-08 Thread Adam Krolewski

Hi Everyone,

 I wanted to ping the group and see if anyone could recommend Linux-HA
resources.  I have the 2.0.8 rpm packages installed on my RHEL ES 4 boxes.
I have built a barebones ha.cf file and it is running correctly according to
crm_mon, and I have hb_gui working as well.  So my next question is where I go
from here: I am trying to set up a virtual IP that is moved between the two
boxes in case of a failure, to make sure our website stays up.  However, I am
not really sure where to go for this.  Could anyone recommend a good doc on
this, or maybe a screencast, webpage or anything else?  I found some
resources describing how to do this with version 1 of the app, but I am
trying to follow the version 2 specification.  Any help would be
appreciated, and thanks ahead of time.

AK
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Got it up and running, now what

2007-05-08 Thread Dave Blaschke

Adam Krolewski wrote:

Hi Everyone,

 I wanted to ping the group and see if anyone could recommend Linux-HA
resources.  I have the 2.0.8 rpm packages installed on my RHEL ES 4 
boxes.
I have built a barebones ha.cf file and it is running correctly 
according to
CRM_MON.  I have hb_gui working as well.  So my next question is where 
do I
go from here, I am trying to setup a virtual IP that is moved from the 
two
boxes in case of a failure to make sure our website is up.  However I 
am not

really sure where to go for this.  Could anyone recommend a good doc on
this? or maybe a screencast, webpage or anything else.  I found some
resources referring on how to do this with version 1 of the app but I am
trying to follow the version 2 specification.  Any help would be
appreciated, and thanks ahead of time.

http://www.linux-ha.org/GettingStartedV2/OneIPAddress
http://www.linux-ha.org/Education/Newbie/IPaddrScreencast


AK
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] ERROR: Emergency Shutdown: Attempting to kill everything ourselves

2007-05-08 Thread Chan, Dan
I got the following errors.
 
ERROR: Emergency Shutdown: Attempting to kill everything ourselves
May  7 00:06:25 sss-db1b heartbeat[16020]: info: killing
/usr/lib/heartbeat/ipfail process group 16033 with signal 9
May  7 00:06:25 sss-db1b heartbeat[16020]: info: killing HBREAD process
16027 with signal 9
May  7 00:06:25 sss-db1b heartbeat[16020]: info: killing HBWRITE process
16028 with signal 9
May  7 00:06:25 sss-db1b heartbeat[16020]: info: killing HBREAD process
16029 with signal 9
May  7 00:06:25 sss-db1b heartbeat[16020]: info: killing HBWRITE process
16030 with signal 9
May  7 00:06:25 sss-db1b heartbeat[16020]: info: killing HBREAD process
16031 with signal 9
May  7 00:06:25 sss-db1b heartbeat[16020]: info: killing HBFIFO process
16023 with signal 9
May  7 00:06:25 sss-db1b heartbeat[16020]: info: killing HBWRITE process
16024 with signal 9
May  7 00:06:25 sss-db1b heartbeat[16020]: info: killing HBREAD process
16025 with signal 9
May  7 00:06:25 sss-db1b heartbeat[16020]: info: killing HBWRITE process
16026 with signal 9
May  7 00:46:06 sss-db1b rpc.statd[898]: recv_rply: can't decode RPC
message!
 
I found out that I had set the deadtime too short, which caused the heartbeat
to fail.
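
For reference, the relevant ha.cf timers look roughly like this (the values
are illustrative only and need tuning for your network):

keepalive 2     # seconds between heartbeats
warntime 10     # warn about late heartbeats
deadtime 30     # declare the peer dead after this long
initdead 60     # extra allowance right after start-up (commonly >= 2 * deadtime)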
 
Now, both of my servers think they are in the transition state.
Neither of them will bring up the resources. How can I clear their
state to bring heartbeat up? I am running version 1.2.3.
 
-Dan
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Re: Got it up and running, now what

2007-05-08 Thread Adam Krolewski

Thanks for the links. I followed them, more or less, but instead created an
IPaddr2 resource that is part of a group.  The IP address now moves between
the two servers.  However, I am not sure whether this is the best way of
doing it.  Also, is there a preferred method for the heartbeat communication
itself?  Right now I have it broadcasting on the same eth interface
as my IPs.
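
For what it's worth, a minimal IPaddr2 primitive in the version 2 CIB usually
looks something like this (the ids, address and nic are placeholders for your
own values), and it can then sit inside a <group>:

<primitive id="ip_web" class="ocf" provider="heartbeat" type="IPaddr2">
  <instance_attributes id="ip_web_attrs">
    <attributes>
      <nvpair id="ip_web_addr" name="ip" value="192.168.1.100"/>
      <nvpair id="ip_web_nic" name="nic" value="eth0"/>
    </attributes>
  </instance_attributes>
</primitive>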

I do have extra NICs in the boxes.  Does it make more sense to run the
heartbeat via eth1, on a private subnet?  Real-world experience would be
greatly appreciated.

 Thanks AK

Hi Everyone,


  I wanted to ping the group and see if anyone could recommend Linux-HA
resources.  I have the 2.0.8 rpm packages installed on my RHEL ES 4
boxes.  I have built a barebones ha.cf file and it is running correctly
according to CRM_MON.  I have hb_gui working as well.  So my next question
is where do I go from here, I am trying to setup a virtual IP that is moved
from the two boxes in case of a failure to make sure our website is up.
However I am not really sure where to go for this.  Could anyone recommend a
good doc on this? or maybe a screencast, webpage or anything else.  I found
some resources referring on how to do this with version 1 of the app but I
am trying to follow the version 2 specification.  Any help would be
appreciated, and thanks ahead of time.

 AK


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] heartbeat runs but doesn't

2007-05-08 Thread Dan Gahlinger

I got heartbeat to run; it says it mounts the filesystem,
everything says it works OK,
except it doesn't actually do anything for the filesystem.

It sets up the virtual IP, though.

Even running the "Filesystem" resource script manually produces the same
result. There is nothing else in the log or debug output.
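
The manual invocation matching the Filesystem resource from the earlier
ha-log is along these lines (device, mount point and fstype as in my
haresources):

/etc/ha.d/resource.d/Filesystem /dev/drbd0 /mysql ext3 start
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /mysql ext3 status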

I'm not sure what's going on.

Dan.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems