[Linux-HA] which ipmi stonith plugin to use?

2009-01-14 Thread Sebastian Reitenbach
Hi,

I want to set up a cluster running on IBM servers. I've seen that there are 
an internal ipmilan and an external/ipmi stonith plugin available.
Any recommendation as to which one I should use?
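
A quick way to compare the two is the stonith command line tool from the
heartbeat package (a sketch; if I remember the options correctly, -L lists
the plugin types and -n prints the parameters a plugin expects):

  stonith -L | grep -i ipmi          # list the available plugin types
  stonith -t external/ipmi -n        # parameters the external/ipmi plugin expects
  stonith -t ipmilan -n              # parameters the ipmilan plugin expects

Whichever plugin is chosen, the parameters it prints are what go into the
stonith resource definition.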

kind regards
Sebastian
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] ldirectord load balancing decision on rtt time?

2008-09-03 Thread Sebastian Reitenbach
Hello everybody,

I have an FTP server connected to the internet with two 16 MBit DSL lines.
I have another root server in a remote location, from which I set up two
openvpn tunnels to the FTP server, each tunnel over a separate DSL line.
I want to upload files of heavily varying sizes and spread the traffic as
evenly as possible across both DSL lines, so that when there are multiple
uploads, both lines are saturated.
Due to the nature of FTP, I'd only load balance the control connection and
let the data connection go directly from the client to the server.

Therefore round robin or similar connection-count-based scheduling
algorithms don't fit that well here.

Can ldirectord make the decision which server to use as the next target
based on, e.g., round-trip times of ICMP packets?
E.g. one upload is running and one DSL line is more or less saturated; the
next connection to ldirectord comes in; ldirectord checks the RTT of ICMP
packets to both FTP servers via the tunnels and then decides, based on the
measured RTTs, which server is next.
Or can ldirectord make the decision based on the timing of the health checks?
E.g. whenever a new connection comes in, a health check to each FTP server
is initiated, and the one with the fastest answer gets the next connection.
As far as I can tell from the ldirectord manual page, the available
scheduling algorithms do not provide such functionality. I've seen that I
can use external scripts for health checking, but these seem to return only
alive or dead, not a qualitative statement about how quickly the servers
are reachable.
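
Since no RTT-based scheduler seems to exist, one workaround I can think of
is to keep a normal weighted scheduler (wrr/wlc) in ldirectord and adjust
the real-server weights from a small cron script based on measured RTTs.
A rough sketch (the VIP/tunnel addresses are made up, and ldirectord may
well overwrite the weights again on its next check interval):

  # average RTT in ms through each tunnel
  RTT1=$(ping -c 3 -q 10.8.0.1 | awk -F/ 'END {print int($5)}')
  RTT2=$(ping -c 3 -q 10.8.1.1 | awk -F/ 'END {print int($5)}')
  # lower RTT -> higher weight, so wrr/wlc prefers the faster tunnel
  ipvsadm -e -t 192.168.100.1:21 -r 10.8.0.1:21 -w $((1000 / (RTT1 + 1)))
  ipvsadm -e -t 192.168.100.1:21 -r 10.8.1.1:21 -w $((1000 / (RTT2 + 1)))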

In case ldirectord cannot help me right now: where in the code should I
look to get an idea of how to implement something like the above as a
scheduling algorithm?

Here I read that there is also a kernel module managing FTP connections
when using ldirectord for LVS:
http://www.ultramonkey.org/3/topologies/lb-eg.html
But they only mention old 2.4 Linux kernels, so I wonder whether that
would work with a modern 2.6 kernel?
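
For what it's worth, the FTP helper they refer to appears to still exist in
2.6 kernels as the ip_vs_ftp module; a quick check (sketch):

  modprobe ip_vs_ftp
  lsmod | grep ip_vs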


kind regards
Sebastian



Re: [Linux-HA] mgmtd not starting on opensuse 11 i386 (unresolved symbol)

2008-08-01 Thread Sebastian Reitenbach
Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> Hi,
> 
> On Wed, Jul 30, 2008 at 08:55:53AM +0200, Sebastian Reitenbach wrote:
> > General Linux-HA mailing list  wrote: 
> > > On Mon, Jul 28, 2008 at 05:52:10PM -, root wrote:
> > > > Hi,
> > > > Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> > > > > Hi,
> > > > > 
> > > > > On Mon, Jul 28, 2008 at 04:41:27PM +0200, Sebastian Reitenbach 
wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I just upgraded my desktop to opensuse 11.0 i586, and updated 
the 
> > box, 
> > > > then 
> > > > > > installed the heartbeat rpm's 2.1.3 from download.opensuse.org.
> > > > > > 
> > > > > > I've these rpm's installed right now:
> > > > > > pacemaker-heartbeat-0.6.5-8.2
> > > > > > heartbeat-common-2.1.3-23.1
> > > > > > heartbeat-resources-2.1.3-23.1
> > > > > > heartbeat-2.1.3-23.1
> > > > > > pacemaker-pygui-1.4-1.3
> > > > > > 
> > > > > > I've added these lines to /etc/ha.d/ha.cf to start mgmtd 
> > automatically:
> > > > > > apiauth mgmtd   uid=root
> > > > > > respawn root/usr/lib/heartbeat/mgmtd -v
> > > > > > 
> > > > > > but mgmtd fails to start, when I try to start it on the 
commandline, 
> > then 
> > > > I 
> > > > > > see the following output:
> > > > > > 
> > > > > > /usr/lib/heartbeat/mgmtd: symbol lookup 
> > error: /usr/lib/libpe_status.so.2: 
> > > > > > undefined symbol: stdscr
> > > > > > 
> > > > > > As far as I researched now, the stdscr symbol is expected to 
come 
> > from 
> > > > > > ncurses?
> > > > > 
> > > > > Looks like a dependency problem. Does the package containing
> > > > > mgmtd depend on the ncurses library? Though I don't understand
> > > > > why mgmtd needs ncurses.
> > > > I found this out, in a thread in some m/l, regarding the error 
message 
> > about 
> > > > the undefined symbol, but maybe this is just wrong.
> > > 
> > > stdscr is an external variable defined in ncurses.h which is
> > > included from ./lib/crm/pengine/unpack.h which is part of the
> > > code that gets built in libpe_status. The pacemaker rpm, which
> > > includes that library, does depend on libncurses. Is that the
> > > case with the pacemaker you downloaded?
> > I've these installed:
> > rpm -qa | grep -i ncurs
> > ncurses-utils-5.6-83.1
> > libncurses5-5.6-83.1
> > yast2-ncurses-pkg-2.16.14-0.1
> > yast2-ncurses-2.16.27-8.1
> > 
> > rpm -q --requires pacemaker-heartbeat
> > /bin/sh
> > /bin/sh
> > /sbin/ldconfig
> > /sbin/ldconfig
> > rpmlib(PayloadFilesHavePrefix) <= 4.0-1
> > rpmlib(CompressedFileNames) <= 3.0.4-1
> > /bin/sh
> > /usr/bin/python
> > libbz2.so.1
> > libc.so.6
> > libc.so.6(GLIBC_2.0)
> > libc.so.6(GLIBC_2.1)
> > libc.so.6(GLIBC_2.1.3)
> > libc.so.6(GLIBC_2.2)
> > libc.so.6(GLIBC_2.3)
> > libc.so.6(GLIBC_2.3.4)
> > libc.so.6(GLIBC_2.4)
> > libccmclient.so.1
> > libcib.so.1
> > libcrmcluster.so.1
> > libcrmcommon.so.2
> > libdl.so.2
> > libgcrypt.so.11
> > libglib-2.0.so.0
> > libgnutls.so.26
> > libgnutls.so.26(GNUTLS_1_4)
> > libgpg-error.so.0
> > libhbclient.so.1
> > liblrm.so.0
> > libltdl.so.3
> > libm.so.6
> > libncurses.so.5
> > libpam.so.0
> > libpam.so.0(LIBPAM_1.0)
> > libpcre.so.0
> > libpe_rules.so.2
> > libpe_status.so.2
> > libpengine.so.3
> > libplumb.so.1
> > librt.so.1
> > libstonithd.so.0
> > libtransitioner.so.1
> > libxml2.so.2
> > libz.so.1
> > rpmlib(PayloadIsLzma) <= 4.4.2-1
> > 
> > 
> > rpm -ql libncurses5-5.6-83.1
> > /lib/libncurses.so.5
> > /lib/libncurses.so.5.6
> > ...
> > 
> > so it does require ncurses, but it is installed.
> > 
> > 
> > but 
> > nm /lib/libncurses.so.5.6
> > nm: /lib/libncurses.so.5.6: no symbols
> 
> That's fine, it means that the binary is stripped. If you take a
> look at libncurses.a (which is probably only in the development
> package), you should see some symbols. BTW, you can also try
> objdump with -T:
> 
> $ objdump -T libncurses.so.5 | grep stdscr
> 0015a630 gDO .bss   0008  Base stdscr


Here I have:
objdump -T  /lib64/libncurses.so.5 | grep stdscr
002465e8 gDO .bss   0008  Base stdscr

Meanwhile I observed the problem on an openSUSE 10.3 i386 and on openSUSE 11
x86_64 too.

Seems like there is a general problem with this version.
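
A way to check whether the shipped library was actually linked against
ncurses (a sketch; the path is the one from this box, use /usr/lib64 on
x86_64):

  ldd /usr/lib/libpe_status.so.2 | grep ncurses
  objdump -p /usr/lib/libpe_status.so.2 | grep NEEDED

If libncurses.so.5 does not show up in the NEEDED entries, the stdscr
reference cannot be resolved at runtime, which would explain the symbol
lookup error.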

kind regards
Sebastian



[Linux-HA] problem with pingd updating the state to attrd

2008-07-30 Thread Sebastian Reitenbach
Hi,

I'm on SLES10 and use heartbeat-2.1.3.

It worked for weeks, but for some reason, after a reboot of the two-node
cluster, pingd on one of the hosts has a problem telling attrd that it
can ping the ping node:

heartbeat[3890]: 2008/07/29_07:44:51 info: glib: ping heartbeat started.
heartbeat[3890]: 2008/07/29_07:44:53 info: Status update for node 
192.168.0.1: status ping
cib[3932]: 2008/07/29_07:45:06 info: log_data_element: readCibXmlFile: 
[on-disk] ... (six such lines; the CIB XML dumped after each of them was 
stripped by the list archive)
attrd[3935]: 2008/07/29_07:46:35 info: find_hash_entry: Creating hash entry 
for pingd
attrd[3935]: 2008/07/29_07:46:35 info: attrd_perform_update: Sent 
delete -22: pingd (null) status
heartbeat[3890]: 2008/07/29_07:49:12 ERROR: MSG: Dumping message with 23 
fields


On boxmaster102 everything is fine, but on boxmaster101 pingd has the
problem shown above.

Any idea how to get pingd to update attrd again?
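
To see whether attrd pushed any pingd value into the status section at all,
something like this should work (a sketch):

  cibadmin -Q -o status | grep -A 3 pingd

On the healthy node I'd expect a pingd nvpair with a non-empty value; on
boxmaster101 the log above suggests the attribute was only ever deleted
((null) value).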

kind regards
Sebastian



Re: [Linux-HA] mgmtd not starting on opensuse 11 i386 (unresolved symbol)

2008-07-30 Thread Sebastian Reitenbach
General Linux-HA mailing list  wrote: 
> On Mon, Jul 28, 2008 at 05:52:10PM -, root wrote:
> > Hi,
> > Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> > > Hi,
> > > 
> > > On Mon, Jul 28, 2008 at 04:41:27PM +0200, Sebastian Reitenbach wrote:
> > > > Hi,
> > > > 
> > > > I just upgraded my desktop to opensuse 11.0 i586, and updated the 
box, 
> > then 
> > > > installed the heartbeat rpm's 2.1.3 from download.opensuse.org.
> > > > 
> > > > I've these rpm's installed right now:
> > > > pacemaker-heartbeat-0.6.5-8.2
> > > > heartbeat-common-2.1.3-23.1
> > > > heartbeat-resources-2.1.3-23.1
> > > > heartbeat-2.1.3-23.1
> > > > pacemaker-pygui-1.4-1.3
> > > > 
> > > > I've added these lines to /etc/ha.d/ha.cf to start mgmtd 
automatically:
> > > > apiauth mgmtd   uid=root
> > > > respawn root/usr/lib/heartbeat/mgmtd -v
> > > > 
> > > > but mgmtd fails to start, when I try to start it on the commandline, 
then 
> > I 
> > > > see the following output:
> > > > 
> > > > /usr/lib/heartbeat/mgmtd: symbol lookup 
error: /usr/lib/libpe_status.so.2: 
> > > > undefined symbol: stdscr
> > > > 
> > > > As far as I researched now, the stdscr symbol is expected to come 
from 
> > > > ncurses?
> > > 
> > > Looks like a dependency problem. Does the package containing
> > > mgmtd depend on the ncurses library? Though I don't understand
> > > why mgmtd needs ncurses.
> > I found this out, in a thread in some m/l, regarding the error message 
about 
> > the undefined symbol, but maybe this is just wrong.
> 
> stdscr is an external variable defined in ncurses.h which is
> included from ./lib/crm/pengine/unpack.h which is part of the
> code that gets built in libpe_status. The pacemaker rpm, which
> includes that library, does depend on libncurses. Is that the
> case with the pacemaker you downloaded?
I've these installed:
rpm -qa | grep -i ncurs
ncurses-utils-5.6-83.1
libncurses5-5.6-83.1
yast2-ncurses-pkg-2.16.14-0.1
yast2-ncurses-2.16.27-8.1

rpm -q --requires pacemaker-heartbeat
/bin/sh
/bin/sh
/sbin/ldconfig
/sbin/ldconfig
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(CompressedFileNames) <= 3.0.4-1
/bin/sh
/usr/bin/python
libbz2.so.1
libc.so.6
libc.so.6(GLIBC_2.0)
libc.so.6(GLIBC_2.1)
libc.so.6(GLIBC_2.1.3)
libc.so.6(GLIBC_2.2)
libc.so.6(GLIBC_2.3)
libc.so.6(GLIBC_2.3.4)
libc.so.6(GLIBC_2.4)
libccmclient.so.1
libcib.so.1
libcrmcluster.so.1
libcrmcommon.so.2
libdl.so.2
libgcrypt.so.11
libglib-2.0.so.0
libgnutls.so.26
libgnutls.so.26(GNUTLS_1_4)
libgpg-error.so.0
libhbclient.so.1
liblrm.so.0
libltdl.so.3
libm.so.6
libncurses.so.5
libpam.so.0
libpam.so.0(LIBPAM_1.0)
libpcre.so.0
libpe_rules.so.2
libpe_status.so.2
libpengine.so.3
libplumb.so.1
librt.so.1
libstonithd.so.0
libtransitioner.so.1
libxml2.so.2
libz.so.1
rpmlib(PayloadIsLzma) <= 4.4.2-1


rpm -ql libncurses5-5.6-83.1
/lib/libncurses.so.5
/lib/libncurses.so.5.6
...

So the package does require ncurses, and ncurses is installed.


But:
nm /lib/libncurses.so.5.6
nm: /lib/libncurses.so.5.6: no symbols

kind regards
Sebastian



[Linux-HA] mgmtd not starting on opensuse 11 i386 (unresolved symbol)

2008-07-28 Thread Sebastian Reitenbach
Hi,

I just upgraded my desktop to openSUSE 11.0 i586, updated the box, and then
installed the heartbeat 2.1.3 RPMs from download.opensuse.org.

I have these RPMs installed right now:
pacemaker-heartbeat-0.6.5-8.2
heartbeat-common-2.1.3-23.1
heartbeat-resources-2.1.3-23.1
heartbeat-2.1.3-23.1
pacemaker-pygui-1.4-1.3

I've added these lines to /etc/ha.d/ha.cf to start mgmtd automatically:
apiauth mgmtd   uid=root
respawn root    /usr/lib/heartbeat/mgmtd -v

but mgmtd fails to start. When I try to start it on the command line, I
see the following output:

/usr/lib/heartbeat/mgmtd: symbol lookup error: /usr/lib/libpe_status.so.2: 
undefined symbol: stdscr

As far as I have researched, the stdscr symbol is expected to come from
ncurses? Can I link the existing library/binary against ncurses without
having to recompile?
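
One workaround that avoids relinking entirely, at least for testing on the
command line, would be to preload the ncurses library so that stdscr is
already resolvable when libpe_status.so.2 is loaded (a sketch; the library
path is the one from this box):

  LD_PRELOAD=/lib/libncurses.so.5 /usr/lib/heartbeat/mgmtd -v

That obviously does not fix the missing NEEDED entry in the package itself.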

kind regards
Sebastian



Re: [Linux-HA] strange behaviour of resources in a group when one resource is stopped

2008-03-18 Thread Sebastian Reitenbach
"Andrew Beekhof" <[EMAIL PROTECTED]> wrote: 
> On Wed, Feb 27, 2008 at 6:05 PM, Sebastian Reitenbach
> <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> >  I have a two node cluster, the resources are divided into two groups. 
Both
> >  groups are collocated=false and ordered=false. There are some 
constraints to
> >  keep the group on a given host, and some orders, but nothing what could
> >  explain me why happened what I have seen.
> >
> >  I stopped one resource in group_Master102, AFH_5, so this one stopped,
> 
> with target_role?
I clicked stop in the GUI, so whatever the GUI does in that case.

> 
> > but
> >  all other resources in that group were restarted too. I cannot see the
> >  reason why the other resources in that group were restarted too,
> 
> stopped and restarted or just sent an additional start action?
I've seen them stopping and restarting, both in the GUI and with crm_mon.
For a short time the resources were in state stopped, and then started again.

> 
> > when I just
> >  wanted to shutdown only one of them.
> >  I expected only the AFH_5 resource to be stopped, and the others just 
stay
> >  running. When starting a stopped resource in a group, this phenomenon 
does
> >  not happen. I've appended the output of hb_report tool.
> >
> >  I'm on sles10sp1 x86_64, running heartbeat 2.1.2.
> >
Sebastian



Re: [Linux-HA] question regarding orderings in resource groups

2008-03-18 Thread Sebastian Reitenbach
Hi,

Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> >
> > but what about the other thing I mentioned, is this then a bug?
> > with the three resources in the collocated, unordered group. I've  
> > seen the
> > seond and third resource stopping, when I shutdown the second, but  
> > the first
> > still left running.
> 
> Thats the correct behavior (and by design actually).
Thanks for pointing that out.

Sebastian



Re: [Linux-HA] problem with resource groups and colocations

2008-03-13 Thread Sebastian Reitenbach
Hi,
General Linux-HA mailing list  wrote: 
> On Tue, Mar 11, 2008 at 2:32 PM, Dejan Muhamedagic <[EMAIL PROTECTED]> 
wrote:
> > Hi,
> >
> >
> >  On Tue, Mar 11, 2008 at 11:56:07AM +0100, Andreas Kurz wrote:
> >  > On Tue, Mar 11, 2008 at 11:02 AM, Sebastian Reitenbach
> >  > <[EMAIL PROTECTED]> wrote:
> >  > > Hi,
> >  > >
> >  > >  I want to achieve the following:
> >  > >  I have two groups of resources, these shall run on the same host, 
and
> >  > >  startup in a given order.
> >  > >
> >  > >  Therefore I created an order and an collocation constraint.
> >  > >  So group1 starts before group2, and the collocation says, if 
group1 is not
> >  > >  able to run on a node, group2 will not start.
> >  > >
> >  > >  However, if all resources in group1 are started, then the 
resources in
> >  > >  group2 are started too. But when I then shutdown any single 
resource in
> >  > >  group1, then group2 stops working too.
> >  > >  I am not sure, whether my collocation or order is the reason for 
the
> >  > >  observed behavior.
> >  >
> >  > I have not tried it by myself but there are these "Advisory-Only
> >  > Ordering" constraints:
> >  >
> >  > 
> >
> >  Good advice. Though even when all of the group1 is stopped,
> >  group2 won't stop either.
> 
> According to the documentation it should ... and after doing some
> ptests, I see it does ;-). If the complete group1 is stopped, group2
> is stopped to. If only one resource in group1 is stopped, group2 does
> nothing.
Sorry for my late reply, I had no time to test in between. I have to say
thanks a lot, this works very well, exactly what I wanted.

kind regards
Sebastian



[Linux-HA] problem with resource groups and colocations

2008-03-11 Thread Sebastian Reitenbach
Hi,

I want to achieve the following:
I have two groups of resources; they shall run on the same host and start
up in a given order.

Therefore I created an order and a collocation constraint.
So group1 starts before group2, and the collocation says that if group1 is
not able to run on a node, group2 will not start.
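
For reference, constraints of that kind look roughly like this (a sketch
using the heartbeat 2.1.x DTD attribute names as I understand them; the
ids are made up):

  <rsc_order id="order_group1_group2" from="group2" to="group1" type="after"/>
  <rsc_colocation id="colo_group2_group1" from="group2" to="group1"
                  score="INFINITY"/>

The advisory-only variant suggested later in this thread would, as far as I
understand it, use score="0" on the order constraint instead.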

However, if all resources in group1 are started, then the resources in
group2 are started too. But when I then shut down any single resource in
group1, group2 stops working too.
I am not sure whether my collocation or my order constraint is the reason
for the observed behavior.

Right now, from what I observed, it seems that when I stop just one
resource in the group, the group itself is seen as stopped, and group2
then stops too.
My question is: is there a parameter or something to keep the group seen
as "started" until all resources in the group, or the group as a whole, are
set to be stopped? Is that possible?

Otherwise I'd have to use very many individual resources, and very many
order and collocation constraints.

cheers
Sebastian



Re: [Linux-HA] ERROR: crm_abort: ha_set_tm_time: Triggered assert at iso8601.c:887

2008-03-03 Thread Sebastian Reitenbach
Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote: 
> On 2008-02-29T08:30:37, Sebastian Reitenbach 
<[EMAIL PROTECTED]> wrote:
> 
> > Hi,
> > 
> > I've seen these messages appearing when I connect the hb_gui to the 
mgmtd:
> > 
> > mgmtd[6819]: 2008/02/29_08:03:51 ERROR: crm_abort: ha_set_tm_time: 
Triggered 
> > assert at iso8601.c:887 : rhs->tm_mday < 0 || lhs->days == rhs->tm_mday
> 
> Hey, besides this being an obviously fairly embarrassing bug ;-), did
> anyone actually observe any misbehaviour except the dangerously looking
> error message?
No, the message was just disturbing; I did not notice any other problems
related to it. Also, the error disappeared from the
crm_verify -LV output on the 1st of March.

> 
> Feedback on this question would greatly help in assessing the urgency
> with which we need to push the update. My guess is that it doesn't
> affect the actual operation of the cluster, except scaring the hell out
> of the admin ...
> 
cheers
Sebastian



Re: [Linux-HA] ERROR: crm_abort: ha_set_tm_time: Triggered assert at iso8601.c:887

2008-02-29 Thread Sebastian Reitenbach
Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> Hi,
> 
> On Fri, Feb 29, 2008 at 01:34:38PM +0100, Sebastian Reitenbach wrote:
> > "Damon Estep" <[EMAIL PROTECTED]> wrote: 
> > > I am getting it too, after midnight on 2/29 in leap year - looks like 
a
> > > date bug to me :)
> > that would explain that I haven't seen it before, and hopefully it will 
be 
> > fixed automagically by tomorrow ;)
> > 
> > maybe I should create a bug report so that it gets fixed.
> 
> Yes, please.

there it is:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=1850

cheers
Sebastian

> 
> Thanks,
> 
> Dejan



RE: [Linux-HA] ERROR: crm_abort: ha_set_tm_time: Triggered assert at iso8601.c:887

2008-02-29 Thread Sebastian Reitenbach
"Damon Estep" <[EMAIL PROTECTED]> wrote: 
> I am getting it too, after midnight on 2/29 in leap year - looks like a
> date bug to me :)
That would explain why I haven't seen it before, and hopefully it will be
fixed automagically by tomorrow ;)

maybe I should create a bug report so that it gets fixed.
thanks 
Sebastian

> 
> > -Original Message-
> > From: [EMAIL PROTECTED] [mailto:linux-ha-
> > [EMAIL PROTECTED] On Behalf Of Sebastian Reitenbach
> > Sent: Friday, February 29, 2008 12:31 AM
> > To: linux-ha@lists.linux-ha.org
> > Subject: [Linux-HA] ERROR: crm_abort: ha_set_tm_time: Triggered assert
> > atiso8601.c:887
> > 
> > Hi,
> > 
> > I've seen these messages appearing when I connect the hb_gui to the
> > mgmtd:
> > 
> > mgmtd[6819]: 2008/02/29_08:03:51 ERROR: crm_abort: ha_set_tm_time:
> > Triggered
> > assert at iso8601.c:887 : rhs->tm_mday < 0 || lhs->days ==
> rhs->tm_mday
> > 
> > and they are also shown in the output of crm_verify:
> > 
> > crm_verify -LV
> > crm_verify[17297]: 2008/02/29_08:24:25 ERROR: crm_abort:
> > ha_set_tm_time:
> > Triggered assert at iso8601.c:887 : rhs->tm_mday < 0 || lhs->days ==
> > rhs->tm_mday
> > 
> > I've made sure using ntp, that both cluster nodes have the same time.
> > I'm wondering what this message is all about, and how I could clean
> > that up?
> > To clean up failed resources, I'd take crm_resource, but how do I
> clean
> > this? I already tried to shutdown both cluster nodes, and then
> removing
> > all
> > vital stuff from /var/lib/heartbeat, e.g. rm crm/* h* delhostcache
> > pengine/*, and then restarted heartbeat, and loaded the cluster
> > configuration again. Then when connecting the hb_gui again, the error
> > message showed up again.
> > 
> > I'm on sles10sp1, running heartbeat 2.1.3
> > 
> > kind regards
> > Sebastian
> > 
> > ___
> > Linux-HA mailing list
> > Linux-HA@lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> 



[Linux-HA] ERROR: crm_abort: ha_set_tm_time: Triggered assert at iso8601.c:887

2008-02-28 Thread Sebastian Reitenbach
Hi,

I've seen these messages appearing when I connect the hb_gui to the mgmtd:

mgmtd[6819]: 2008/02/29_08:03:51 ERROR: crm_abort: ha_set_tm_time: Triggered 
assert at iso8601.c:887 : rhs->tm_mday < 0 || lhs->days == rhs->tm_mday

and they are also shown in the output of crm_verify:

crm_verify -LV
crm_verify[17297]: 2008/02/29_08:24:25 ERROR: crm_abort: ha_set_tm_time: 
Triggered assert at iso8601.c:887 : rhs->tm_mday < 0 || lhs->days == 
rhs->tm_mday

I've made sure, using NTP, that both cluster nodes have the same time.
I'm wondering what this message is all about and how I could clean it up.
To clean up failed resources I'd use crm_resource, but how do I clean
this? I already tried shutting down both cluster nodes, removing all
vital state from /var/lib/heartbeat (e.g. rm crm/* h* delhostcache
pengine/*), then restarting heartbeat and loading the cluster
configuration again. When connecting with the hb_gui again, the error
message showed up again.

I'm on sles10sp1, running heartbeat 2.1.3

kind regards
Sebastian



Re: [Linux-HA] meta_attributes twice for some resources

2008-02-27 Thread Sebastian Reitenbach
Hi,
Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> Hi,
> 
> On Wed, Feb 27, 2008 at 05:30:43PM +0100, Lars Marowsky-Bree wrote:
> > On 2008-02-27T17:18:32, Sebastian Reitenbach 
<[EMAIL PROTECTED]> wrote:
> > 
> > > Your conclusions are more or less the same, I had.
> > > However, I'll create a bug report later. unfortunately, still no idea 
how it 
> > > happened. We removed the duplicate entries, replacing the CIB 
> > > (cibadmin -R -o resources), then everything was working normal again. 
We try 
> > > to fiddle around a bit with the cluster, to try to reproduce the 
problem. 
> > > When we figure out, what caused it, then I'll add this to the bug 
report 
> > > too, but I'm not very optimistic about that yet ;)
> > 
> > All commandline tools log how they were invoked. All CIB states are
> > archived in /var/lib/heartbeat/pengine; but yes, this looks as if it was
> > caused by the GUI somehow.
> 
> Could be. If that's the case, then it's really a nuisance. Or it
> could have been the cibadmin. I guess that crm_resource wouldn't
> create another meta_attributes section if there's already one
> present.
> 
I haven't been able to reproduce the problem yet, but I created a bug
report: http://developerbugs.linux-foundation.org/show_bug.cgi?id=1848


sebastian



[Linux-HA] strange behaviour of resources in a group when one resource is stopped

2008-02-27 Thread Sebastian Reitenbach
Hi,

I have a two-node cluster; the resources are divided into two groups. Both
groups are collocated=false and ordered=false. There are some constraints to
keep each group on a given host, and some orders, but nothing that could
explain what I have seen.

I stopped one resource in group_Master102, AFH_5, so this one stopped, but
all other resources in that group were restarted too. I cannot see any
reason why the other resources in that group were restarted when I just
wanted to shut down only one of them.
I expected only the AFH_5 resource to be stopped and the others to just stay
running. When starting a stopped resource in a group, this phenomenon does
not happen. I've appended the output of the hb_report tool.

I'm on sles10sp1 x86_64, running heartbeat 2.1.2.

kind regards
Sebastian



report.out2.tar.gz
Description: application/compressed

Re: [Linux-HA] meta_attributes twice for some resources

2008-02-27 Thread Sebastian Reitenbach
Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> Hi,
> 
> On Tue, Feb 26, 2008 at 06:21:58PM +0100, Sebastian Reitenbach wrote:
> > Hi,
> > 
> > I was wondering today why some of the resources in the cluster behaved 
> > strangely, e.g. did not reacted on "start/stop/clean up" when clicking 
in 
> > the GUI. Then I tried this with crm_resource, and it was whining because 
the 
> > target_role matched twice, and it did not know, which one to use. So I 
took 
> > a look at the CIB, and found sth. like below, for a bunch of resources:
> > 
> > [CIB snippet mangled by the list archive: it showed a primitive with its
> > instance_attributes (a control program under /pps/sw/bin/PPS/ and
> > dropbox_pid_file=/var/run/Dropbox/ArchiveFileHandler.pid) and two separate
> > meta_attributes sets, one of them containing target_role="stopped"]
> > 
> > note the double meta_attributes. I know, I've no configuration files 
here, 
> > as I have no idea, when this happened in the last one or two days. I'm 
just 
> > asking, maybe someone has seen sth. like this before, and maybe could 
share 
> > the info what might have caused it?
> 
> Really can't say. The one with the longish id was most probably
> created by the GUI. The CRM won't touch an attribute if it finds
> more than one. Then the id must be specified as well. The only
> solution is to drop one of the meta_attributes or
> instance_attribute sets. The GUI however can't do that
> automatically (though it probably wouldn't even try at this
> stage) as only a human can figure out which one should be gone.
> I'm not sure what is the benefit of having more than one set of
> meta_attributes in a resource.  This is not exactly a bug, but I
> think that it deserves a bugzilla entry since it leads to a very
> confusing and unexpected behaviour. Can you please file one?

Your conclusions are more or less the same as mine.
However, I'll create a bug report later; unfortunately I still have no idea
how it happened. We removed the duplicate entries by replacing the CIB
(cibadmin -R -o resources), and then everything was working normally again.
We'll fiddle around a bit with the cluster to try to reproduce the problem.
When we figure out what caused it, I'll add that to the bug report
too, but I'm not very optimistic about that yet ;)
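
For the record, the dump/edit/replace cycle looked roughly like this (a
sketch; the file name is arbitrary):

  cibadmin -Q -o resources > resources.xml
  # delete the duplicate meta_attributes blocks in an editor, then:
  cibadmin -R -o resources -x resources.xml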

thanks
Sebastian



Re: [Linux-HA] meta_attributes twice for some resources

2008-02-27 Thread Sebastian Reitenbach
Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA 
mailing list  wrote: 
> Hi,
> 
> I was wondering today why some of the resources in the cluster behaved 
> strangely, e.g. did not reacted on "start/stop/clean up" when clicking in 
> the GUI. Then I tried this with crm_resource, and it was whining because 
the 
> target_role matched twice, and it did not know, which one to use. So I 
took 
> a look at the CIB, and found sth. like below, for a bunch of resources:
> 
> [CIB snippet mangled by the list archive: a primitive with two separate
> meta_attributes sets, one of them containing target_role="stopped"; see
> the previous message]
> 
> 
> note the double meta_attributes. I know, I've no configuration files here, 
> as I have no idea, when this happened in the last one or two days. I'm 
just 
> asking, maybe someone has seen sth. like this before, and maybe could 
share 
> the info what might have caused it?

forgot to mention, I'm running heartbeat-2.1.2-28.1, on openSUSE 10.2 i586

Sebastian



[Linux-HA] meta_attributes twice for some resources

2008-02-26 Thread Sebastian Reitenbach
Hi,

I was wondering today why some of the resources in the cluster behaved
strangely, e.g. did not react to "start/stop/clean up" when clicking in
the GUI. Then I tried this with crm_resource, and it was complaining because
target_role matched twice and it did not know which one to use. So I took
a look at the CIB and found something like the snippet below for a bunch of
resources:

[CIB snippet lost in the list archive: each affected primitive carried two
separate meta_attributes sets, both containing a target_role nvpair]


Note the double meta_attributes. I know I have no configuration files here,
as I have no idea when in the last one or two days this happened. I'm just
asking whether someone has maybe seen something like this before and could
share what might have caused it?
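
For illustration, a primitive with the problematic shape looks roughly like
this (made-up ids, and a Dummy resource standing in for the real one):

  <primitive id="some_resource" class="ocf" provider="heartbeat" type="Dummy">
    <meta_attributes id="some_resource_meta_1">
      <attributes>
        <nvpair id="some_resource_tr_1" name="target_role" value="started"/>
      </attributes>
    </meta_attributes>
    <meta_attributes id="some_resource_meta_2">
      <attributes>
        <nvpair id="some_resource_tr_2" name="target_role" value="stopped"/>
      </attributes>
    </meta_attributes>
  </primitive>

Two meta_attributes sets on one primitive is what makes crm_resource
complain that target_role matches twice.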

thanks 
Sebastian



Re: [Linux-HA] question regarding orderings in resource groups

2008-02-19 Thread Sebastian Reitenbach
Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> On 2008-02-19T15:49:28, Sebastian Reitenbach
<[EMAIL PROTECTED]> wrote:
>
> > > Make rsc 'from' run on the same machine as rsc 'to'
> > >
> > > If rsc 'to' cannot run anywhere and 'score' is INFINITY,
> > >   then rsc 'from' wont be allowed to run anywhere either
> > > If rsc 'from' cannot run anywhere, then 'to' wont be affected
> > >
> > > -->
> > >
> > > (You can force this to be bidirectional if you set symmetrical to true
for
> > > the
> > > colocation constraint; I don't think you can set that for groups.)
> >
> > I am aware of that, thanks. But I wanted to use groups, to not need such
a
> > lot of constraints.
>
> Yeah, I agree. You'd need N:N-1 constraints to get what you want, which
> probably wouldn't make you happy ;-)
>
> You could all colocate them with another resource (if there is one they
> need to share; perhaps the fs?) This would reduce the number to N
> constraints.
>
> Or, you could use a non-colocated, non-ordered group, and then define a
> rsc_location rule to make them all run on the same node if available.
I haven't tested this yet, because I only have a one-node cluster here right
now ;). However, when I try to create a location constraint via the GUI I
can only select the group as a whole, not the group members. When I
select the group, will the group members then automatically be kept on the
same node, whatever happens? That would be just one constraint.
If so, then I don't really understand what the collocated parameter is good
for: setting it to false in that case would not make sense, and
setting it to "yes" would be redundant.
Then the collocated parameter on a group only makes sense when set to yes
but there is no preference for where the group should run.
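
For reference, the kind of rsc_location rule Lars describes might look like
this (a sketch with the 2.1.x DTD syntax as I understand it; ids and the
node name are made up):

  <rsc_location id="loc_group1" rsc="group1">
    <rule id="loc_group1_rule" score="1000">
      <expression id="loc_group1_expr" attribute="#uname"
                  operation="eq" value="node1"/>
    </rule>
  </rsc_location>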

>
> Or, a colocation constraint from that group to the resource you want to
> collocate with. I'm not sure this works. Would reduce the number to 1
> constraint.
Yeah, that would be more or less the same as a location constraint for the
whole group, as above.

>
> Groups were meant as a short-hand for the most common case, and now
> people find out other uses for them; we need to find ways how to make
> the groups more powerful, or the constraints (to reduce the need for
> more powerful groups).

But what about the other thing I mentioned, is this then a bug?
With the three resources in the collocated, unordered group I've seen the
second and third resources stopping when I shut down the second, but the
first was still left running. From your explanation in the other mail I'd
expect the first to be shut down too, which just does not happen.

kind regards
Sebastian
>
>
> Regards,
> Lars
>
> --
> Teamlead Kernel, SuSE Labs, Research and Development
> SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
>



Re: [Linux-HA] question regarding orderings in resource groups

2008-02-19 Thread Sebastian Reitenbach
Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote: 
> On 2008-02-19T12:11:26, Sebastian Reitenbach 
<[EMAIL PROTECTED]> wrote:
> 
> > there ordered is set to false. I have the group running, and when I then 
> > e.g. want to stop the resource D2, then D3 stops too. Only when I change 
> > collocated to false, then D3 keeps running when I stop D2.
> > 
> > Seems to be not working as I understood it. Am I missing anything 
important 
> > here, or maybe just a bug? 
> 
> This is working as expected, I think. Because the resources are required
> to be collocated, but you stopped one, the others also have to stop.
> 

I understood the collocated parameter of a group as: when the resources run,
they have to run on the same host; when they do not run, they just do not
run, but do not influence the others. However, from your explanation, when
I stop any resource in a collocated group, then all resources in that group
have to stop, not just the resources listed after the one I
explicitly stopped. When I stopped the D2 resource, D3 was stopped too,
but D1 was kept running.



> See the comment in the DTD:
> 
> 
> 
> (You can force this to be bidirectional if you set symmetrical to true for 
the
> colocation constraint; I don't think you can set that for groups.)

I am aware of that, thanks. But I wanted to use groups precisely to avoid
needing such a lot of constraints.

cheers
Sebastian



[Linux-HA] question regarding orderings in resource groups

2008-02-19 Thread Sebastian Reitenbach
Hi,

As far as I understand groups, the ordered parameter means, when set to yes,
that the resources in the group are started and stopped in the order in
which they appear in the CIB. The collocated parameter means that, when set
to yes, all resources in a group run on the same cluster node.

I just created the following resource group:

[group definition mangled by the list archive: a group of three Dummy
resources D1, D2 and D3, with ordered="false" and collocated="true"]


There ordered is set to false. I have the group running, and when I then
e.g. want to stop the resource D2, then D3 stops too. Only when I change
collocated to false does D3 keep running when I stop D2.
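
For illustration, such a group would look roughly like this (ids are made
up, and whether ordered/collocated live on the group tag or in
meta_attributes may differ between versions):

  <group id="group_dummies" ordered="false" collocated="true">
    <primitive id="D1" class="ocf" provider="heartbeat" type="Dummy"/>
    <primitive id="D2" class="ocf" provider="heartbeat" type="Dummy"/>
    <primitive id="D3" class="ocf" provider="heartbeat" type="Dummy"/>
  </group>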

This does not seem to work the way I understood it. Am I missing anything
important here, or is this maybe just a bug?

I'm on sles10sp1, using heartbeat 2.1.3.

cheers
Sebastian



Re: [Linux-HA] problems with quorumd

2008-02-07 Thread Sebastian Reitenbach
Hi,
Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> Hi,
> 
> On Thu, Feb 07, 2008 at 05:00:14PM +0100, Sebastian Reitenbach wrote:
> > Hi,
> > 
> > I have a 4 node cluster, and wanted to setup a quorum server, so that I 
do 
> > not need three running cluster nodes to get quorum. The quorumd IP 
address 
> > is a shared IP on another two node cluster. 
> > 
> > I've done the following tests, the quorumd from a 2.1.2 version of 
> > heartbeat, the cluster nodes had 2.1.3 version:
> > 
> > 
> > 
> > start quorumd 
> > start first cluster node -> (node becomes DC, contacting the quorum) 
cluster 
> > gets quorm
> > start second cluster node -> cluster still has quorum
> > stop DC, -> see other node becoming DC, and contacting quorum server, 
> > cluster still has quorum
> > kill quorumd, then see RST packets going back to cluster node (the DC 
tries 
> > to contact the quorumd every second) -> cluster still has quorum
> > wait 5 minutes -> cluster still has quorum
> > try to start stop a node, resource, add or remove a resource -> this 
works, 
> > then the cluster recognizes the lost quorum
> 
> After any of these actions the cluster looses quorum? Or is it
> just after the node restart?
I added a Dummy resource at a time when the quorumd was not reachable. The
resource got created. The default target role is stopped, so the Dummy was
stopped. Before I was able to make the Dummy active, the cluster recognized
that it had lost quorum and refused to make the Dummy active.

> 
> > then restart the quorumd -> see answers going back from quorumd to DC 
node, 
> > but cluster has no quorum again
> > wait 5 minutes -> cluster still has no quorum again
> 
> I can recall that somebody else already complained about the same
> issue.
most likely me some months ago, fiddling around with 2.1.2 ;)

> 
> > restart heartbeat on one of the cluster nodes -> cluster recognizes the 
> > availablility of quorumd and gets quorum again
> > 
> > Setting a node to standby, does not make the cluster recognize that the 
> > quorum got lost, or is available again.
> > 
> > I also have seen, when there is a firewall, that drops packets, instead 
of 
> > answering with RST, when the quorumd is down, then the rate when the DC 
> > tries to reconnect to the quorumd drops to about once a minute, but that 
is 
> > OK, as I'd guess its waiting for timeouts.
> 
> Yes, looks like a TCP/IP property.
> 
> > So in my eyes, using a quorumd does more harm than being useful, but ma
> > did sth. wrong?
> 
> Since it has been working, you probably set it up ok. You should
> open a bugzilla for this. Sorry that I can't offer more help on
> the matter now.
> 
> BTW, did you also test a split brain situation where one of the
> nodes can talk to the quorumd?
No, I have decided to run the cluster without quorumd for now.
Nevertheless, I'll create a bugzilla entry.

cheers
Sebastian



[Linux-HA] problems with quorumd

2008-02-07 Thread Sebastian Reitenbach
Hi,

I have a 4-node cluster and wanted to set up a quorum server, so that I do
not need three running cluster nodes to get quorum. The quorumd IP address
is a shared IP on another two-node cluster.

I've done the following tests; the quorumd is from a 2.1.2 version of
heartbeat, the cluster nodes ran version 2.1.3:



start quorumd 
start first cluster node -> (node becomes DC, contacting the quorum server),
cluster gets quorum
start second cluster node -> cluster still has quorum
stop DC, -> see other node becoming DC, and contacting quorum server, 
cluster still has quorum
kill quorumd, then see RST packets going back to cluster node (the DC tries 
to contact the quorumd every second) -> cluster still has quorum
wait 5 minutes -> cluster still has quorum
try to start/stop a node or a resource, or add or remove a resource -> this
works, and then the cluster recognizes the lost quorum
then restart the quorumd -> see answers going back from quorumd to DC node, 
but cluster has no quorum again
wait 5 minutes -> cluster still has no quorum again
restart heartbeat on one of the cluster nodes -> cluster recognizes the
availability of quorumd and gets quorum again
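
One way to check the quorum state between the steps (a sketch; assuming the
cib tag still carries the have_quorum flag in this version):

  cibadmin -Q | head -1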

Setting a node to standby does not make the cluster recognize that quorum
was lost or that it is available again.

I have also seen that when there is a firewall that drops packets instead
of answering with RST while the quorumd is down, the rate at which the DC
tries to reconnect to the quorumd drops to about once a minute, but that is
OK, as I'd guess it is waiting for timeouts.

So in my eyes, using a quorumd does more harm than good, but maybe I
did something wrong?


cheers
Sebastian



Re: [Linux-HA] 2.1.3 suse rpm's?

2008-01-24 Thread Sebastian Reitenbach
Hi Andrew,

I just downloaded and configured today's heartbeat RPMs from you for sles10
x86_64. All mentioned problems are fixed. But PAM authentication to
log in with the GUI to the mgmtd still did not work. I had to
change /etc/pam.d/hbmgmtd to this:

#%PAM-1.0

[EMAIL PROTECTED] common-auth
[EMAIL PROTECTED] common-account
auth    required        pam_unix2.so
account required        pam_unix2.so


then it worked fine.
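
For comparison, the standard SUSE way to pull in the common configuration is
the include-based stanza, roughly (an assumption about what the original
file contained):

  auth     include        common-auth
  account  include        common-account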

cheers
Sebastian



Re: [Linux-HA] 2.1.3 suse rpm's?

2008-01-24 Thread Sebastian Reitenbach
Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> 
> On Jan 23, 2008, at 8:31 PM, Sebastian Reitenbach wrote:
> 
> > Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >>
> >> On Jan 23, 2008, at 7:21 PM, Sebastian Reitenbach wrote:
> >>
> >>> Hi,
> >>> Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>> it was that package - i've added it as a dependancy
> >>>>
> >>> I just installed on opensuse 10.3 i586, and pacemaker-pygui now
> >>> requires the
> >>> pyxml package. Nevertheless, to be able to start hb_gui I still need
> >>> to make
> >>> a symbolic link in /usr/lib/heartbeat-gui/ from _pymgmt.so.0 to
> >>> _pymgmt.so.
> >>
> >> ok, i'll make sure that link gets created
> >>
> >> thanks for your help getting the kinks worked out - i've had very
> >> little to do with the gui and would much prefer to pretend it doesn't
> >> exist :-)
> >
> > yeah, but it is getting better and better with every release.
> 
> true.
> the problem was that IBM pulled all their people off the project at a  
> point when the GUI was barely usable.
> the good news is that some of the folks from Novell China and NTT  
> Japan are getting involved and contributing some really good patches  
> for it.
> 
> >>> Nevertheless, hb_gui is not much useful, as the mgmtd is not able to
> >>> run,
> >>> when starting mgmtd -v the following shows up in the logs:
> >>
> >> it looks like heartbeat doesn't like the user mgmtd is running as
> >>
> >> who are you running that command as?
> > I was root, just ran mgmtd from commandline.
> 
> ok, i see the problem... you need to add the following two lines to  
> ha.cf
> 
> 
> apiauth   mgmtd   uid=root
> respawn   root    /usr/lib/heartbeat/mgmtd -v
> 
> 
> these used to be implied by the "crm yes" line but, according to the  
> logic in heartbeat.c, only when heartbeat is built with the mgmtd  
> (which it no longer is)

I added these two lines, and now mgmtd starts.

> 
> i'll add this to the FAQ
> 
> >> what does your ha.cf look like?
> >
> > logfacility local0
> > crm yes
> > cluster OpenBSD-Heartbeat
> > udpport 694
> > ucast eth0 ops.ds9
> > ucast eth0 defiant.ds9
> 
> > auto_failback on
> 
> btw. that line has no meaning in a crm/pacemaker cluster
Thanks for pointing that out.

cheers
Sebastian



Re: [Linux-HA] 2.1.3 suse rpm's?

2008-01-23 Thread Sebastian Reitenbach
Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> 
> On Jan 23, 2008, at 7:21 PM, Sebastian Reitenbach wrote:
> 
> > Hi,
> > Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >>
> >> it was that package - i've added it as a dependancy
> >>
> > I just installed on opensuse 10.3 i586, and pacemaker-pygui now  
> > requires the
> > pyxml package. Nevertheless, to be able to start hb_gui I still need  
> > to make
> > a symbolic link in /usr/lib/heartbeat-gui/ from _pymgmt.so.0 to  
> > _pymgmt.so.
> 
> ok, i'll make sure that link gets created
> 
> thanks for your help getting the kinks worked out - i've had very  
> little to do with the gui and would much prefer to pretend it doesn't  
> exist :-)

yeah, but it is getting better and better with every release.
> 
> >
> > Nevertheless, hb_gui is not much useful, as the mgmtd is not able to  
> > run,
> > when starting mgmtd -v the following shows up in the logs:
> 
> it looks like heartbeat doesn't like the user mgmtd is running as
> 
> who are you running that command as?
I was root, just ran mgmtd from commandline.
> what does your ha.cf look like?

logfacility local0
crm yes
cluster OpenBSD-Heartbeat
udpport 694
ucast eth0 ops.ds9
ucast eth0 defiant.ds9
auto_failback on
node    ops
node    defiant.ds9
ping 10.0.0.1
use_logd yes

I just copied the config from my OpenBSD box, where this works just fine.



> 
> > Jan 23 19:19:17 ops mgmtd: [29798]: info: G_main_add_SignalHandler:  
> > Added
> > signal handler for signal 15
> > Jan 23 19:19:17 ops mgmtd: [29798]: debug: Enabling coredumps
> > Jan 23 19:19:17 ops mgmtd: [29798]: WARN: Core dumps could be lost if
> > multiple dumps occur.
> > Jan 23 19:19:17 ops mgmtd: [29798]: WARN: Consider setting non- 
> > default value
> > in /proc/sys/kernel/core_pattern (or equivalent) for maximum  
> > supportability
> > Jan 23 19:19:17 ops mgmtd: [29798]: WARN: Consider
> > setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for  
> > maximum
> > supportability
> > Jan 23 19:19:17 ops mgmtd: [29798]: info: G_main_add_SignalHandler:  
> > Added
> > signal handler for signal 10
> > Jan 23 19:19:17 ops mgmtd: [29798]: info: G_main_add_SignalHandler:  
> > Added
> > signal handler for signal 12
> > Jan 23 19:19:17 ops mgmtd: [29798]: ERROR: Cannot sign on with  
> > heartbeat
> > Jan 23 19:19:17 ops mgmtd: [29798]: ERROR: REASON:
> > Jan 23 19:19:17 ops mgmtd: [29798]: ERROR: Can't initialize management
> > library.Shutting down.(-1)
> > Jan 23 19:19:17 ops heartbeat: [29647]: WARN: Client [mgmtd] pid 29798
> > failed authorization [no default client auth]
> > Jan 23 19:19:17 ops heartbeat: [29647]: ERROR:  
> > api_process_registration_msg:
> > cannot add client(mgmtd)
> >
> > This also happened on the SLES10.

Tss, accidentally sent it to me instead of the list.

Sebastian




Re: [Linux-HA] 2.1.3 suse rpm's?

2008-01-23 Thread Sebastian Reitenbach
Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> 
> On Jan 22, 2008, at 9:31 AM, Sebastian Reitenbach wrote:
> 
> > Hi,
> >
> > General Linux-HA mailing list  wrote:
> >>
> >> On Jan 21, 2008, at 5:09 PM, Andrew Beekhof wrote:
> >>
> >>>
> >>> On Jan 21, 2008, at 3:52 PM, matilda matilda wrote:
> >>>
> >>>>>>> "Sebastian Reitenbach" <[EMAIL PROTECTED]>
> >>>>>>> 21.01.2008 15:21 >>>
> >>>>> yes, that helped, I installed both rpm's, but now, when I want to
> >>>>> start the
> >>>>> hb_gui, I get the following error:
> >>>>>
> >>>>> Traceback (most recent call last):
> >>>>> File "/usr/bin/hb_gui", line 35, in ?
> >>>>> from pymgmt import *
> >>>>> ImportError: No module named pymgmt
> >>>>>
> >>>>> These files are installed:
> >>>>> find /usr/ -name "*pymgmt*"
> >>>>> /usr/lib64/heartbeat-gui/pymgmt.pyc
> >>>>> /usr/lib64/heartbeat-gui/_pymgmt.so.0
> >>>>> /usr/lib64/heartbeat-gui/_pymgmt.so.0.0.0
> >>>>> /usr/lib64/heartbeat-gui/pymgmt.py
> >>>>> I also created a symbolic link from _pymgmt.so.0 to _pymgmt.so,
> >>>>> because I
> >>>>> found a _pymgmt.so file in the same directory, that was installed
> >>>>> on another
> >>>>> SLES box with heartbeat 2.1.2, but that did not helped.
> >>>>> so I removed the 2.1.3 rpm's and installed the 2.1.2, and the
> >>>>> hb_gui is
> >>>>> working on that box, so at least no basic python stuff seems to be
> >>>>> missing.
> >>>>>
> >>>>> So there must still sth. missing to get the GUI working again, any
> >>>>> more
> >>>>> idea?
> >>>>
> >>>> Hi Sebastian, hi Andrew, hi all,
> >>>>
> >>>> I looked at /usr/bin/hb_gui of version 2.1.3 packed by Andrew.
> >>>> If you look at line 33,34,35 you'll see that the build process
> >>>> didn't replace the build environment variables @HA_DATADIR@
> >>>> and @HA_LIBDIR@ by their values.
> >>>
> >>> ah, well spotted.  i'll get them fixed.
> >>
> >> pushing up some new packages now - give them a moment to rebuild
> >> (fyi: Fedora x86_64 is currently not able to build due to a build
> >> service problem - just grab an i386 src.rpm and do a rpm rebuild)
> >>
> > Ok, I just tried these rpm's:
> > pacemaker-pygui-1.1-2.1
> > pacemaker-heartbeat-0.6.0-15.1
> > heartbeat-common-2.1.3-3.2
> > heartbeat-resources-2.1.3-3.2
> > heartbeat-ldirectord-2.1.3-3.1
> > heartbeat-2.1.3-3.2
> >
> > now I get the following error message when I try to start hb_gui:
> > hb_gui
> > Traceback (most recent call last):
> >  File "/usr/bin/hb_gui", line 29, in ?
> >from xml.parsers.xmlproc.xmldtd import load_dtd_string
> > ImportError: No module named xmlproc.xmldtd
> 
> I think you need the pyxml package for this
> 
> Can you confirm that for me?  If so I'll add it to the spec file as a  
> dependancy.

I had no time to check today; I had to downgrade to 2.1.2 yesterday to keep
me going. What I can say is that pyxml was not installed. 2.1.2 is
working, so this must be a new dependency, but I think there were some
changes to the GUI regarding parsing the DTD, so you might be right.
I hope I find some time tomorrow to retest.


Sebastian



Re: [Linux-HA] 2.1.3 suse rpm's?

2008-01-23 Thread Sebastian Reitenbach
Hi,
Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> 
> On Jan 22, 2008, at 9:35 AM, Sebastian Reitenbach wrote:
> 
> > deinstalling heartbeat-2.1.3-3.2 only works with --noscripts because  
> > of the
> > following error:
> > /usr/lib64/heartbeat/heartbeat: error while loading shared libraries:
> > libstonith.so.1: cannot open shared object file: No such file or  
> > directory
> > ..failed
> > error: %preun(heartbeat-2.1.3-3.2.x86_64) scriptlet failed, exit  
> > status 1
> 
> hmmm
> 
> I see
> %restart_on_update heartbeat
> 
> in the postun section which looks suspicious, but preun looks sane  
> enough:
> 
> %preun
> %if 0%{?suse_version}
>%stop_on_removal heartbeat
> %endif
> %if 0%{?fedora_version}
>/sbin/chkconfig --del heartbeat
> %endif
> 
> 
> was heartbeat-common still installed at the time?
> 
First I tried to deinstall everything together via rpm -e ...
It failed for the heartbeat package, while the rest was no longer installed.
Then I retried to deinstall heartbeat again, but it failed again
with the above error. In the end only rpm -e --noscripts helped.

Sebastian






Re: [Linux-HA] 2.1.3 suse rpm's?

2008-01-22 Thread Sebastian Reitenbach
Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA 
mailing list  wrote: 
> Hi,
> 
> General Linux-HA mailing list  wrote: 
> > 
> > On Jan 21, 2008, at 5:09 PM, Andrew Beekhof wrote:
> > 
> > >
> > > On Jan 21, 2008, at 3:52 PM, matilda matilda wrote:
> > >
> > >>>>> "Sebastian Reitenbach" <[EMAIL PROTECTED]>  
> > >>>>> 21.01.2008 15:21 >>>
> > >>> yes, that helped, I installed both rpm's, but now, when I want to  
> > >>> start the
> > >>> hb_gui, I get the following error:
> > >>>
> > >>> Traceback (most recent call last):
> > >>> File "/usr/bin/hb_gui", line 35, in ?
> > >>>  from pymgmt import *
> > >>> ImportError: No module named pymgmt
> > >>>
> > >>> These files are installed:
> > >>> find /usr/ -name "*pymgmt*"
> > >>> /usr/lib64/heartbeat-gui/pymgmt.pyc
> > >>> /usr/lib64/heartbeat-gui/_pymgmt.so.0
> > >>> /usr/lib64/heartbeat-gui/_pymgmt.so.0.0.0
> > >>> /usr/lib64/heartbeat-gui/pymgmt.py
> > >>> I also created a symbolic link from _pymgmt.so.0 to _pymgmt.so,  
> > >>> because I
> > >>> found a _pymgmt.so file in the same directory, that was installed  
> > >>> on another
> > >>> SLES box with heartbeat 2.1.2, but that did not helped.
> > >>> so I removed the 2.1.3 rpm's and installed the 2.1.2, and the  
> > >>> hb_gui is
> > >>> working on that box, so at least no basic python stuff seems to be  
> > >>> missing.
> > >>>
> > >>> So there must still sth. missing to get the GUI working again, any  
> > >>> more
> > >>> idea?
> > >>
> > >> Hi Sebastian, hi Andrew, hi all,
> > >>
> > >> I looked at /usr/bin/hb_gui of version 2.1.3 packed by Andrew.
> > >> If you look at line 33,34,35 you'll see that the build process
> > >> didn't replace the build environment variables @HA_DATADIR@
> > >> and @HA_LIBDIR@ by their values.
> > >
> > > ah, well spotted.  i'll get them fixed.
> > 
> > pushing up some new packages now - give them a moment to rebuild
> > (fyi: Fedora x86_64 is currently not able to build due to a build  
> > service problem - just grab an i386 src.rpm and do a rpm rebuild)
> > 
> Ok, I just tried these rpm's:
> pacemaker-pygui-1.1-2.1
> pacemaker-heartbeat-0.6.0-15.1
> heartbeat-common-2.1.3-3.2
> heartbeat-resources-2.1.3-3.2
> heartbeat-ldirectord-2.1.3-3.1
> heartbeat-2.1.3-3.2
> 
> now I get the following error message when I try to start hb_gui:
> hb_gui
> Traceback (most recent call last):
>   File "/usr/bin/hb_gui", line 29, in ?
> from xml.parsers.xmlproc.xmldtd import load_dtd_string
> ImportError: No module named xmlproc.xmldtd
> 
> I am on SLES10SP1 x86_64

deinstalling heartbeat-2.1.3-3.2 only works with --noscripts because of the 
following error:
/usr/lib64/heartbeat/heartbeat: error while loading shared libraries: 
libstonith.so.1: cannot open shared object file: No such file or directory
..failed
error: %preun(heartbeat-2.1.3-3.2.x86_64) scriptlet failed, exit status 1

Sebastian



Re: [Linux-HA] 2.1.3 suse rpm's?

2008-01-22 Thread Sebastian Reitenbach
Hi,

General Linux-HA mailing list  wrote: 
> 
> On Jan 21, 2008, at 5:09 PM, Andrew Beekhof wrote:
> 
> >
> > On Jan 21, 2008, at 3:52 PM, matilda matilda wrote:
> >
> >>>>> "Sebastian Reitenbach" <[EMAIL PROTECTED]>  
> >>>>> 21.01.2008 15:21 >>>
> >>> yes, that helped, I installed both rpm's, but now, when I want to  
> >>> start the
> >>> hb_gui, I get the following error:
> >>>
> >>> Traceback (most recent call last):
> >>> File "/usr/bin/hb_gui", line 35, in ?
> >>>  from pymgmt import *
> >>> ImportError: No module named pymgmt
> >>>
> >>> These files are installed:
> >>> find /usr/ -name "*pymgmt*"
> >>> /usr/lib64/heartbeat-gui/pymgmt.pyc
> >>> /usr/lib64/heartbeat-gui/_pymgmt.so.0
> >>> /usr/lib64/heartbeat-gui/_pymgmt.so.0.0.0
> >>> /usr/lib64/heartbeat-gui/pymgmt.py
> >>> I also created a symbolic link from _pymgmt.so.0 to _pymgmt.so,  
> >>> because I
> >>> found a _pymgmt.so file in the same directory, that was installed  
> >>> on another
> >>> SLES box with heartbeat 2.1.2, but that did not helped.
> >>> so I removed the 2.1.3 rpm's and installed the 2.1.2, and the  
> >>> hb_gui is
> >>> working on that box, so at least no basic python stuff seems to be  
> >>> missing.
> >>>
> >>> So there must still sth. missing to get the GUI working again, any  
> >>> more
> >>> idea?
> >>
> >> Hi Sebastian, hi Andrew, hi all,
> >>
> >> I looked at /usr/bin/hb_gui of version 2.1.3 packed by Andrew.
> >> If you look at line 33,34,35 you'll see that the build process
> >> didn't replace the build environment variables @HA_DATADIR@
> >> and @HA_LIBDIR@ by their values.
> >
> > ah, well spotted.  i'll get them fixed.
> 
> pushing up some new packages now - give them a moment to rebuild
> (fyi: Fedora x86_64 is currently not able to build due to a build  
> service problem - just grab an i386 src.rpm and do a rpm rebuild)
> 
Ok, I just tried these rpm's:
pacemaker-pygui-1.1-2.1
pacemaker-heartbeat-0.6.0-15.1
heartbeat-common-2.1.3-3.2
heartbeat-resources-2.1.3-3.2
heartbeat-ldirectord-2.1.3-3.1
heartbeat-2.1.3-3.2

now I get the following error message when I try to start hb_gui:
hb_gui
Traceback (most recent call last):
  File "/usr/bin/hb_gui", line 29, in ?
from xml.parsers.xmlproc.xmldtd import load_dtd_string
ImportError: No module named xmlproc.xmldtd

I am on SLES10SP1 x86_64

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: Re: [Linux-HA] 2.1.3 suse rpm's?

2008-01-21 Thread Sebastian Reitenbach
"matilda matilda" <[EMAIL PROTECTED]> wrote: 
> >>> "Sebastian Reitenbach" <[EMAIL PROTECTED]> 21.01.2008 
15:21 >>>
> > yes, that helped, I installed both rpm's, but now, when I want to start 
the 
> > hb_gui, I get the following error:
> >
> >Traceback (most recent call last):
> >  File "/usr/bin/hb_gui", line 35, in ?
> >from pymgmt import *
> >ImportError: No module named pymgmt
> >
> >These files are installed:
> >find /usr/ -name "*pymgmt*"
> >/usr/lib64/heartbeat-gui/pymgmt.pyc
> >/usr/lib64/heartbeat-gui/_pymgmt.so.0
> >/usr/lib64/heartbeat-gui/_pymgmt.so.0.0.0
> >/usr/lib64/heartbeat-gui/pymgmt.py
> >I also created a symbolic link from _pymgmt.so.0 to _pymgmt.so, because I 
> >found a _pymgmt.so file in the same directory, that was installed on 
another 
> >SLES box with heartbeat 2.1.2, but that did not helped.
> >so I removed the 2.1.3 rpm's and installed the 2.1.2, and the hb_gui is 
> >working on that box, so at least no basic python stuff seems to be 
missing.
> >
> >So there must still sth. missing to get the GUI working again, any more 
> >idea?
> 
> Hi Sebastian, hi Andrew, hi all,
> 
> I looked at /usr/bin/hb_gui of version 2.1.3 packed by Andrew.
> If you look at line 33,34,35 you'll see that the build process
> didn't replace the build environment variables @HA_DATADIR@
> and @HA_LIBDIR@ by their values. 
> Without that python inlcude path the modules necessary for
> the rest are not found. That's the reason for the import error.
> Version 2.1.2 does have for 32bit:
> -8<
> sys.path.append("/usr/share/heartbeat-gui")
> sys.path.append("/usr/lib/heartbeat-gui")
> from pymgmt import *
> -8<
> 
> 
Thanks for that hint. Below those lines I found a lot more occurrences of 
@HA_DATADIR@; I replaced all of them and replaced @HA_LIBDIR@ with /usr/lib64. I also 
had to create a symbolic link in /usr/lib64/heartbeat-gui from _pymgmt.so.0 
to _pymgmt.so. After doing all that, the GUI started up.
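
For the record, the manual fix boiled down to roughly this (the /usr/share and 
/usr/lib64 values are my guess from the 2.1.2 paths quoted above; a properly 
rebuilt package makes this unnecessary):

# replace the unexpanded build variables in hb_gui (replacement values assumed)
sed -i -e 's|@HA_DATADIR@|/usr/share|g' \
       -e 's|@HA_LIBDIR@|/usr/lib64|g' /usr/bin/hb_gui
# the GUI imports _pymgmt.so, but only _pymgmt.so.0 is shipped
ln -s /usr/lib64/heartbeat-gui/_pymgmt.so.0 /usr/lib64/heartbeat-gui/_pymgmt.so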

cheers
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 2.1.3 suse rpm's?

2008-01-21 Thread Sebastian Reitenbach
Hi,
Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> 
> On Jan 21, 2008, at 12:52 PM, Sebastian Reitenbach wrote:
> 
> > Hi,
> >
> > Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA
> > mailing list  wrote:
> >> Hi,
> >> Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >>>
> >>> On Jan 10, 2008, at 6:16 PM, Sebastian Reitenbach wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> on the download area, there are the pointers to the suse build
> >>>> service,
> >>>> providing rpms for opensuse versions and others. But there is still
> >>>> only
> >>>> heartbeat-2.1.2- sth.
> >>>> Is this intentional or is there sth. wrong?
> >>>
> >>> they're pretty close to what ended up in 2.1.3
> >>> i'll update them shortly when I do the first pacemaker release
> >>>
> >> ah, ok, that's fine.
> > I just wanted use the rpm packages for heartbeat 2.1.3 on SLES10SP1.
> >
> > I installed these rpm packets:
> > heartbeat-2.1.3-3.1.x86_64.rpm
> > heartbeat-ldirectord-2.1.3-3.1.x86_64.rpm
> > heartbeat-common-2.1.3-3.1.x86_64.rpm
> > heartbeat-resources-2.1.3-3.1.x86_64.rpm
> >
> > and had to find out, that the heartbeat-gui and the crm stuff, and  
> > maybe
> > more, seems to be missing. Is that intentionally left out, as the  
> > rpm names
> > also changed a bit? I thought the heartbeat 2.1.3 version is the  
> > last one
> > where the crm is still in heartbeat?
> 
> Now that Pacemaker 0.6.0 is out, the built-in CRM is no longer  
> supported (all bugs will be fixed in Pacemaker).
> Thus the Heartbeat packages on the build service are built without the  
> built-in CRM.
> Check the changelog in the Heartbeat package for exactly what is no  
> longer included.
yeah, my fault for not reading that file ;)


> 
> One thing I like about .deb is that you can recommend other packages  
> to install.
> This would have alerted to you that something was missing - alas there  
> is no such mechanism for rpm.
> 
> > Do I have to install the pacemaker-* and openais* rpm's to get the
> > functionality back? How well is the pacemaker-* stuff tested, is  
> > that ready
> > for production systems, or should I better stay with a heartbeat  
> > 2.1.2 or
> > the versions that install with SLES10SP1?
> 
> If you only wish to use the heartbeat stack, you only need one extra  
> rpm: pacemaker-heartbeat
yeah, just basically the same functionality as with 2.1.2 is wanted.

> 
> It contains essentially the same CRM code that was in 2.1.3 and is no  
> more/less production ready than what was in 2.1.3.
> You can see the testing criteria for releases at:
> http://www.clusterlabs.org/mw/Release_Testing
> 
> For the GUI, you'll need the pacemaker-pygui package.
> 
> The list of packages is described at:
> http://www.clusterlabs.org/mw/Install#Package_List
> 
> hope that helps
yes, that helped, I installed both rpm's, but now, when I want to start the 
hb_gui, I get the following error:

Traceback (most recent call last):
  File "/usr/bin/hb_gui", line 35, in ?
from pymgmt import *
ImportError: No module named pymgmt

These files are installed:
find /usr/ -name "*pymgmt*"
/usr/lib64/heartbeat-gui/pymgmt.pyc
/usr/lib64/heartbeat-gui/_pymgmt.so.0
/usr/lib64/heartbeat-gui/_pymgmt.so.0.0.0
/usr/lib64/heartbeat-gui/pymgmt.py
I also created a symbolic link from _pymgmt.so.0 to _pymgmt.so, because I 
found a _pymgmt.so file in the same directory that was installed on another 
SLES box with heartbeat 2.1.2, but that did not help.
So I removed the 2.1.3 rpm's and installed the 2.1.2, and the hb_gui is 
working on that box, so at least no basic python stuff seems to be missing.

So there must still be sth. missing to get the GUI working again; any more 
ideas?

cheers
Sebastian


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 2.1.3 suse rpm's?

2008-01-21 Thread Sebastian Reitenbach
Hi,

Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> Hi,
> 
> On Mon, Jan 21, 2008 at 12:52:37PM +0100, Sebastian Reitenbach wrote:
> > Hi,
> > 
> > Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA 
> > mailing list  wrote: 
> > > Hi,
> > > Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> > > > 
> > > > On Jan 10, 2008, at 6:16 PM, Sebastian Reitenbach wrote:
> > > > 
> > > > > Hi,
> > > > >
> > > > > on the download area, there are the pointers to the suse build  
> > > > > service,
> > > > > providing rpms for opensuse versions and others. But there is 
still  
> > > > > only
> > > > > heartbeat-2.1.2- sth.
> > > > > Is this intentional or is there sth. wrong?
> > > > 
> > > > they're pretty close to what ended up in 2.1.3
> > > > i'll update them shortly when I do the first pacemaker release
> > > > 
> > > ah, ok, that's fine.
> > I just wanted use the rpm packages for heartbeat 2.1.3 on SLES10SP1.
> > 
> > I installed these rpm packets:
> > heartbeat-2.1.3-3.1.x86_64.rpm 
> > heartbeat-ldirectord-2.1.3-3.1.x86_64.rpm
> > heartbeat-common-2.1.3-3.1.x86_64.rpm  
> > heartbeat-resources-2.1.3-3.1.x86_64.rpm
> > 
> > and had to find out, that the heartbeat-gui and the crm stuff, and maybe 
> > more, seems to be missing. Is that intentionally left out, as the rpm 
names 
> > also changed a bit?
> 
> The gui should be in a separate package. Isn't there such a
> package?

http://download.opensuse.org/repositories/server:/ha-clustering/SLES_10/x86_64/
there is only a pacemaker-pygui-1.0.0-5.4.x86_64.rpm.
I downloaded it and wanted to install it via rpm, but then libcrm and two 
other dependencies are reported missing.

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 2.1.3 suse rpm's?

2008-01-21 Thread Sebastian Reitenbach
Hi,

Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA 
mailing list  wrote: 
> Hi,
> Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> > 
> > On Jan 10, 2008, at 6:16 PM, Sebastian Reitenbach wrote:
> > 
> > > Hi,
> > >
> > > on the download area, there are the pointers to the suse build  
> > > service,
> > > providing rpms for opensuse versions and others. But there is still  
> > > only
> > > heartbeat-2.1.2- sth.
> > > Is this intentional or is there sth. wrong?
> > 
> > they're pretty close to what ended up in 2.1.3
> > i'll update them shortly when I do the first pacemaker release
> > 
> ah, ok, that's fine.
I just wanted use the rpm packages for heartbeat 2.1.3 on SLES10SP1.

I installed these rpm packets:
heartbeat-2.1.3-3.1.x86_64.rpm 
heartbeat-ldirectord-2.1.3-3.1.x86_64.rpm
heartbeat-common-2.1.3-3.1.x86_64.rpm  
heartbeat-resources-2.1.3-3.1.x86_64.rpm

and had to find out, that the heartbeat-gui and the crm stuff, and maybe 
more, seems to be missing. Is that intentionally left out, as the rpm names 
also changed a bit? I thought the heartbeat 2.1.3 version is the last one 
where the crm is still in heartbeat?

Do I have to install the pacemaker-* and openais* rpm's to get the 
functionality back? How well is the pacemaker-* stuff tested, is that ready 
for production systems, or should I better stay with a heartbeat 2.1.2 or 
the versions that install with SLES10SP1?


thanks 
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] cl_status listnodes does not honour -n or -p onOpenBSD

2008-01-16 Thread Sebastian Reitenbach
Hi,

Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> Hi,
> 
> On Wed, Jan 16, 2008 at 10:23:17AM +0100, Sebastian Reitenbach wrote:
> > Hi,
> > 
> > I wanted to use cl_status to get a list of the cluster nodes, without 
the 
> > ping nodes in the list on OpenBSD, but unfortunately, 
> > cl_status listnodes -n 
> > shows the ping nodes too, also with -p parameter, all nodes are shown. 
> > I use heartbeat 2.1.3 on OpenBSD.
> > I tried the same on a SLES10 with heartbeat 2.1.2 installed. There this 
> > command is working as documented.
> > Is this generally working with heartbeat 2.1.3, or is it just a problem 
with 
> > OpenBSD?
> > anybody could test on a different OS with HB 2.1.3 and let me know 
whether 
> > it works.
> 
> On openSUSE 10.3 it works without problems. Looks like it's
> OpenBSD specific.
Thanks a lot for testing; then it's my problem. I'll open a bug report and 
try to fix it.

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] cl_status listnodes does not honour -n or -p on OpenBSD

2008-01-16 Thread Sebastian Reitenbach
Hi,

I wanted to use cl_status to get a list of the cluster nodes, without the 
ping nodes in the list on OpenBSD, but unfortunately, 
cl_status listnodes -n 
shows the ping nodes too, also with -p parameter, all nodes are shown. 
I use heartbeat 2.1.3 on OpenBSD.
I tried the same on a SLES10 with heartbeat 2.1.2 installed. There this 
command is working as documented.
Is this generally working with heartbeat 2.1.3, or is it just a problem with 
OpenBSD?
anybody could test on a different OS with HB 2.1.3 and let me know whether 
it works.

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 2.1.3 suse rpm's?

2008-01-10 Thread Sebastian Reitenbach
Hi,
Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> 
> On Jan 10, 2008, at 6:16 PM, Sebastian Reitenbach wrote:
> 
> > Hi,
> >
> > on the download area, there are the pointers to the suse build  
> > service,
> > providing rpms for opensuse versions and others. But there is still  
> > only
> > heartbeat-2.1.2- sth.
> > Is this intentional or is there sth. wrong?
> 
> they're pretty close to what ended up in 2.1.3
> i'll update them shortly when I do the first pacemaker release
> 
ah, ok, that's fine.

thanks
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] 2.1.3 suse rpm's?

2008-01-10 Thread Sebastian Reitenbach
Hi,

on the download area, there are the pointers to the suse build service, 
providing rpms for opensuse versions and others. But there is still only 
heartbeat-2.1.2- sth. 
Is this intentional or is there sth. wrong?


Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] question regarding quorumd

2007-11-20 Thread Sebastian Reitenbach
Hi,
Zhen Huang <[EMAIL PROTECTED]> wrote: 
> Hi,
> 
> The DC node should try to connect to the quorumd sever periodically.
> If not, it should be a bug. 

I observed this behavior first on a two node Linux cluster. I just did some 
more tests with a two node OpenBSD cluster, and the quorumd on a Linux box.

The following I observed, test 1:
- configure usage of quorumd on the two heartbeat nodes
- start quorumd on the Linux node
- start the first cluster node
   - this is starting communication with quorumd, it gets quorum, and I can 
start managing resources
- start the second cluster node, and everything is still working well
- stop the quorumd
   - the DC keeps sending packets to the quorumd for about a minute, then
     stops and never tries again; the other node does not start trying to
     contact the quorumd either
- then kill one of the cluster nodes; the remaining node tries to 
  contact the quorumd, fails because it is not running, and is left 
  without quorum

Test 2:
- configure usage of quorumd on the two heartbeat nodes
- do NOT start quorumd on the Linux node
- start the first cluster node, see it failing to contact the quorumd; 
  it starts up the cluster without quorum (it only sends one packet to 
  the quorumd, receives a RST packet, and seems to never try again)
- start the second cluster node; this seems to trigger the DC to retry 
  contacting the quorumd (again, only one packet, then nothing more)
- both cluster nodes then together decide that the cluster runs without 
  quorum. Shouldn't the two cluster nodes be enough to acquire quorum?
- start the quorumd on the Linux box
- wait forever and see that the cluster nodes do not try to contact the 
  quorumd again; therefore the cluster keeps thinking it has no quorum at all.


As said, I initially observed this last week on a two node Linux test 
cluster with a third node running a quorumd, so it does not seem to be OS 
related.


kind regards
Sebastian

> 
> Sebastian Reitenbach wrote:
> > Hi,
> > 
> > Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> >> On Nov 13, 2007, at 11:13 AM, Sebastian Reitenbach wrote:
> >>
> >>> Hi,
> >>>
> >>> Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >>>> On Nov 9, 2007, at 4:34 PM, Sebastian Reitenbach wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I did some tests with a two node cluster and a third one running a
> >>>>> quorumd.
> >>>>>
> >>>>> I started the quorumd, and then the two cluster nodes.
> >>>>> The one that became DC, started to communicate with the remote
> >>>>> quorumd.
> >>>> The CRM (and thus the "DC") doesn't know anything about quorumd
> >>>> I believe this is purely the domain of the CCM and I've no idea how
> >>>> that works :-)
> >>>>
> >>>> We just consume membership data from it...
> >>>>
> >>>> So anyway, my point is that the fact that a node is the DC is
> >>>> irrelevant when it comes to quorumd.
> >>> but somehow the cluster knows, as only the DC is communicating with 
> >>> the
> >>> external quorumd.
> >> I think that its just a co-incidence that it happens to be the DC... 
> >> at least I hope it is.
> > I thought I read somewhere, that the DC is the one in charge of 
> > communicating with the remote quorumd, but I may be wrong here.
> > 
> >>> I just do not understand, why the cluster does not retry
> >>> to re-contact the quorumd after it lost connection to it. This was 
> >>> what I
> >>> assumed, after a disconnect to the remote quorumd, the cluster nodes 
> >>> should
> >>> try to contact it, and when the contact is there again, use it again.
> >> I agree - but I've never seen that code.  You'll have to contact alan 
> >> or file a bug for him.
> > Alan, in case you think this is a bug, I'll go create a bug report for 
> it.
> > Please let me know.
> > 
> >>>>> I killed the DC, saw the other becoming DC, and start communicating
> >>>>> to the remote quorumd, all fine, cluster still with quorum.
> >>>>> Then I killed the quorumd itself, the DC recognized, and started to
> >>>>> stop
> >>>>> all resource, because of the quorum_policy, as it lost quorum.
> >>>>>
> >>>>> Then I restarted the quorumd again, but the DC, still without 
> >>>>> quorum,
> >>>>> did not tried to communicate to the 

Re: [Linux-HA] question regarding quorumd

2007-11-19 Thread Sebastian Reitenbach
Hi,

Zhen Huang <[EMAIL PROTECTED]> wrote: 
> Hi,
> 
> The DC node should try to connect to the quorumd sever periodically.
> If not, it should be a bug. 

Thanks for clarifying. I'll retest later today when I'm back at home; if I 
can reproduce it, I'll open a bugzilla entry.

kind regards
Sebastian

> 
>
> Alan Robertson <[EMAIL PROTECTED]> 
> 11/14/2007 03:13 AM
> 
> To
> Sebastian Reitenbach <[EMAIL PROTECTED]>
> cc
> linux-ha@lists.linux-ha.org, Zhen Huang/China/[EMAIL PROTECTED]
> Subject
> Re: [Linux-HA] question regarding quorumd
> 
> 
> 
> 
> 
> 
> Sebastian Reitenbach wrote:
> > Hi,
> > 
> > Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> >> On Nov 13, 2007, at 11:13 AM, Sebastian Reitenbach wrote:
> >>
> >>> Hi,
> >>>
> >>> Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >>>> On Nov 9, 2007, at 4:34 PM, Sebastian Reitenbach wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I did some tests with a two node cluster and a third one running a
> >>>>> quorumd.
> >>>>>
> >>>>> I started the quorumd, and then the two cluster nodes.
> >>>>> The one that became DC, started to communicate with the remote
> >>>>> quorumd.
> >>>> The CRM (and thus the "DC") doesn't know anything about quorumd
> >>>> I believe this is purely the domain of the CCM and I've no idea how
> >>>> that works :-)
> >>>>
> >>>> We just consume membership data from it...
> >>>>
> >>>> So anyway, my point is that the fact that a node is the DC is
> >>>> irrelevant when it comes to quorumd.
> >>> but somehow the cluster knows, as only the DC is communicating with 
> >>> the
> >>> external quorumd.
> >> I think that its just a co-incidence that it happens to be the DC... 
> >> at least I hope it is.
> > I thought I read somewhere, that the DC is the one in charge of 
> > communicating with the remote quorumd, but I may be wrong here.
> > 
> >>> I just do not understand, why the cluster does not retry
> >>> to re-contact the quorumd after it lost connection to it. This was 
> >>> what I
> >>> assumed, after a disconnect to the remote quorumd, the cluster nodes 
> >>> should
> >>> try to contact it, and when the contact is there again, use it again.
> >> I agree - but I've never seen that code.  You'll have to contact alan 
> >> or file a bug for him.
> > Alan, in case you think this is a bug, I'll go create a bug report for 
> it.
> > Please let me know.
> > 
> >>>>> I killed the DC, saw the other becoming DC, and start communicating
> >>>>> to the remote quorumd, all fine, cluster still with quorum.
> >>>>> Then I killed the quorumd itself, the DC recognized, and started to
> >>>>> stop
> >>>>> all resource, because of the quorum_policy, as it lost quorum.
> >>>>>
> >>>>> Then I restarted the quorumd again, but the DC, still without 
> >>>>> quorum,
> >>>>> did not tried to communicate to the quorumd again.
> >>>>> I'd expect the still living DC to try to contact the quorumd, in
> >>>>> case it
> >>>>> comes back.
> >>>>>
> >>>>> If there is a good reason, why the DC is not trying to reconnect to
> >>>>> the
> >>>>> remote quorumd I'd really like to get enlightened from someone who
> >>>>> knows.
> 
> It should be trying to reconnect.  It _does_ communicate w/quorumd from
> a single machine/cluster.  I think that it's coincidence that it's the
> DC.  Huang Zhen wrote the code.  I've CCed him.  I'm at the LISA
> conference this week - if HZ doesn't get back to you by next Monday,
> I'll look into it.
> 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Xen & HA-clustering

2007-11-14 Thread Sebastian Reitenbach
On Thursday 15 November 2007 03:28:57 sadegh wrote:
> Hi All,
> How I can add xen to an HA-Cluster?
I use some SAN devices, presented to my cluster nodes, and have Xen 
instances 
configured to live on them. Then you only need to add a Xen resource to 
your cluster.
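
For illustration, a Xen domU primitive in the 2.x CIB can be added roughly 
like this (the resource id, domU config path and the cibadmin call are 
made-up examples, not my actual configuration):

# hedged sketch: define a Xen domU resource and load it into the CIB
cat > /tmp/xen-domU.xml <<'EOF'
<primitive id="xen_domU1" class="ocf" provider="heartbeat" type="Xen">
  <instance_attributes id="xen_domU1_ia">
    <attributes>
      <nvpair id="xen_domU1_xmfile" name="xmfile" value="/etc/xen/vm/domU1"/>
    </attributes>
  </instance_attributes>
</primitive>
EOF
cibadmin -C -o resources -x /tmp/xen-domU.xml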

> what is your idea about changing failover mechanism from stop/restart to
> live-migration?
I played a bit with live migration in linux-ha; it works in general but
has some issues. Nevertheless:
start/stop takes about 45 seconds, migration takes about 30 seconds.
Migration does not work in case of a failover, so it would only be useful 
during maintenance windows. In my eyes, not much is gained with live migration.

you might want to try the updated Xen resource script for linux-ha:
http://developerbugs.linux-foundation.org//show_bug.cgi?id=1778

It will allow you to monitor services within the Xen domU, and allow some 
simple memory management when you start/stop a domU.
Comments and test reports are welcome.


> very appreciate to have answer from you!
> Best Regards
> Sadegh Hooshmand
Kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ECCN classification for Linux-HA Heartbeat

2007-11-14 Thread Sebastian Reitenbach
Hi,

General Linux-HA mailing list  wrote: 
> On 2007-11-13T14:18:50, "Henriques, Tiago" <[EMAIL PROTECTED]> 
wrote:
> 
> > We are using Linux-HA Heartbeat in one of our products, and are now in
> > the process of collecting the information needed to export it to other
> > countries.
> > 
> > In order to do this, can you tell me whether any citizens of the United
> > States of America or people living in the U.S.A. have contributed to the
> > Linux-HA Heartbeat software?
> 
> Yes, heavily. US businesses, too.
> 
> > Can you also tell me what the U.S. Export Control Classification Number
> > (ECCN) for Linux-HA Heartbeat is, and whether a license exception may be
> > used for it? 
> 
> No idea. My very limitted understanding is that this is merely a
> component and requires an aggregate ECCN.
> 
> Heartbeat itself does not appear to be subject to any special export
> restrictions from the US, as it doesn't use nor provide encryption (just
> digital signatures).
Communication between the cluster and quorumd requires the use of X509 
certificates. I don't know whether that matters for you.
> 
> I'd recommend that asking a lawyer is the best path forward.

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] migration/fence after fail-count > X

2007-11-13 Thread Sebastian Reitenbach
Hi, 

Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> 
> On Nov 13, 2007, at 1:02 PM, Sebastian Reitenbach wrote:
> 
> > Hi,
> >
> > I read in the v2 FAQ the following:
> >
> > What happens when monitor detects the resource down?
> > The node will try to restart the resource, but if this fails, it  
> > will fail
> > over to an other node.
> > A feature that allows failover after N failures in a given period of  
> > time is
> > planned.
> >
> > Is that feature still planned?
> 
> thats how it works already - sort of.
> there is a layer of indirection with resource-failcount-stickiness,  
> but basically once failcount hits a threshold - the resource moves.
> 
> knowing what to set resource-failcount-stickiness to can be tricky.
> one of the easiest, i can turn my brain off, ways is:
> 1) to start the cluster and make sure everything is running
> 2) figure out the current score (see conversations regarding the  
> getscores.sh script that has been posted here)
Ah, I need to look for that.

> 3) divide said score by X and add 1
> 
> > Could it also be instead of failover, fence the node X when  
> > failcount > X?
> 
> no, at least not yet anyway
> 
> interesting idea though
I think that would be a viable option for resources that could get damaged 
or cause confusion when started multiple times in a cluster, e.g. Xen 
domU's, non-cluster-aware filesystems, IP addresses...
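
As a side note, the per-resource failcount can already be inspected and 
cleared by hand; roughly like this (tool and option names as I remember them 
from 2.1.x, resource and node names are just examples):

# show the current failcount of a resource on a given node
crm_failcount -G -U node1 -r NFS_MH
# clear it again once the underlying problem has been fixed
crm_failcount -D -U node1 -r NFS_MH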

> 
> > Or is that working already, and the FAQ is not upated?
> > At least when I see this:
> > http://www.linux-ha.org/v2/faq/forced_failover
> > It seems to work already, but only in combination with moving a  
> > resource to
> > another location, but not to be used to fence a node after a critical
> > fail-count is reached.
> > I've seen the fail_count utility, and tried to find examples on the  
> > webpage,
> > but that search was not too exhaustive.
> >
> > Also, can the fail-count of different resources be summed up to make a
> > decision in combination with fencing? E.g. Resources A, B, C...
> > The failcount of A=3, + B=4 = SUM=7 > 6, then fecnce the node where  
> > that
> > limit is reached.
> 
> as above. not at the moment
> 
Thanks for the input. I'll open some enhancement requests in bugzilla 
later today for the two things that are not possible yet.

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] migration/fence after fail-count > X

2007-11-13 Thread Sebastian Reitenbach
Hi,

I read in the v2 FAQ the following:

What happens when monitor detects the resource down?
The node will try to restart the resource, but if this fails, it will fail 
over to an other node. 
A feature that allows failover after N failures in a given period of time is 
planned.

Is that feature still planned? Could it also be instead of failover, fence 
the node X when failcount > X?

Or is that working already, and the FAQ is not updated?
At least when I see this:
http://www.linux-ha.org/v2/faq/forced_failover
It seems to work already, but only in combination with moving a resource to 
another location, but not to be used to fence a node after a critical 
fail-count is reached.
I've seen the fail_count utility, and tried to find examples on the webpage, 
but that search was not too exhaustive.

Also, can the fail-count of different resources be summed up to make a 
decision in combination with fencing? E.g. Resources A, B, C... 
The failcount of A=3 plus B=4 gives SUM=7 > 6, so fence the node where that 
limit is reached.

Kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] log warnings, but when I check no error seems to be there

2007-11-13 Thread Sebastian Reitenbach
Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> 
> On Nov 13, 2007, at 10:36 AM, Sebastian Reitenbach wrote:
> 
> > Hi,
> >
> > I see a lot of these messages in my logfile:
> >
> > pengine[12757]: 2007/11/13_10:27:02 WARN: process_pe_message:  
> > Transition
> > 7687: WARNINGs found during PE processing. PEngine Input stored
> > in: /var/lib/heartbeat/pengine/pe-warn-8072.bz2
> > pengine[12757]: 2007/11/13_10:27:02 info: process_pe_message:  
> > Configuration
> > WARNINGs found during PE processing.  Please run "crm_verify -L" to  
> > identify
> > issues.
> >
> > but when I check crm_verify -L then nothing shows up, I also did a:
> > bzcat /var/lib/heartbeat/pengine/pe-warn-8072.bz2 | crm_verify -p
> >
> > this command also produced no output.
> >
> > I am in a two node cluster, where one node is stopped, maybe that is  
> > the
> > reason?
> > What else could I do to figure out what the cluster thinks that a  
> > problem
> > is.
> 
> some warnings can only be determined when doing a full simulation (ie.  
> like ptest does)
> unfortunately crm_verify doesn't always have the status section and so  
> can't do a full simulation.
> 
> though when called with -L it would... i'll fix that for the next  
> version
> 
Ah, that's great, thank you.
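
Until then, replaying a stored PE input through ptest should show the full 
simulation warnings; something along these lines (option letters from memory, 
please check ptest's help output before relying on them):

# decompress the stored PE input and feed it to ptest for a full simulation
bzcat /var/lib/heartbeat/pengine/pe-warn-8072.bz2 > /tmp/pe-warn-8072.xml
ptest -VVV -x /tmp/pe-warn-8072.xml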

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] question regarding quorumd

2007-11-13 Thread Sebastian Reitenbach
Hi,

Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> 
> On Nov 13, 2007, at 11:13 AM, Sebastian Reitenbach wrote:
> 
> > Hi,
> >
> > Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >>
> >> On Nov 9, 2007, at 4:34 PM, Sebastian Reitenbach wrote:
> >>
> >>> Hi,
> >>>
> >>> I did some tests with a two node cluster and a third one running a
> >>> quorumd.
> >>>
> >>> I started the quorumd, and then the two cluster nodes.
> >>> The one that became DC, started to communicate with the remote
> >>> quorumd.
> >>
> >> The CRM (and thus the "DC") doesn't know anything about quorumd
> >> I believe this is purely the domain of the CCM and I've no idea how
> >> that works :-)
> >>
> >> We just consume membership data from it...
> >>
> >> So anyway, my point is that the fact that a node is the DC is
> >> irrelevant when it comes to quorumd.
> > but somehow the cluster knows, as only the DC is communicating with  
> > the
> > external quorumd.
> 
> I think that its just a co-incidence that it happens to be the DC...  
> at least I hope it is.
I thought I read somewhere, that the DC is the one in charge of 
communicating with the remote quorumd, but I may be wrong here.

> 
> > I just do not understand, why the cluster does not retry
> > to re-contact the quorumd after it lost connection to it. This was  
> > what I
> > assumed, after a disconnect to the remote quorumd, the cluster nodes  
> > should
> > try to contact it, and when the contact is there again, use it again.
> 
> I agree - but I've never seen that code.  You'll have to contact alan  
> or file a bug for him.
Alan, in case you think this is a bug, I'll go create a bug report for it.
Please let me know.

> 
> >>> I killed the DC, saw the other becoming DC, and start communicating
> >>> to the remote quorumd, all fine, cluster still with quorum.
> >>> Then I killed the quorumd itself, the DC recognized, and started to
> >>> stop
> >>> all resource, because of the quorum_policy, as it lost quorum.
> >>>
> >>> Then I restarted the quorumd again, but the DC, still without  
> >>> quorum,
> >>> did not tried to communicate to the quorumd again.
> >>> I'd expect the still living DC to try to contact the quorumd, in
> >>> case it
> >>> comes back.
> >>>
> >>> If there is a good reason, why the DC is not trying to reconnect to
> >>> the
> >>> remote quorumd I'd really like to get enlightened from someone who
> >>> knows.
> >>>
kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] question regarding quorumd

2007-11-13 Thread Sebastian Reitenbach
Hi,

Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> 
> On Nov 9, 2007, at 4:34 PM, Sebastian Reitenbach wrote:
> 
> > Hi,
> >
> > I did some tests with a two node cluster and a third one running a  
> > quorumd.
> >
> > I started the quorumd, and then the two cluster nodes.
> > The one that became DC, started to communicate with the remote  
> > quorumd.
> 
> The CRM (and thus the "DC") doesn't know anything about quorumd
> I believe this is purely the domain of the CCM and I've no idea how  
> that works :-)
> 
> We just consume membership data from it...
> 
> So anyway, my point is that the fact that a node is the DC is  
> irrelevant when it comes to quorumd.
but somehow the cluster knows, as only the DC is communicating with the 
external quorumd. I just do not understand, why the cluster does not retry 
to re-contact the quorumd after it lost connection to it. This was what I 
assumed, after a disconnect to the remote quorumd, the cluster nodes should
try to contact it, and when the contact is there again, use it again.

kind regards
Sebastian



> 
> >
> > I killed the DC, saw the other becoming DC, and start communicating
> > to the remote quorumd, all fine, cluster still with quorum.
> > Then I killed the quorumd itself, the DC recognized, and started to  
> > stop
> > all resource, because of the quorum_policy, as it lost quorum.
> >
> > Then I restarted the quorumd again, but the DC, still without quorum,
> > did not tried to communicate to the quorumd again.
> > I'd expect the still living DC to try to contact the quorumd, in  
> > case it
> > comes back.
> >
> > If there is a good reason, why the DC is not trying to reconnect to  
> > the
> > remote quorumd I'd really like to get enlightened from someone who  
> > knows.
> >
> > kind regards
> > Sebastian
> >
> > ___
> > Linux-HA mailing list
> > Linux-HA@lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> 
> 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] log warnings, but when I check no error seems to be there

2007-11-13 Thread Sebastian Reitenbach
Hi,

I see a lot of these messages in my logfile:

pengine[12757]: 2007/11/13_10:27:02 WARN: process_pe_message: Transition 
7687: WARNINGs found during PE processing. PEngine Input stored 
in: /var/lib/heartbeat/pengine/pe-warn-8072.bz2
pengine[12757]: 2007/11/13_10:27:02 info: process_pe_message: Configuration 
WARNINGs found during PE processing.  Please run "crm_verify -L" to identify 
issues.

but when I check crm_verify -L then nothing shows up, I also did a:
bzcat /var/lib/heartbeat/pengine/pe-warn-8072.bz2 | crm_verify -p

this command also produced no output.

I am in a two node cluster, where one node is stopped, maybe that is the 
reason?
What else could I do to figure out what the cluster thinks that a problem 
is.

I am using heartbeat 2.1.2-4.1 on opensuse 10.2 x86_64

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] pingd removed transient attr from node attributes after short network outage, and did not recreated it

2007-11-12 Thread Sebastian Reitenbach
Hi,

> 
> This is what happened: Due to a membership issue which has been
> only recently resolved, the crmd/cib combo would jointly leave
> the cluster. Other cib clients where supposed to follow in order
> to be started again by the master process and then connect to the
> new cib instance. But attrd doesn't have such a feature. There's
> now a bugzilla for that:
> 
> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1776
> 
> In the meantime, you could try with a newer heartbeat version.
Thanks, I'll retest when the next interim version is out.

> 
> Thanks for the report.
> 
No problem, I only wanted to know whether I am right, or the cluster ;)

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] pingd removed transient attr from node attributes after short network outage, and did not recreated it

2007-11-12 Thread Sebastian Reitenbach
Hi,

I did some more tests with my two node cluster, regarding pingd.

I started the two node cluster. Both nodes came up, resources are
distributed as the location constraints define it. The location of the
Xen resources are dependent on pingd attributes.
Then, on the single ping node, I flushed the state tables and only
allowed pings from the host ppsdb101. I saw the
Xen resources moving, everything great. I changed the firewall
on the ping node to only allow pings from the ppsnfs101 host. Well,
all four Xen resources moved over to the ppsnfs101 host.
At 16:17 I disabled both ports of the switch where the nodes
are connected; a real-life use case would be:
1. non-redundant network layout
2. no stonith, or stonith over network (e.g. ilo or ssh)
3. someone removes power from the switch where both nodes are connected

Then I waited about 10 seconds, and enabled both ports again.
The RSTP took some more seconds to restructure.

After that, both nodes could communicate with each other again and
the pings were reaching the ping node again, but the lines that pingd
produces as transient attributes on the nodes were both gone.

Before I removed the cable, I issued a
cibadmin -Q -o status | grep ping
and the two lines, one for each host, showed up. After disconnecting
both hosts and reconnecting, rerunning the cibadmin command
showed that both attr lines were gone. I waited for about 5-10 minutes,
but they did not come back. I did that several times, with one or the
other node or both being able to ping the ping node before
disabling the switch ports.

I expected the transient pingd attributes that the nodes had 
A) not to disappear, but only to get reset to 0, or
B) in case it is ok that they disappeared, to come back 
when the nodes are receiving echo replies from the ping node again.

But maybe I am still missing sth or misunderstood. Who is right, me or the 
cluster?

output of hb_report is attached. 

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] problem with locations depending on pingd

2007-11-10 Thread Sebastian Reitenbach
Hi,

> > 
> > Looks like heartbeat didn't notice the ping node went away.
> > If that doesn't happen, then the score wouldn't change.
> > 
> > Are you sure you made the right change?
> 100% sure, I tested it several times. Started the ping node with allowing 
> pings from say node A, but not node B, made sure with manual ping. Then 
> started the cluster, and I saw all resources starting on A. Then 
> reconfiguring the firwall on the ping node to answer pings from A and B, 
> no need to check that it works, I just saw some of the resources 
> migrating... Up to that point everything was as I expected. Then I could
> reconfigure the firewall on the ping node to not answer pings from either 
A 
> or B anymore, but the value of pingd in the node attributes was not reset 
to 
> 0. This is what I observed. Well, while writing this, I did not fired up 
> tcpdump to see whether the answers really stopped, maybe the ping node 
kept 
> track of some states? But I manually pinged the ping node from the cluster 
> node that I disabled, and I did not got an answer.
> 
Dumb user error on my side. After starting tcpdump on the ping node and 
reconfiguring the firewall, I saw it was as I thought: 
the firewall was too smart for me ;) 
After flushing the state tables, the firewall stopped answering the pings 
and the attribute got reset, so everything works now as expected.
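
For the record, what I effectively did on the OpenBSD ping node was roughly 
this (assuming the default ruleset path):

# reload the changed pf ruleset on the ping node
pfctl -f /etc/pf.conf
# flush the existing states so already-established ICMP flows stop being answered
pfctl -F states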

sorry for the noise

thanks
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] problem with locations depending on pingd

2007-11-09 Thread Sebastian Reitenbach
Hi,

Andrew Beekhof <[EMAIL PROTECTED]> wrote: 
> 
> On Nov 9, 2007, at 1:44 PM, Sebastian Reitenbach wrote:
> 
> > Hi,
> >
> > I changed the resources to look like this:
> >
> >   
> > 
> >> operation="not_defined"/>
> >> operation="lte"
> > value="0"/>
> > 
> > 
> >> operation="eq"
> > value="ppsnfs101"/>
> > 
> >   
> >
> >
> > It seems to work well on startup, but I still have the same problem  
> > that the
> > attribute that the pingd sets is not reset to 0 when pingd stops  
> > receiving
> > ping answers from the ping node.
> 
> Looks like heartbeat didn't notice the ping node went away.
> If that doesn't happen, then the score wouldn't change.
> 
> Are you sure you made the right change?
100% sure, I tested it several times. Started the ping node with allowing 
pings from say node A, but not node B, made sure with manual ping. Then 
started the cluster, and I saw all resources starting on A. Then 
reconfiguring the firewall on the ping node to answer pings from A and B, 
no need to check that it works, I just saw some of the resources 
migrating... Up to that point everything was as I expected. Then I could
reconfigure the firewall on the ping node to not answer pings from either A 
or B anymore, but the value of pingd in the node attributes was not reset to 
0. This is what I observed. Well, while writing this, I had not fired up 
tcpdump to see whether the answers really stopped; maybe the ping node kept 
track of some states? But I manually pinged the ping node from the cluster 
node that I disabled, and I did not get an answer.

Sebastian





> 
> >
> > I created a bugzilla entry, with a hb_report appended:
> > http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi? 
> > id=1770
> >
> > kind regards
> > Sebastian
> >
> >
> > Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA
> > mailing list  wrote:
> >> Hi Dejan,
> >>
> >> thank you very much for your helpful hints, I got it mostly  
> >> working. I
> >> initially generated the constraints via the GUI, and did not  
> >> recognized
> > the
> >> subtle differences.I changed them manually to look like what you
> > suggested,
> >> in your first example. I have to admit, I did not tried yet the - 
> >> INFINITY
> >> example you gave, where the resources will refuse to work on a node
> > without
> >> connectivity. Because I think it would not work, when I see my
> > observations:
> >>
> >> In the beginning, after cluster startup, node
> >> 262387d6-3ba0-4001-95c6-f394d1ba243f
> >> is not able to ping, node 15854123-86ef-46bb-bf95-79c99fb62f46 is  
> >> able to
> >> ping
> >> the defined ping node.
> >> cibadmin -Q -o status | grep ping
> >>  >> provider="heartbeat">
> >>  >> provider="heartbeat">
> >>>> name="pingd" value="0"/>
> >>  >> provider="heartbeat">
> >>  >> provider="heartbeat">
> >>>> name="pingd" value="100"/>
> >>
> >> then, all four resources are on host 15854123-86ef-46bb- 
> >> bf95-79c99fb62f46,
> >> so everything as I expected.
> >> then, I changed the firewall to not answer pings from
> >> 15854123-86ef-46bb-bf95-79c99fb62f46
> >> but instead answer pings from: 262387d6-3ba0-4001-95c6- 
> >> f394d1ba243f, then
> >> it took some seconds, and the output changed to:
> >>
> >> cibadmin -Q -o status | grep ping
> >>>> name="pingd" value="100"/>
> >>  >> provider="heartbeat">
> >>  >> provider="heartbeat">
> >>>> name="pingd" value="100"/>
> >>  >> provider="heartbeat">
> >>  >> provider="heartbeat">
> >>
> >> and two of the resources went over to the node
> >> 262387d6-3ba0-4001-95c6-f394d1ba243f.
> >>
> >> but also after some more minutes, the output of cibadmin -Q -o  
> >> status |
> > grep
> >> ping
> >> did not changed again. Id expected it to look like this:
> >>>> n

[Linux-HA] question regarding quorumd

2007-11-09 Thread Sebastian Reitenbach
Hi,

I did some tests with a two node cluster and a third one running a quorumd.

I started the quorumd, and then the two cluster nodes. 
The one that became DC, started to communicate with the remote quorumd.
I killed the DC, saw the other becoming DC, and start communicating
to the remote quorumd, all fine, cluster still with quorum.

Then I killed the quorumd itself, the DC recognized, and started to stop
all resource, because of the quorum_policy, as it lost quorum.

Then I restarted the quorumd again, but the DC, still without quorum,
did not try to communicate with the quorumd again.
I'd expect the still living DC to try to contact the quorumd, in case it
comes back.

If there is a good reason, why the DC is not trying to reconnect to the
remote quorumd I'd really like to get enlightened from someone who knows.

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] question regarding to quorumd

2007-11-09 Thread Sebastian Reitenbach
Hi,

Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA 
mailing list  wrote: 
> Hi,
> 
> here: http://www.linux-ha.org/QuorumServerGuide
> I read that for the /etc/ha.d/quorumd.conf the version has to be:
> the version of the protocol between the quorum server and its clients 
(2_0_8 
> is the only version supported now)
> 
> Is this still true for newer version of heartbeat too, e.g. I use 
heartbeat 
> 2.1.2, but maybe the quorum protocol version is still the same?
> 
I think I can answer the question myself: I found the 
file /usr/lib64/heartbeat/plugins/quorumd/2_0_8.so,
so I assume it is still protocol version 2_0_8.

sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] problem with locations depending on pingd

2007-11-09 Thread Sebastian Reitenbach
Hi,

I changed the resources to look like this:

   
 
   
   
 
 
   
 
   


It seems to work well on startup, but I still have the same problem that the 
attribute that the pingd sets is not reset to 0 when pingd stops receiving 
ping answers from the ping node.
I created a bugzilla entry, with a hb_report appended:
http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1770
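
(The XML I pasted above was mangled by the list archive. Reconstructed from 
the surviving attribute fragments, the constraint was roughly of the 
following shape; the ids are shortened and the scores are my guess, only the 
expressions are certain. The cibadmin call is just one way to load it.)

# hedged reconstruction of the location constraint, not my literal config
cat > /tmp/NFS_MH_location.xml <<'EOF'
<rsc_location id="NFS_MH:connected" rsc="NFS_MH">
  <rule id="NFS_MH:connected:rule" score="-INFINITY" boolean_op="or">
    <expression id="NFS_MH:pingd:undef" attribute="pingd" operation="not_defined"/>
    <expression id="NFS_MH:pingd:zero" attribute="pingd" operation="lte" value="0"/>
  </rule>
  <rule id="NFS_MH:prefer:rule" score="100">
    <expression id="NFS_MH:prefer:node" attribute="#uname" operation="eq" value="ppsnfs101"/>
  </rule>
</rsc_location>
EOF
cibadmin -C -o constraints -x /tmp/NFS_MH_location.xml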

kind regards
Sebastian


Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA 
mailing list  wrote: 
> Hi Dejan,
> 
> thank you very much for your helpful hints, I got it mostly working. I 
> initially generated the constraints via the GUI, and did not recognized 
the 
> subtle differences.I changed them manually to look like what you 
suggested, 
> in your first example. I have to admit, I did not tried yet the -INFINITY 
> example you gave, where the resources will refuse to work on a node 
without 
> connectivity. Because I think it would not work, when I see my 
observations:
> 
> In the beginning, after cluster startup, node 
> 262387d6-3ba0-4001-95c6-f394d1ba243f
> is not able to ping, node 15854123-86ef-46bb-bf95-79c99fb62f46 is able to 
> ping 
> the defined ping node.
> cibadmin -Q -o status | grep ping
>   provider="heartbeat">
>   provider="heartbeat">
> name="pingd" value="0"/>
>   provider="heartbeat">
>   provider="heartbeat">
> name="pingd" value="100"/>
> 
> then, all four resources are on host 15854123-86ef-46bb-bf95-79c99fb62f46,
> so everything as I expected.
> then, I changed the firewall to not answer pings from 
> 15854123-86ef-46bb-bf95-79c99fb62f46
> but instead answer pings from: 262387d6-3ba0-4001-95c6-f394d1ba243f, then 
> it took some seconds, and the output changed to:
> 
> cibadmin -Q -o status | grep ping
> name="pingd" value="100"/>
>   provider="heartbeat">
>   provider="heartbeat">
> name="pingd" value="100"/>
>   provider="heartbeat">
>   provider="heartbeat">
> 
> and two of the resources went over to the node 
> 262387d6-3ba0-4001-95c6-f394d1ba243f.
> 
> but also after some more minutes, the output of cibadmin -Q -o status | 
grep 
> ping
> did not changed again. Id expected it to look like this:
> name="pingd" value="100"/>
>   provider="heartbeat">
>   provider="heartbeat">
> name="pingd" value="0"/>
>   provider="heartbeat">
>   provider="heartbeat">
> and that the two resources from 15854123-86ef-46bb-bf95-79c99fb62f46 would 
> migrate to 
> node 262387d6-3ba0-4001-95c6-f394d1ba243f
> 
> My assumption is, that the -INFINITY example would only work, when the 
value 
> for the id
> status-15854123-86ef-46bb-bf95-79c99fb62f46-pingd would be resetted to 0 
at 
> some 
> point, but it is not. Therefore I did not tried.
> 
> 
> below are my constraints, the ping clone resource, and an exemplary Xen 
> resource.
> 
>  
> to="MGMT_DB" action="start" symmetrical="false" score="0"/>
> to="NFS_MH" action="start" symmetrical="false" score="0"/>
> to="NFS_SW" action="start" symmetrical="false" score="0"/>
> to="NFS_SW" action="start" symmetrical="false" score="0"/>
> to="NFS_MH" action="start" symmetrical="false" score="0"/>
> to="NFS_SW" action="start" symmetrical="false" score="0"/>
>
>  
> id="e248586f-284b-4d6e-86a1-86ac54cecb3d" operation="defined"/>
>  
>
>
>  
> id="ccd4c85c-7b30-48c5-806e-d37a42e3db5b" operation="defined"/>
>  
>
>
>  
> id="ff209e83-ac2e-4dad-901b-f6496c652f3b" operation="defined"/>
>  
>
>
>  
> id="4349f298-2f36-4bfa-9318-ed9863ab32bb" operation="defined"/>
>  
>
>  
> 
> 
> 
>
>  
>
>   value="started"/>
>   value="2"/>
>   name="clone_node_max" value="1"/>
>   name="globally_unique" value="false"/&

[Linux-HA] question regarding to quorumd

2007-11-09 Thread Sebastian Reitenbach
Hi,

here: http://www.linux-ha.org/QuorumServerGuide
I read that for the /etc/ha.d/quorumd.conf the version has to be:
the version of the protocol between the quorum server and its clients (2_0_8 
is the only version supported now)

Is this still true for newer version of heartbeat too, e.g. I use heartbeat 
2.1.2, but maybe the quorum protocol version is still the same?

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] observations after some fencing tests in a two node

2007-11-08 Thread Sebastian Reitenbach
Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA 
mailing list  wrote: 
> Hi,
> Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> > Hi,
> > 
> > On Wed, Nov 07, 2007 at 04:43:32PM +0100, Sebastian Reitenbach wrote:
> > > Hi all,
> > > 
> > > I did some fencing tests in a two node cluster, here are some details 
of 
> my 
> > > setup:
> > > 
> > > - use stonith external/ilo for fencing (ssh to ilo board and issue a 
> reset 
> > > command)
> > > - both nodes are connected via two bridged ethernet interfaces to two 
> > > redundant switches. The ilo boards are connected to the each of the 
> > > switches.
> > > 
> > > My first observation:
> > > - when removing the network cables from the node that is the DC at the 
> > > moment, it took at least three minutes, until it decided to stonith 
the 
> > > other node and to startup the resources that ran on the node without 
> network 
> > > connectivity
> > > - when removing the network cables from the node that is not the DC, 
> then it 
> > > was a matter of e.g. 20 seconds, then this node fenced the DC, and 
then 
> > > became DC
> > 
> > This definitely deserves a set of logs, etc (is your hb_report
> > operational? :).
> humm, yes, with the latest patches (:
> ok, I'll reproduce the problem and create a report.

I seem to be unable to reproduce the problem. That day when it happened, 
there must have been some orphaned actions/resources or whatever in the way 
that have meanwhile disappeared.

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] problem with locations depending on pingd

2007-11-08 Thread Sebastian Reitenbach
Hi Dejan,

thank you very much for your helpful hints, I got it mostly working. I 
initially generated the constraints via the GUI and did not recognize the 
subtle differences. I changed them manually to look like what you suggested 
in your first example. I have to admit I have not yet tried the -INFINITY 
example you gave, where the resources will refuse to work on a node without 
connectivity, because I think it would not work, judging from my observations:

In the beginning, after cluster startup, node 
262387d6-3ba0-4001-95c6-f394d1ba243f
is not able to ping, node 15854123-86ef-46bb-bf95-79c99fb62f46 is able to 
ping 
the defined ping node.
cibadmin -Q -o status | grep ping
 
 
   
 
 
   

then, all four resources are on host 15854123-86ef-46bb-bf95-79c99fb62f46,
so everything as I expected.
then, I changed the firewall to not answer pings from 
15854123-86ef-46bb-bf95-79c99fb62f46
but instead answer pings from: 262387d6-3ba0-4001-95c6-f394d1ba243f, then 
it took some seconds, and the output changed to:

cibadmin -Q -o status | grep ping
   
 
 
   
 
 

and two of the resources went over to the node 
262387d6-3ba0-4001-95c6-f394d1ba243f.

but also after some more minutes, the output of cibadmin -Q -o status | grep 
ping
did not change again. I'd expected it to look like this:
   
 
 
   
 
 
and that the two resources from 15854123-86ef-46bb-bf95-79c99fb62f46 would 
migrate to 
node 262387d6-3ba0-4001-95c6-f394d1ba243f

My assumption is that the -INFINITY example would only work when the value 
for the id
status-15854123-86ef-46bb-bf95-79c99fb62f46-pingd were reset to 0 at 
some 
point, but it is not. Therefore I did not try it.


below are my constraints, the ping clone resource, and an exemplary Xen 
resource.
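
(The configuration paste below got eaten by the list archive. For 
orientation, the pingd clone was roughly of the following shape; the values 
follow the fragments that survived further down, while the dampen value and 
all ids are guesses, so this is a sketch rather than my literal config.)

# hedged reconstruction of the pingd clone, loadable with cibadmin
cat > /tmp/pingd_clone.xml <<'EOF'
<clone id="pingd_clone">
  <instance_attributes id="pingd_clone_ia">
    <attributes>
      <nvpair id="pingd_clone_max" name="clone_max" value="2"/>
      <nvpair id="pingd_clone_node_max" name="clone_node_max" value="1"/>
      <nvpair id="pingd_clone_unique" name="globally_unique" value="false"/>
    </attributes>
  </instance_attributes>
  <primitive id="pingd_child" class="ocf" provider="heartbeat" type="pingd">
    <instance_attributes id="pingd_child_ia">
      <attributes>
        <nvpair id="pingd_host_list" name="host_list" value="192.168.102.199"/>
        <nvpair id="pingd_name" name="name" value="pingd"/>
        <nvpair id="pingd_multiplier" name="multiplier" value="100"/>
        <nvpair id="pingd_dampen" name="dampen" value="5s"/>
        <nvpair id="pingd_pidfile" name="pidfile" value="/tmp/PING.pid"/>
        <nvpair id="pingd_user" name="user" value="root"/>
      </attributes>
    </instance_attributes>
  </primitive>
</clone>
EOF
cibadmin -C -o resources -x /tmp/pingd_clone.xml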

 
   
   
   
   
   
   
   
 
   
 
   
   
 
   
 
   
   
 
   
 
   
   
 
   
 
   
 



   
 
   
 
 
 
 
   
 
 
   
 
   
   
   
   
   
   
 
   
 
   



  
 
   
 
 
 
 
   
 
 
   
 
   
 
 
   
 
   



kind regards
Sebastian

Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> Hi,
> 
> On Wed, Nov 07, 2007 at 06:31:54PM +0100, Sebastian Reitenbach wrote:
> > Hi,
> > 
> > I tried to follow http://www.linux-ha.org/pingd, the section
> > "Quickstart - Only Run my_resource on Nodes with Access to at Least One 
Ping 
> > Node"
> > 
> > therefore I have created the following pingd resources:
> > 
> >
> 
>  
> 
> because all the clones will be equal.
> 
> >  
> >
> >   > value="started"/>
> >   > value="2"/>
> >   > name="clone_node_max" value="1"/>
> >
> >  
> >  
> >
> >  
> > > value="/tmp/PING.pid"/>
> > > value="root"/>
> > > name="host_list" value="192.168.102.199"/>
> > > value="pingd"/>
> 
> add these two
>   
>   
> 
> >  
> >
> >  
> >
> > 
> > 
> > and here is my location constraint (entered via hb_gui, thererfore is a 
> > value there):
> > 
> >
> >  
> > > id="4349f298-2f36-4bfa-9318-ed9863ab32bb" operation="defined" 
value="af"/>
> >  
> 
> Looks somewhat strange. There are quite a few better examples on
> the page you quoted:
> 
> 
> 
>attribute="pingd" operation="defined"/>
>   
> 
> 
> or, perhaps better:
> 
> 
>   
>   
>   
>   
> 
> 
> The latter will have a score of -INFINITY for all nodes which
> don't have an attribute or it's value is zero thus preventing the
> resource from running there.
> 
> > The 192.168.102.199 is just an openbsd host, pingable from both cluster 
> > nodes. The NFS_MH resource is a Xen domU.
> > On startup of the two cluster nodes, the NFS_MH node went to node1.
> > Then I reconfigured the firewall of the ping node to only answer 
> > pings from node2. 
> > In the cluster itself, nothing happened, but I expected the resource to 
> > relocate to the node with connectivity. I still must do sth. wrong I 
think, 
> > any hints?
> 
>

Re: [Linux-HA] Heartbeat lrmd is core dumping

2007-11-08 Thread Sebastian Reitenbach
General Linux-HA mailing list  wrote: 
> Along these lines, do you have to take the entire cluster down to do an
> upgrade of heartbeat (2.0.8 to 2.1.2), or can you take one node down,
> upgrade it, bring it back in the cluster, take down the other, etc?
The second way should work as you described.

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] observations after some fencing tests in a two node

2007-11-08 Thread Sebastian Reitenbach
Hi,
Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> Hi,
> 
> On Wed, Nov 07, 2007 at 04:43:32PM +0100, Sebastian Reitenbach wrote:
> > Hi all,
> > 
> > I did some fencing tests in a two node cluster, here are some details of 
my 
> > setup:
> > 
> > - use stonith external/ilo for fencing (ssh to ilo board and issue a 
reset 
> > command)
> > - both nodes are connected via two bridged ethernet interfaces to two 
> > redundant switches. The ilo boards are connected to the each of the 
> > switches.
> > 
> > My first observation:
> > - when removing the network cables from the node that is the DC at the 
> > moment, it took at least three minutes, until it decided to stonith the 
> > other node and to startup the resources that ran on the node without 
network 
> > connectivity
> > - when removing the network cables from the node that is not the DC, 
then it 
> > was a matter of e.g. 20 seconds, then this node fenced the DC, and then 
> > became DC
> 
> This definitely deserves a set of logs, etc (is your hb_report
> operational? :).
humm, yes, with the latest patches (:
ok, I'll reproduce the problem and create a report.

> 
> > Why is there such a difference? The first one takes too long in my eyes 
to 
> > detect the outage, but I hope there are timeout values that I can tweak. 
For 
> > which ones shall I take a look?
> 
> deadtime in ha.cf.
> 
> > Also I recognized the following line in the logfile from the DC in the 
first 
> > case:
> > tengine: ... info: extract_event: Stonith/shutdown of  not matched
> > This line shows up immediately after the DC detects that the other node 
is 
> > unreachable. From then it takes at least two minutes until the DC 
decides to 
> > fence the other node.
> 
> Looks like a kind of misunderstanding between the CRM and
> stonithd. Again, a report would hopefully reveal what's going on.
> If you could turn debug on, that'd be great. A bugzilla is
> fine too.
I'll do that, with the above logs attached.
> 
> > The second thing I observed:
> > My stonith is working via ssh to the ilo board to the node that shall be 
> > fenced. When I remove the ethernet cables from one node, stonith will 
fail 
> > to kill the other node.
> > 
> > take case two from above, remove the cables from the node that is not 
the 
> > DC, where I observed the following:
> > The DC needs about some minutes to decide to fence the other node, 
because 
> > of the above observed behaviour. Meanwhile the non DC node without 
network 
> > cables tried to fence the DC, that failed, and the node was in a unclean 
> > state, until the DC fenced it in the end. 
> > Luckily the stonith of the DC failed, then assume instead of ssh as 
stonith 
> > resource, use a stonith devied connected to e.g. serial port.
> > In that case, the non DC node were able to fence the DC, and then become 
DC 
> > itself, starting all resources, mounting all filesystems, ...
> > Meanwhile the DC is restarted, and either heartbeat is not started 
> > automatically, then the cluster is unusable, because the one node that 
is DC 
> > has no network. Or when heartbeat is started automatically, it cannot 
> > communicate to the second node, and will assume this one is dead,
> 
> and will insist on reseting it. Which would result in a yo-yo
> machinery. Not entirely useful. This kind of lack of
> communication is obviously detrimental, and that in spite of the
> stonith configured. Right now don't see a solution to this issue.
> Apart from pingd.
> 
> > and start 
> > all its resources, so that e.g. filesystems could be mounted on both 
nodes.
> > 
> > I don't have a hardware fencing device to test my theory, but could that 
> > happen or not? Could the usage of some ping nodes, combined with a pingd 
or 
> > an external quorumd help to solve the dilemma?
> 
> A pingd resource with appropriate constraints would help, i.e.
> something like "don't run resources if the pingd attribute is
> zero".
I am already fiddling around with pingd, but I can't seem to get it to work; 
see the other thread: "problem with locations depending on pingd"



> 
> > Well, I am running heartbeat 2.1.2-15 on sles10sp1, any hints and 
comments 
> > are appreciated.
> 
> Thanks,
> 
> Dejan
thank you,

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] problem with locations depending on pingd

2007-11-08 Thread Sebastian Reitenbach
Hi,

Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA 
mailing list  wrote: 
> Hi,
> 
> I tried to follow http://www.linux-ha.org/pingd, the section
> "Quickstart - Only Run my_resource on Nodes with Access to at Least One 
Ping 
> Node"
> 
> therefore I have created the following pingd resources:
> 
>
>  
>
>   value="started"/>
>   value="2"/>
>   name="clone_node_max" value="1"/>
>
>  
>  
>
>  
> value="/tmp/PING.pid"/>
> value="root"/>
> name="host_list" value="192.168.102.199"/>
> value="pingd"/>
>  
>
>  
>
> 
> 
> and here is my location constraint (entered via hb_gui, thererfore is a 
> value there):
> 
>
>  
> id="4349f298-2f36-4bfa-9318-ed9863ab32bb" operation="defined" value="af"/>
>  
> 
> 
> The 192.168.102.199 is just an openbsd host, pingable from both cluster 
> nodes. The NFS_MH resource is a Xen domU.
> On startup of the two cluster nodes, the NFS_MH node went to node1.
> Then I reconfigured the firewall of the ping node to only answer 
> pings from node2. 
> In the cluster itself, nothing happened, but I expected the resource to 
> relocate to the node with connectivity. I still must do sth. wrong I 
think, 
> any hints?

I am still fiddling around to get the location based on ping connectivity 
working; I changed the location score from 100 to INFINITY.

When the pingd resource is started, I see the following 
in /var/log/messages:

Nov  8 11:48:17 ppsnfs101 pingd: [16543]: info: do_node_walk: Requesting the 
list of configured nodes
Nov  8 11:48:18 ppsnfs101 pingd: [16543]: info: send_update: 1 active ping 
nodes
Nov  8 11:48:18 ppsnfs101 pingd: [16543]: info: main: Starting pingd
Nov  8 11:55:24 ppsnfs101 pingd: [21205]: info: 
Invoked: /usr/lib64/heartbeat/pingd -a pingd -d 1s -h 192.168.102.199

and this shows up, when I take a look at the process list:
 6498 ?SL 0:00 heartbeat: write: ping 192.168.102.199
 6499 ?SL 0:00 heartbeat: read: ping 192.168.102.199
16543 ?S  0:00 /usr/lib64/heartbeat/pingd -D -p /tmp/PING.pid -a 
pingd -d 1s -h 192.168.102.199
21709 pts/0S+ 0:00 grep ping

So the ping node was reachable from both cluster nodes on startup, and the 
resources therefore started on one of the hosts. I then changed the firewall 
rules on the ping node to only answer pings from the node the resource is not 
running on, but nothing happened: no new ping-related output in 
/var/log/messages or the ha-log files on either host. I expected a note in 
one of the logfiles that ping is not working anymore, and that the resource 
would migrate to the host with connectivity.

I also put both cluster nodes in standby and changed the firewall on the ping 
node to answer pings from only one cluster node. Then I started both nodes, 
and I saw the resource start on the cluster node that had no connectivity.
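
To see whether the attribute actually changes when the firewall rule flips, 
it can help to query it directly (a sketch; I am assuming the heartbeat 2.x 
crm_attribute options, and ppsnfs101 is just the node name from the log 
above):

   # value pingd wrote into the status section for this node
   crm_attribute -t status -U ppsnfs101 -n pingd -G
   # or grep the live CIB
   cibadmin -Q -o status | grep pingd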

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] observations after some fencing tests in a twonode

2007-11-07 Thread Sebastian Reitenbach
"matilda matilda" <[EMAIL PROTECTED]> wrote: 
> >>> "Sebastian Reitenbach" <[EMAIL PROTECTED]> 07.11.2007 
16:43 >>>
> 
> >The second thing I observed:
> >My stonith is working via ssh to the ilo board to the node that shall be 
> >fenced. When I remove the ethernet cables from one node, stonith will 
fail 
> >to kill the other node.
> 
> Hi Sebastian,
> 
> the answers to your questions will be interesting. :-)
> 
> One additional question by me.
> How did you set up the stonith device? External stonith plugin?
> Where does this stonith resource run? One stonith resource for
> both nodes or one for each?
I have a cloned stonith resource; it is external/ssh, so every node can fence 
any other node via its ilo board.
http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1649
I created it some time ago; the version there is not the latest one. While 
testing today I stumbled over a bug, which I fixed but have not yet uploaded 
again.

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] problem with locations depending on pingd

2007-11-07 Thread Sebastian Reitenbach
Hi,

I tried to follow http://www.linux-ha.org/pingd, the section
"Quickstart - Only Run my_resource on Nodes with Access to at Least One Ping 
Node"

therefore I have created the following pingd resources:

[pingd clone XML stripped by the list archive]


and here is my location constraint (entered via hb_gui, therefore there is a 
value in it):

[location constraint XML stripped by the list archive]


The 192.168.102.199 is just an OpenBSD host, pingable from both cluster 
nodes. The NFS_MH resource is a Xen domU.
On startup of the two cluster nodes, the NFS_MH resource went to node1.
Then I reconfigured the firewall of the ping node to only answer pings from 
node2.
In the cluster itself nothing happened, but I expected the resource to 
relocate to the node with connectivity. I must still be doing something 
wrong, I think; any hints?


kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] observations after some fencing tests in a two node

2007-11-07 Thread Sebastian Reitenbach
Hi all,

I did some fencing tests in a two node cluster, here are some details of my 
setup:

- use stonith external/ilo for fencing (ssh to the ilo board and issue a 
reset command)
- both nodes are connected via two bridged ethernet interfaces to two 
redundant switches. The ilo boards are connected to each of the switches.

My first observation:
- when removing the network cables from the node that is currently the DC, 
it took at least three minutes until it decided to stonith the other node and 
to start up the resources that ran on the node without network connectivity
- when removing the network cables from the node that is not the DC, it was a 
matter of e.g. 20 seconds before this node fenced the DC and then became DC 
itself

Why is there such a difference? The first case takes too long in my eyes to 
detect the outage, but I hope there are timeout values that I can tweak. 
Which ones should I look at?

I also noticed the following line in the logfile on the DC in the first case:
tengine: ... info: extract_event: Stonith/shutdown of  not matched
This line shows up immediately after the DC detects that the other node is 
unreachable. From then on it takes at least two minutes until the DC decides 
to fence the other node.


The second thing I observed:
My stonith works via ssh to the ilo board of the node that shall be fenced. 
When I remove the ethernet cables from one node, stonith will fail to kill 
the other node.

Take case two from above and remove the cables from the node that is not the 
DC; there I observed the following:
The DC needs some minutes to decide to fence the other node, because of the 
behaviour observed above. Meanwhile the non-DC node without network cables 
tried to fence the DC; that failed, and the node was in an unclean state 
until the DC fenced it in the end.
Luckily the stonith of the DC failed. Now assume that, instead of ssh, a 
stonith device connected to e.g. a serial port were used as the stonith 
resource. In that case the non-DC node would be able to fence the DC, then 
become DC itself and start all resources, mount all filesystems, ...
Meanwhile the old DC is restarted. Either heartbeat is not started 
automatically, and the cluster is unusable because the one node that is DC 
has no network; or heartbeat is started automatically, cannot communicate 
with the second node, assumes it is dead, and starts all its resources, so 
that e.g. filesystems could be mounted on both nodes.

I don't have a hardware fencing device to test my theory, but could that 
happen or not? Could the use of some ping nodes, combined with pingd or an 
external quorumd, help to solve the dilemma?

Well, I am running heartbeat 2.1.2-15 on sles10sp1, any hints and comments 
are appreciated.

kind regards
Sebastian



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] best practices monitoring services in Xen instances

2007-11-06 Thread Sebastian Reitenbach
Hi Andrew,

"Andrew Beekhof" <[EMAIL PROTECTED]> wrote: 
> On 11/5/07, Sebastian Reitenbach <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > to remove complexity from my cluster, I am experimenting with Xen.
> > Starting and stopping the Xen resources via heartbeat works well 
already.
> > I am a bit concerned about the services in the virtual machines, how is 
the
> > best approach to monitor their availability?
> 
> what you're talking about is basically having the crm manage resources
> on non-cluster nodes.
> 
> we've kicked around some ideas for implementing this in the past but
> its never really bubbled to the top of anyone's todo list.
> 
> there's not really any "best practices" for this as its not really
> being done a whole lot (from what I hear anyway).  depending on how
> complex the relationships between the resources inside the Xen guests
> are, i'd go with option 1 (if they're complex) or 2 (if not)

Thank you for your comments. I more or less have to check that the services 
do not get killed by the OOM killer: e.g. when I have 3 domUs running and 
want to start a 4th one but have no free memory available, I have to shrink 
the memory of the already running domUs via Xen's mem-set. But when I do 
that, it can happen that the OOM killer in a domU kills the very services 
that the domU is intended to provide. Unfortunately, heartbeat has nothing to 
detect that yet.
I am just tweaking the Xen resource script. I added a parameter, 
OCF_RESKEY_monitor_scripts, naming scripts that the Xen resource script runs 
when the monitor action for the domU is called. These custom scripts test the 
services assigned to the domU; if one fails, the whole domU is restarted via 
heartbeat, which hopefully gets the internal service restarted too.
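
A minimal sketch of the kind of script such a monitor_scripts parameter could 
point to (purely hypothetical: it assumes ssh access into the domU and that 
an NFS daemon is the service to check):

   #!/bin/sh
   # check one service inside the domU; exit 0 = healthy, non-zero = failed
   DOMU_IP="$1"
   ssh -o ConnectTimeout=5 root@"$DOMU_IP" \
       'rpcinfo -t localhost nfs >/dev/null 2>&1'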

Sebastian
> 
> >
> > I have some solutions, but would like to know what corresponds to best
> > practice:
> >
> > - install heartbeat in the virtual domains too, then monitor the 
resources
> > within the xen instance, but I think this is counterproductive as I 
wanted
> > to remove complexity from the cluster due to having less resources.
> >
> > - monitor the services in the virtual domains using SNMP, or custom 
scripts,
> > and in case sth. fails, crm_resource stop and setart it again. Well, 
custom
> > scripts sounds a bit error prone.
> >
> > - I don't know whether xen has the ability, but does the priviliged 
domain
> > has the ability to query a given domU for the state of a process, and in
> > case the state is just not the wanted one, restart the domU.
> >
> > I think the last one, would be the best, but I have no idea, whether xen 
can
> > do that at all. I played around with OpenVZ for a short while, that at 
least
> > could do that. Any other ideas, comments, rants are very welcome.
> >
> > kind regards
> > Sebastian
> >
> > ___
> > Linux-HA mailing list
> > Linux-HA@lists.linux-ha.org
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> >
> 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] best practices monitoring services in Xen instances

2007-11-05 Thread Sebastian Reitenbach
Hi,

to remove complexity from my cluster, I am experimenting with Xen.
Starting and stopping the Xen resources via heartbeat works well already. 
I am a bit concerned about the services in the virtual machines, how is the 
best approach to monitor their availability?

I have some solutions, but would like to know what corresponds to best 
practice:

- install heartbeat in the virtual domains too, then monitor the resources 
within the Xen instance; but I think this is counterproductive, as I wanted 
to remove complexity from the cluster by having fewer resources.

- monitor the services in the virtual domains using SNMP or custom scripts, 
and in case something fails, stop and start the resource again with 
crm_resource. Well, custom scripts sound a bit error-prone.

- I don't know whether Xen has the ability, but can the privileged domain 
query a given domU for the state of a process and, in case the state is not 
the wanted one, restart the domU?

I think the last one would be the best, but I have no idea whether Xen can do 
that at all. I played around with OpenVZ for a short while; that at least 
could do it. Any other ideas, comments, or rants are very welcome.

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] hb_gui failing to authenticate... although it hasworkedin the past

2007-11-05 Thread Sebastian Reitenbach
John Gardner <[EMAIL PROTECTED]> wrote: 
> 
> > 
> > Well, I don't know how to fix your GUI problem, but what about using 
> > cibadmin -Q -o resources, to get all configured resources, edit the 
output, 
> > add a new IP address resource, and add it cibadmin -U to update it?
> > 
> > Sebastian
> > 
> 
> Seabastian, now I have another problem (lack of knowledge on my part!)
> 
> When I type:
> 
> cibadmin -Q -o resources
> 
> I get:
> 
>  
> id="virtual_ip">
>  
>
>   value="192.168.1.74"/>
>
>  
>
>  
> 
> How would I add another virtual ip address using cibadmin?
> 
> Presumably I'd use the modify (-M) switch?  How do I generate the unique
> nvpair id?
Take something like this:

[the example resources.xml XML was stripped by the list archive]
and update the cib with
cibadmin -U -o resources -x resources.xml
I am not perfectly sure whether -U, -R, or -M is the correct parameter.

where resources.xml is a file containing above example.
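
Since the XML above got stripped by the list archive, here is a rough 
reconstruction of what such a resources.xml could contain (ids and the 
address are placeholders, not the original values):

   <resources>
     <primitive id="virtual_ip_2" class="ocf" type="IPaddr" provider="heartbeat">
       <instance_attributes id="virtual_ip_2_instance_attrs">
         <attributes>
           <nvpair id="virtual_ip_2_addr" name="ip" value="192.168.1.75"/>
         </attributes>
       </instance_attributes>
     </primitive>
   </resources>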

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] hb_gui failing to authenticate... although it hasworked in the past

2007-11-05 Thread Sebastian Reitenbach
Hi,

General Linux-HA mailing list  wrote: 
> Dejan Muhamedagic wrote:
> > Hi,
> > 
> > On Wed, Oct 31, 2007 at 01:44:14PM +, John Gardner wrote:
> >> I'm running CentOS 4.5 and I've had heartbeat up and running
> >> successfully for a number of months, and configured it initially with
> >> hb_gui, but I've never really used it since then...  Anyway, 4 months 
on
> >> I powered up hb_gui to make some changes and it won't connect to the
> >> heartbeat server, it gives this error:
> >>
> >> mgmtd: [23909]: ERROR: on_listen receive login msg failed
> > 
> > The first message on the connection was not a proper login
> > message. This used to happen in the past also because of
> > different client and server versions.
> > 
> >> Why would it suddenly stop working?
> > 
> > Are you absolutely sure that nothing changed in the meantime?
> > Things don't just stop working. Typically there is a reason.
> 
> Yeah, I agree.  Things don't stop working... but this has.  OK, that's
> not true.  hb_gui hasn't stopped working, the GUI still appears and
> heartbeat is still operating fine, the only problem is that hb_gui will
> now no longer connect to heartbeat.
> 
> > 
> > Which heartbeat version do you run?
> > 
> 
> I'm using 2.0.8 which is packaged for CentOS.  I'm using 2.0.8. version
> of of hb_gui and 2.0.8 of heartbeat.
> 
> I've installed hb_gui on three separate servers, two access the
> heartbeat server via VPN and the third is on the same subnet as the
> heartbeat server.  All three connect using:
> 
> Server(:port) 192.168.1.65:694
> Username  heartbeat
> Password  xxx
> 
> The 'heartbeat' user is definitely in the haclient group (see /etc/group
> below)
> 
> haclient:x:90:heartbeat
> 
> But every time I try to connect from either hb_gui client I get:
> 
> Failed in the authentication
> User Name or Password may be wrong.
> or the user desn't belong to the haclient group
> 
> on the GUI and the following in the log:
> 
> Nov  5 10:26:56 server01 mgmtd: [25556]: ERROR: on_listen receive login
> msg failed
> Nov  5 10:32:10 server01 mgmtd: [25556]: ERROR: on_listen receive login
> msg failed
> 
> I'm at a bit of a loss what to do next really, I need to add another
> Virtual IP today, but I can't connect :-(

Well, I don't know how to fix your GUI problem, but what about using 
cibadmin -Q -o resources to get all configured resources, editing the output 
to add a new IP address resource, and then applying it with cibadmin -U?

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Patch to fix long interface names in IPaddr.

2007-11-02 Thread Sebastian Reitenbach
Hi,

General Linux-HA mailing list  wrote: 
> 
> On Nov 2, 2007, at 7:22 AM, Sean Reifschneider wrote:
> 
> > If you have a long interface name, such as "vlan1000", ifconfig cuts  
> > off
> > alias names so that it shows "vlan1000:" instead of "vlan1000:0".   
> > This is
> > on probably pretty much all Linux, but specifically we were using  
> > Debian
> > Etch.  I presume that there would be a similar problem for >9  
> > aliases on an
> > interface named something like "vlan999" or >99 aliases on an  
> > interface
> > named "vlan99".
> >
> > The behavior is that "start" works, but stop tries to remove the  
> > alias from
> > "vlan1000:", which fails, leaving the IP up on the passive machine  
> > if you
> > have gracefully failed over (and STONITH doesn't kill the previously- 
> > active
> > node).
> >
> > I tracked this down via the logs, and Scott Kleihege used his awk-fu  
> > to work
> > up the following patch.  I'm not sure if you'll want to include this  
> > as it
> > relies on the "iproute2" program "ip" to be installed,
> 
> yeah, thats going to be problematic as IPaddr needs to work on non- 
> linux systems.
At least this shouldn't harm OpenBSD, because the alias interface notation is 
different from Linux; the ":" notation is not used there.

> 
> IPaddr2 has always been linux specific though...
> 
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] hb_gui failing to authenticate... although it hasworkedin the past

2007-11-01 Thread Sebastian Reitenbach
Hi,

John Gardner <[EMAIL PROTECTED]> wrote: 
> Sebastian Reitenbach wrote:
> > Hi,
> > 
> > General Linux-HA mailing list  wrote: 
> >> I've checked by connecting to the server using hb_gui on the same 
subnet
> >> so I know it's not firewall related.  It has just inexplicitly stopped
> >> working!
> >>
> >> Does any one have any other ideas as to why the hb_gui connection has
> >> stopped working?  Is there any other way to set up a virtual ip without
> >> using hb_gui?
> > I had a similar problem on openSUSE 10.2. Check /etc/pam.d/hbmgmtd, and 
> > change pam_unix.so to pam_unix2.so. That fixed the problem for me on 
> > openSUSE, despite I just checked on SLES10 SP1, and I only have 
pam_unix.so 
> > there but is working.
> > 
> > Sebastian
> > 
> 
> Thanks Sebastian
> 
> Will pam_unix and pam_unix2 coexist on the same box?  On CentOS pam_unix
> is installed from a package, pam_unix2 seems to be only available as
> source... if I build it, presumbly I can keep the other?
> 
Unfortunately I have no idea; I haven't tried to use both on the same 
machine, but as far as I can see, why not.

Sebastian 

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] hb_gui failing to authenticate... although it hasworked in the past

2007-11-01 Thread Sebastian Reitenbach
Hi,

General Linux-HA mailing list  wrote: 
> 
> I've checked by connecting to the server using hb_gui on the same subnet
> so I know it's not firewall related.  It has just inexplicitly stopped
> working!
> 
> Does any one have any other ideas as to why the hb_gui connection has
> stopped working?  Is there any other way to set up a virtual ip without
> using hb_gui?
I had a similar problem on openSUSE 10.2. Check /etc/pam.d/hbmgmtd and change 
pam_unix.so to pam_unix2.so. That fixed the problem for me on openSUSE, 
although I just checked on SLES10 SP1, where I only have pam_unix.so and it 
is working.

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Xen memory allocation in the cluster

2007-11-01 Thread Sebastian Reitenbach
Hi,

When a virtual Xen machine migrates from one node to another, the remaining 
virtual hosts on the original node could potentially allocate more memory, 
and on the new node the virtual hosts already there have to give up some of 
their memory to make room for the new one.

As far as I know, Xen is not able to handle this automatically, but it can be 
set manually via the xm mem-set command.

I am looking for an OCF resource script that would raise/lower memory usage 
on virtual nodes automagically. Does something like this already exist?
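
For the manual case, the commands in question would be something like this 
(the domain name and size are made up):

   xm mem-set nfs_domU 1024    # shrink or grow the domU to 1024 MB
   xm list                     # check the new allocation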


kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Xen live resource migration

2007-10-26 Thread Sebastian Reitenbach
Hi all,

Is there a good reason why the allow_migrate parameter of the Xen resource 
script is not included in the meta-data output?

I had to read the Xen resource script to stumble across that parameter, and 
it took me another while to figure out where to specify it.
If there is no really good reason not to make the parameter available in the 
meta-data, I'd create a patch to add it there.
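
For the archives, specifying it as an ordinary instance attribute of the Xen 
primitive looks roughly like this (only a sketch: ids and the xmfile path are 
placeholders, and whether this is really the intended place is exactly what 
the meta-data should document):

   <primitive id="xen_nfs" class="ocf" provider="heartbeat" type="Xen">
     <instance_attributes id="xen_nfs_instance_attrs">
       <attributes>
         <nvpair id="xen_nfs_xmfile" name="xmfile" value="/etc/xen/vm/nfs"/>
         <nvpair id="xen_nfs_allow_migrate" name="allow_migrate" value="true"/>
       </attributes>
     </instance_attributes>
   </primitive>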

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] GUI fails to authenticate hacluster user

2007-08-26 Thread Sebastian Reitenbach
Hi,

> >
> > in /etc/pam.d the hbmgmtd file is there:
> > cat hbmgmtd
> > auth    required    pam_unix.so
> > account required    pam_unix.so
> >
> > another GUI instance, still running since yesterday, was still working
> > fine.
> >
> > any idea what could have caused this behaviour?
> >
> > kind regards
> > Sebastian
> 
> use pam_unix2.so
> 
thanks a lot, that helped.

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] GUI fails to authenticate hacluster user

2007-08-24 Thread Sebastian Reitenbach
Hi,
system: opensuse 10.2, i586
hbversion: 2.1.2

I just tried to use the GUI on one of our clusters and log in as the 
hacluster user. The login did not work. In the logs I can see the following 
messages:
Aug 24 14:18:21 ogo2 mgmtd: pam_unix(hbmgmtd:auth): authentication failure; 
logname= uid=0 euid=0 tty= ruser= rhost=  user=hacluster
Aug 24 14:18:23 ogo2 mgmtd: [10035]: ERROR: on_listen pam auth failed

The password is correct; I tried su hacluster with the same password.

in /etc/pam.d the hbmgmtd file is there:
cat hbmgmtd
auth    required    pam_unix.so
account required    pam_unix.so

another GUI instance, still running since yesterday, was still working fine.

any idea what could have caused this behaviour?

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] cluster chokes after upgrade from 2.0.8 to 2.1.2

2007-08-23 Thread Sebastian Reitenbach
Hi,

crmd[19331]: 2007/08/23_20:21:48 WARN: msg_to_op(1173): failed to get the 
value of field lrm_opstatus from a ha_msg
crmd[19331]: 2007/08/23_20:21:48 info: msg_to_op: Message follows:
crmd[19331]: 2007/08/23_20:21:48 info: MSG: Dumping message with 13 fields
crmd[19331]: 2007/08/23_20:21:48 info: MSG[0] : [lrm_t=op]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[1] : [lrm_rid=IP_SysLog]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[2] : [lrm_op=monitor]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[3] : [lrm_timeout=5]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[4] : [lrm_interval=1]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[5] : [lrm_delay=3]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[6] : [lrm_targetrc=-2]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[7] : [lrm_app=crmd]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[8] : 
[lrm_userdata=38:81:cf036593-e41b-4560-8215-be1aaf753b91]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[9] : [(2)lrm_param=0x80a6d8(373 
461)]
crmd[19331]: 2007/08/23_20:21:48 info: MSG: Dumping message with 15 fields
crmd[19331]: 2007/08/23_20:21:48 info: MSG[0] : [target_role=started]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[1] : [CRM_meta_interval=1]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[2] : [ip=192.168.102.39]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[3] : [CRM_meta_prereq=fencing]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[4] : [CRM_meta_start_delay=3]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[5] : [CRM_meta_role=Started]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[6] : [cidr_netmask=23]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[7] : 
[CRM_meta_id=873df73f-b63a-4645-91d9-7921eec339a1]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[8] : [broadcast=192.168.103.255]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[9] : [CRM_meta_timeout=5]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[10] : [CRM_meta_on_fail=fence]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[11] : [crm_feature_set=2.0]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[12] : [CRM_meta_disabled=false]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[13] : [CRM_meta_name=monitor]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[14] : [nic=bridge0]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[10] : [lrm_callid=41]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[11] : [lrm_app=crmd]
crmd[19331]: 2007/08/23_20:21:48 info: MSG[12] : [lrm_callid=41]

Also, when I add another node to the cluster, I see the above messages, and a 
short time later the cluster starts stonithing itself.

Sebastian

Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA 
mailing list  wrote: 
> 
> 
> Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> > On Thu, Aug 23, 2007 at 09:43:09AM +0200, Sebastian Reitenbach wrote:
> > > Hi,
> > > 
> > > Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA 
> > > mailing list  wrote: 
> > > > Hi list,
> > > > 
> > > > after upgrading a two node cluster from 2.0.8 to 2.1.2, running on 
> SLES 
> > > 10, 
> > > > x86_64, I see every 17 seconds the following line in the logs:
> > > > 
> > > > lrmd[3651]: 2007/08/22_18:28:30 notice: Not currently connected.
> > > > 
> > > > should I worry about that note?
> > > 
> > > after recreating the whole configuration via the GUI point n' click 
> orgy, 
> > > this notice disappeared. Also the problem described below is gone, the 
> > > cluster seems to behave just fine now.
> > > 
> > > > 
> > > > This happens when one node is stopped. Adding the second node to the 
> > > > cluster, then the IPaddr resources start to going crazy. It seems 
that 
> are 
> > > > always the last IP addresses that are configured in the resources 
> cib.xml 
> > > at 
> > > > the end that fail. Some of the IPaddr resources have no problem. The 
> > > > configuration worked for weeks with heartbeat 2.0.8.
> > > > 
> > > > heartbeat spams about 5MB/minute into the logfiles, therefore I do 
not 
> > > want 
> > > > to append them here (:
> > > 
> > > In case anybody is interested in logfiles/configuration old and new 
one, 
> I 
> > > can open a bugzilla entry. 
> > 
> > Yes. If it's not too much trouble. Do you still have your old
> > configuration saved? Did you try to find differences between the
> > old and the new one with crm_diff? The whole thing seems to be
> > quite strange.
> crm_diff produces a looong output, the resoure ID's are different, 
> nevertheless, I created a bugreport:
> 
> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1694
> 
> I am not perfectly sure about whether the log really is from the time of a 
> problem, but If I see t

Re: [Linux-HA] cluster chokes after upgrade from 2.0.8 to 2.1.2

2007-08-23 Thread Sebastian Reitenbach


Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: 
> On Thu, Aug 23, 2007 at 09:43:09AM +0200, Sebastian Reitenbach wrote:
> > Hi,
> > 
> > Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA 
> > mailing list  wrote: 
> > > Hi list,
> > > 
> > > after upgrading a two node cluster from 2.0.8 to 2.1.2, running on 
SLES 
> > 10, 
> > > x86_64, I see every 17 seconds the following line in the logs:
> > > 
> > > lrmd[3651]: 2007/08/22_18:28:30 notice: Not currently connected.
> > > 
> > > should I worry about that note?
> > 
> > after recreating the whole configuration via the GUI point n' click 
orgy, 
> > this notice disappeared. Also the problem described below is gone, the 
> > cluster seems to behave just fine now.
> > 
> > > 
> > > This happens when one node is stopped. Adding the second node to the 
> > > cluster, then the IPaddr resources start to going crazy. It seems that 
are 
> > > always the last IP addresses that are configured in the resources 
cib.xml 
> > at 
> > > the end that fail. Some of the IPaddr resources have no problem. The 
> > > configuration worked for weeks with heartbeat 2.0.8.
> > > 
> > > heartbeat spams about 5MB/minute into the logfiles, therefore I do not 
> > want 
> > > to append them here (:
> > 
> > In case anybody is interested in logfiles/configuration old and new one, 
I 
> > can open a bugzilla entry. 
> 
> Yes. If it's not too much trouble. Do you still have your old
> configuration saved? Did you try to find differences between the
> old and the new one with crm_diff? The whole thing seems to be
> quite strange.
crm_diff produces a very long output, and the resource IDs are different; 
nevertheless, I created a bug report:

http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1694

I am not perfectly sure whether the log really is from the time of a problem, 
but if I see the problem again I'll add new logs.

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] cluster chokes after upgrade from 2.0.8 to 2.1.2

2007-08-23 Thread Sebastian Reitenbach
Hi,

Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA 
mailing list  wrote: 
> Hi list,
> 
> after upgrading a two node cluster from 2.0.8 to 2.1.2, running on SLES 
10, 
> x86_64, I see every 17 seconds the following line in the logs:
> 
> lrmd[3651]: 2007/08/22_18:28:30 notice: Not currently connected.
> 
> should I worry about that note?

after recreating the whole configuration via the GUI point n' click orgy, 
this notice disappeared. Also the problem described below is gone, the 
cluster seems to behave just fine now.

> 
> This happens when one node is stopped. Adding the second node to the 
> cluster, then the IPaddr resources start to going crazy. It seems that are 
> always the last IP addresses that are configured in the resources cib.xml 
at 
> the end that fail. Some of the IPaddr resources have no problem. The 
> configuration worked for weeks with heartbeat 2.0.8.
> 
> heartbeat spams about 5MB/minute into the logfiles, therefore I do not 
want 
> to append them here (:

In case anybody is interested in the logfiles and the old and new 
configuration, I can open a bugzilla entry.

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] cluster chokes after upgrade from 2.0.8 to 2.1.2

2007-08-22 Thread Sebastian Reitenbach
Hi list,

after upgrading a two node cluster from 2.0.8 to 2.1.2, running on SLES 10, 
x86_64, I see every 17 seconds the following line in the logs:

lrmd[3651]: 2007/08/22_18:28:30 notice: Not currently connected.

should I worry about that note?

This happens when one node is stopped. When the second node is added to the 
cluster, the IPaddr resources start going crazy. It always seems to be the IP 
addresses configured at the end of the resources section of cib.xml that 
fail. Some of the IPaddr resources have no problem. The configuration worked 
for weeks with heartbeat 2.0.8.

heartbeat spams about 5MB/minute into the logfiles, therefore I do not want 
to append them here (:

Is this known to somebody else?

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] heartbeat debug rpm packages for suse 10.2?

2007-08-01 Thread Sebastian Reitenbach
Hi,

I am running heartbeat-2.1.0 on openSUSE 10.2, downloaded from here:
http://download.opensuse.org/repositories/server:/ha-clustering/openSUSE_10.2/x86_64/


In /var/log/messages, I saw the lrmd crashing, and restarting:
Aug  1 11:29:37 srv5 tengine: [10166]: info: process_graph_event: Detected 
action resource_PUB_IPS_monitor_0 from a different transition: 190 vs. 195
Aug  1 11:29:37 srv5 tengine: [10166]: info: match_graph_event: Action 
resource_PD_NFS_monitor_0 (15) confirmed on srv5
Aug  1 11:29:37 srv5 cib: [12892]: info: write_cib_contents: Wrote version 
0.122.21866 of the CIB to disk (digest: 8e6e22d91098f6d62a3b5fb6dc1965c2)
Aug  1 11:29:37 srv5 heartbeat: [9640]: WARN: 
Exiting /usr/lib64/heartbeat/lrmd -r process 9840 killed by signal 11 
[SIGSEGV - Segmentation violation].
Aug  1 11:29:37 srv5 heartbeat: [9640]: ERROR: 
Exiting /usr/lib64/heartbeat/lrmd -r process 9840 dumped core
Aug  1 11:29:37 srv5 heartbeat: [9640]: ERROR: Respawning 
client "/usr/lib64/heartbeat/lrmd -r":
Aug  1 11:29:37 srv5 heartbeat: [9640]: info: Starting child 
client "/usr/lib64/heartbeat/lrmd -r" (0,0)
Aug  1 11:29:37 srv5 heartbeat: [12935]: info: 
Starting "/usr/lib64/heartbeat/lrmd -r" as uid 0  gid 0 (pid 12935)


Unfortunately the lrmd binary does not contain debugging symbols:
srv5:/ # gdb -c core
GNU gdb 6.5
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain 
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Core was generated by `/usr/lib64/heartbeat/lrmd -r'.
Program terminated with signal 11, Segmentation fault.
#0  0x0040801a in ?? ()
(gdb) symbol-file /usr/lib64/heartbeat/lrmd
Reading symbols from /usr/lib64/heartbeat/lrmd...(no debugging symbols 
found)...done.
Using host libthread_db library "/lib64/libthread_db.so.1".
(gdb) bt
#0  0x0040801a in g_str_equal ()
#1  0x0040835c in g_str_equal ()
#2  0x2b301fc80e29 in ?? ()
#3  0x006acd58 in ?? ()
#4  0x in ?? ()


are there debug rpm's available that include debugging symbols?

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] two node cluster preventing split brain?

2007-07-31 Thread Sebastian Reitenbach
Hi list,

I am running heartbeat-2.1.0 on openSUSE 10.2.
I use ssh to the ilo of the servers to stonith them; stonith generally works 
well. I configured the two ilo IP addresses as ping nodes in /etc/ha.d/ha.cf. 
I have a clone set pingd, a clone set stonith, and a clone set suicide 
defined, each with max_clones=2 and max_clone_node=1.
Stonith is enabled in the cluster.

Now my test scenario: 
- I remove the cables from the ilo boards
  - heartbeat correctly detects both as down
- then I remove the network cables from one host
Here the split-brain situation happens; stonith cannot work, as the ilo 
boards are not reachable from any host.

Here are my questions:
Can I define constraints in the cluster, or operations on the suicide or 
pingd resource, so that the current DC stays alive and the other node 
suicides itself?
How is suicide intended to work? Unfortunately it is not a script that I can 
just read.


kind regards
Sebastian

autojoin any
crm true
deadtime 15
initdead 30
keepalive 2
ping 192.168.0.96 192.168.0.97
node srv4
node srv5
mcast bond0 224.0.0.1 700 1 0
cluster MyCluster

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Failover for multiple xDSL/FW

2007-07-30 Thread Sebastian Reitenbach
Hi,

> > Hi to All!
> > 
> > If I have this example configuration:
> > 
> > ROUTER1--- FW1
> >   - LAN/Client
> > ROUTER2---FW2 
> > 
> > 
> > ROUTER1 = 80.0.0.0/29
> > ROUTER2 = 90.0.0.0/29
> > 
> > FW = Linux
> > FW1 (LAN) = 192.168.0.253
> > FW2 (LAN) = 192.168.0.252
> > 
> > GW Client LAN = 192.168.0.254 (HA)
> > 
> > can I use LinuxHA for this solution?

You could, but when you only want to do NAT from inside out and port 
redirection from outside to internal servers, and want to have two different 
static routes, I do this with the OpenBSD pf firewall and carp. In case of 
failover it only takes a second; the connection states (tcp, udp, whatever) 
are synchronized between the two nodes, and if you want to use it as an IPsec 
VPN endpoint, the IPsec flows and associations are synchronized too. So in 
case of a failover nobody would notice a broken connection. LinuxHA would 
take much more time to fail over.
When you need dynamic routing, OpenBSD comes with OpenBGPd and OpenOSPFd.

But LinuxHA should work for that too, with a somewhat slower failover and 
without the synchronized firewall and IPsec states.
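
A rough sketch of the OpenBSD side (interface names, vhid and password are 
made up; the address is the shared gateway from the example):

   # /etc/hostname.carp0 on both firewalls -- the shared LAN gateway
   inet 192.168.0.254 255.255.255.0 192.168.0.255 vhid 1 pass somesecret carpdev fxp0
   # /etc/hostname.pfsync0 -- state table synchronisation over a dedicated link
   up syncdev fxp1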

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Finding out which host is active...

2007-07-28 Thread Sebastian Reitenbach
Hi,

General Linux-HA mailing list  wrote: 
> Hi,
>  
> I have a two node cluster failing an IP back and forth.
> Is there an easy way to determine which host is currently holding the ip
> address?
just run the crm_mon tool.

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] HeartBeat doesn't see my process is down.

2007-07-27 Thread Sebastian Reitenbach
Hi,

General Linux-HA mailing list  wrote: 
> 
>   Hello,
> 
>   I configured Heartbeat 2.0.8-2.el4 to start my web server Apache (in
> haresources) on Linux when it starts.
>   The heartbeat configuration runs well in case of hardware crashes 
but if
> the web server goes down only, Heartbeat doesn't see Apache is down and 
doesn't
> send a message to the second server to start its Apache.
>   I check the script /etc/init.d/httpd status and it returns code 1.
>   So I don't understand where is the problem ?
do you have a monitor action defined for the resource?
http://www.linux-ha.org/ClusterInformationBase/Actions
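
With crm enabled, such a monitor operation on the resource would look roughly 
like this (a sketch with made-up ids and timings; the page above shows the 
full syntax):

   <primitive id="apache_1" class="lsb" type="httpd">
     <operations>
       <op id="apache_1_monitor" name="monitor" interval="30s" timeout="20s"/>
     </operations>
   </primitive>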

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Help Getting Started

2007-07-26 Thread Sebastian Reitenbach
Hi,

> It appears that I already have a cib.xml in /var/lib/heartbeat/crm/ from
> the initial set up.
> 
>   ignore_dtd="false" ccm_transition="8" num_peers="2"
> cib_feature_revision="1.3" dc_uuid="b0bd581b-950c-4fa9-ad25-b1f288b
> 03123" epoch="7" num_updates="47" cib-last-written="Thu Jul 19 19:49:34
> 2007">
>
>  
>  
> type="normal"/>
> type="normal"/>
>  
>  
>  
>
>  

this is the initial cib.xml, created from the information configured 
in /etc/ha.d/ha.cf, therefore only the nodes are in there.
> 
> Presumably, if I use the GUI, it will add things to this default file?
> 
yes, exactly, it will add the resources, constraints... there.

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Help Getting Started

2007-07-26 Thread Sebastian Reitenbach
Hi,

> 
> So, can anyone tell me how I can proceed?  I guess that the next step is
Do you have the GUI installed too? Then use that to create the first 
resources. This will create an initial cib.xml file in 
/var/lib/heartbeat/crm.

> to create a cib.xml file which specifies the virtual ip address?  Also
> in the ha.cf sometimes it shows 'crm on' and other times 'crm yes'...
> which is correct?
Both are correct; true or 1 would also work.

Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] active active failover NFS server?

2007-07-25 Thread Sebastian Reitenbach
Hi,

> > Thanks a lot, there I found a link to an active-active NFS HA tutorial 
at 
> > http://chilli.linuxmds.com/~mschilli/NFS/active-active-nfs.html, 

After fiddling around with the files there, I got it working on the command 
line, running

/etc/ha.d/resource.d/nfs servername start

or

/etc/ha.d/resource.d/nfs servername stop

to mount and umount the filesystems and export them. When I configure the 
resource to be managed by heartbeat, the script is started with no 
parameters. The script seems to be for an older version, 1.x, of heartbeat. 
Is it only in the wrong place, or is it no longer compatible and needs more 
tweaking?

I am using heartbeat-2.1.0 on opensuse 10.2.
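
A possible stop-gap (an untested sketch; "servername" is whatever the old 
script expects as its first argument) would be a small LSB-style wrapper, so 
that heartbeat's plain start/stop calls reach the heartbeat-1 script:

   #!/bin/sh
   # /etc/init.d/nfs-servername -- forward LSB actions to the heartbeat-1 script
   SERVERNAME=servername
   exec /etc/ha.d/resource.d/nfs "$SERVERNAME" "$1"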

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] how many resources does linux-ha can handle?

2007-07-25 Thread Sebastian Reitenbach
Hi,
 
> > > 
> > > I don't want to add my resources as cib file here, because it is more 
> than 
> > > 20 pages printed out :)
> > Well you could attach the CIB (bzip2 is your friend). Without it no one 
> can help 
> > here. So we can see which resource failed and maybe where the problem in
> > your configuration is.
> > 
> I think the hint with the timings is good, and I'll first try to change 
the 
> script. If that doesn't help, then I'll post it here.
> 
Just for the record:

I changed the IPaddr script to handle a group of IP addresses. That reduced 
the number of resources a lot. Additionally, I had to add an order constraint 
for the IP resources to make it work smoothly.

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] heartbeat-gui documentation?

2007-07-24 Thread Sebastian Reitenbach
Hi,

> 
> Or at least tell me what username & password to give it to logon? Using 
> my server's root account doesn't seem to work
it is the password of the user that runs the heartbeat daemon, in Linux 
usually hacluster. You have to assign a password to that user in the system.
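
For example (assuming the default hacluster account):

   # give the cluster user a password, then use it in the hb_gui login dialog
   passwd hacluster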

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] alias IP addresses on OpenBSD

2007-07-24 Thread Sebastian Reitenbach
Hi,

I found this snippet in the IPaddr script:

find_interface_bsd() {
#$IFCONFIG $IFCONFIG_A_OPT | grep "inet.*[: ]$OCF_RESKEY_ip "
$IFCONFIG | grep "$ipaddr" -B20 | grep "UP," | tail -n 1 | cut -d ":" -f 
1
}

#
#   Find out which alias serves the given IP address
#   The argument is an IP address, and its output
#   is an aliased interface name (e.g., "eth0:0").
#

To see the aliases of an interface shown by ifconfig, you have to add the 
parameter -A to ifconfig:

But even running the command with -A:

ifconfig -A | grep 213.239.221.55 -B20 | grep "UP," | tail -n 1 | 
cut -d ":" -f 1

still only gives me:

fxp0 

because there are no fxp0:0 or eth0:0 alias interfaces on OpenBSD. But when 
I take a look into the delete_interface function in the same script, it 
looks like it will work.

here an example output of ifconfig

ifconfig -A
other interfaces ...
fxp0: flags=8843 mtu 1500
lladdr 00:02:b3:88:e6:41
groups: egress
media: Ethernet autoselect (100baseTX full-duplex)
status: active
inet 21.39.21.41 netmask 0xffe0 broadcast 21.39.21.63
inet6 fe80::202:b3ff:fe88:e641%fxp0 prefixlen 64 scopeid 0x1
inet 21.39.21.55 netmask 0xff00 broadcast 21.39.21.255
other interfaces...

the line 
inet 21.39.21.55 netmask 0xff00 broadcast 21.39.21.255
defines the alias IP address. This line will not show up, without the -A 
parameter.

Well, I only ran the command manually on the command line, not from within 
linux-ha, but as far as I can see the script would not find the IP address of 
the interface.
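
A possible way to fix it (an untested sketch, reusing the $IFCONFIG and 
$ipaddr variables from the snippet above) would be to use -A and to anchor 
the match on the inet line itself:

   find_interface_bsd() {
       # -A makes ifconfig print alias addresses too; remember the last
       # interface header seen and print it when one of its inet lines matches
       $IFCONFIG -A | awk -v ip="$ipaddr" '
           /^[a-z]/ { iface = $1; sub(/:$/, "", iface) }
           $1 == "inet" && $2 == ip { print iface }'
   }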

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] how many resources does linux-ha can handle?

2007-07-23 Thread Sebastian Reitenbach
Hi,

> From my experience I recommend not to use the GUI. It never did the job
> for me (specially not creating a configuration).
yeah, the GUI is a bit...

> 
> From your description I assume you have timing problems. Keep in mind that
> cluster node startups really generate a load on the HA system. Each 
resource
> is probed (basically it runs a 'monitor' operation on each resource on 
each 
> cluster  node).
> 
> So if you have 2 nodes with 40 resources a node startup ---> 80 monitor 
actions 
> initiated ---> 80 responses ---> 80 changes in the CIB --> 80 
redistributions 
> (not to mention the engine calulcating your failove-rules for all the 
resources).
that's a good hint, thanks.

> 
> Did you write Resource Agents on your own? Or do you use only the standard
> HA RA?
I only use the standard resource agents in that cluster; only for stonith do 
I use self-written scripts that kill the other nodes via ssh to the iLO 
board.

> 
> Are you using clones?

I have about 9 clone sets and the same number of groups, each group 
containing 9 or 10 resources.

Maybe I can try changing the IPaddr script to allow me to give it a list of 
IP addresses and a list of devices; then each group would only consist of 2 
or fewer resources. As far as I can see, that could reduce the load on the 
servers and maybe fix the timing problems.

> 
> You see ... attaching the CIB as attachment would help ;-)
> 
> > 
> > I don't want to add my resources as cib file here, because it is more 
than 
> > 20 pages printed out :)
> Well you could attach the CIB (bzip2 is your friend). Without it no one 
can help 
> here. So we can see which resource failed and maybe where the problem in
> your configuration is.
> 
I think the hint with the timings is good, and I'll first try to change the 
script. If that doesn't help, then I'll post it here.

thanks
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] active active failover NFS server?

2007-07-22 Thread Sebastian Reitenbach
Hi,

"Sebastian Reitenbach" <[EMAIL PROTECTED]> wrote: 
> Hi,
> 
> > 
> > You can start here: http://linux-ha.org/HaNFS
> > 
> > > 
> Thanks a lot, there I found a link to an active-active NFS HA tutorial at 
> http://chilli.linuxmds.com/~mschilli/NFS/active-active-nfs.html, 
> unfortunately I do not get an IP for the hostname, therefore I only found 
it 
> in Google Cache. They use exportfs [-u] to add or remove mount points on 
the 
> nfs servers.
> There is a HA-NFS.tar mentioned, to be downloaded from the same site, 
> anybody has this somewhere else available?
> 
Never mind, I got the file; the server was reachable again.

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] how many resources does linux-ha can handle?

2007-07-20 Thread Sebastian Reitenbach
Hi list,

I am trying to set up a two-node cluster with lots of cloned services: ldap, 
dns, squid, cups, tftp, active-active nfs, ... Each of the two nodes is a 
member of nine VLANs. For each service, a group of 9 virtual addresses is 
configured. Every resource is monitored, and in case it fails, the node 
should be fenced.
Up to about 40 or 50 resources, everything works as expected. When suspending 
or reactivating a node, some resources start to fail and the GUI becomes so 
unresponsive that I have to restart it. When I add more resources, things get 
wilder still: when I suspend or rejoin a node via the GUI, the GUI freezes, 
and then crm_mon too is unable to connect to the cluster on any node, so that 
heartbeat has to be restarted.

I don't want to add my resources as cib file here, because it is more than 
20 pages printed out :)

How many resources can a linux-ha cluster manage? Would it help to tweak some 
timings, and if so, which ones? Or would it reduce the load if I e.g. changed 
the IPaddr resource to manage a group of aliases for each VLAN?

any experiences and hints appreciated.

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] active active failover NFS server?

2007-07-20 Thread Sebastian Reitenbach
Hi,

> 
> You can start here: http://linux-ha.org/HaNFS
> 
> > 
Thanks a lot. There I found a link to an active-active NFS HA tutorial at 
http://chilli.linuxmds.com/~mschilli/NFS/active-active-nfs.html; 
unfortunately I do not get an IP for the hostname, so I only found it in the 
Google cache. They use exportfs [-u] to add or remove exports on the NFS 
servers.
A HA-NFS.tar is mentioned there, to be downloaded from the same site; does 
anybody have it available somewhere else?

kind regards

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] active active failover NFS server?

2007-07-19 Thread Sebastian Reitenbach
Hi list,

I am going to build an active-active NFS server pair, where one node exports 
a public directory and the other the home directories. In case one fails, 
both should be exported by the remaining server.
I have shared storage on a SAN connected to both servers, and I use the 
Filesystem OCF script to mount/umount the partitions (ext3; ocfs2 doesn't 
have ACLs, and I cannot get GFS2 to work). Therefore I cannot run an NFS 
server clone, because I cannot umount a partition while the NFS server still 
lives on it and the shared IP is wandering. I only see the LSB script 
available for managing the nfsserver, and with the LSB script only one NFS 
server can be started or stopped.


So I have to configure two NFS resources using the LSB script, so that both 
can live on different servers. But when I manually tell one NFS resource to 
move to another server, both NFS resources become unavailable for a short 
time. I also saw problems when a dead node comes back into the cluster; 
again, both NFS server resources were unavailable for a short time.

Another option would be to create an OCF script (I haven't found one) to 
manage the nfsserver. In the manual page of rpc.mountd I have seen that it is 
possible to specify the exports file and the port explicitly. But I don't 
know what other kinds of problems I might get, or whether it is possible to 
run two NFS servers in parallel.
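
The building blocks for that would be along these lines (a sketch: paths, the 
network and the port are made up, and I have not verified that two mountd 
instances coexist cleanly):

   # second mountd instance with its own exports file and port
   rpc.mountd --exports-file /etc/exports.homes --port 20049
   # add and remove an export at failover time
   exportfs -o rw,no_root_squash 192.168.0.0/24:/export/homes
   exportfs -u 192.168.0.0/24:/export/homes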

Does anybody have an idea?

kind regards
Sebastian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems