[Linux-HA] which ipmi stonith plugin to use?
Hi,

I want to set up a cluster running on IBM servers. I've seen that there is an internal ipmilan and an external/ipmi stonith plugin available. Is there any recommendation as to which one I should use?

kind regards
Sebastian
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
[Linux-HA] ldirectord load balancing decision on rtt time?
Hello everybody,

I have an FTP server connected to the internet through two 16 MBit DSL lines. On another root server in a remote location I set up two openvpn tunnels to the FTP server, each tunnel over a separate DSL line. I want to upload files of heavily varying sizes and distribute the traffic across both DSL lines as evenly as possible, so that both lines are saturated when there are multiple uploads.

Due to the nature of FTP, I'd only load balance the control connection and let the data connection go directly from the client to the server. Therefore round robin, or similar connection-count based scheduling algorithms, don't work that well.

Can ldirectord decide which server to use as the next target based on, e.g., round trip times of ICMP packets? For example: one upload is running, so one DSL line is more or less fully saturated. When the next connection comes in, ldirectord would check the RTT of ICMP packets to both FTP servers via the tunnels and pick the next target based on the RTTs it measures.

Or can ldirectord decide based on the timings of the health checks? E.g. whenever a new connection comes in, a health check to the FTP servers is initiated, and the one with the fastest answer gets the next connection.

As far as I read the ldirectord manual page, the available scheduling algorithms do not seem to provide such functionality. I've seen that I can use external scripts for health checking, but these seem to only return alive or dead, not a qualitative statement about how quickly the servers can be reached.

In case ldirectord cannot help me right now, where should I look in the code to get an idea of how to implement something like the above as a scheduling algorithm?

Here I read that there is also a kernel module managing FTP connections when using ldirectord for LVS: http://www.ultramonkey.org/3/topologies/lb-eg.html
But they only mention the old 2.4 Linux kernel, so I wonder whether that would work with a modern 2.6 kernel?

kind regards
Sebastian
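The selection rule asked about in this message (send the next connection to whichever target currently answers fastest) can be sketched in a few lines of Python. This is only an illustration of the decision itself, not an ldirectord feature or scheduler: the function name `pick_lowest_rtt` and the tunnel names are hypothetical, and the RTT values are assumed to come from some separate probe that is not shown.

```python
from typing import Mapping, Optional


def pick_lowest_rtt(rtt_ms: Mapping[str, Optional[float]]) -> Optional[str]:
    """Return the name of the target with the lowest round trip time.

    rtt_ms maps a target name to its last measured RTT in milliseconds,
    or to None when the probe failed (the target is then skipped).
    Returns None if no target answered at all.
    """
    # Keep only the targets whose probe actually returned a value.
    reachable = {name: rtt for name, rtt in rtt_ms.items() if rtt is not None}
    if not reachable:
        return None
    # The target with the smallest RTT is assumed to sit on the less loaded line.
    return min(reachable, key=reachable.get)


if __name__ == "__main__":
    # Hypothetical measurements: tunnel-a is busy with an upload, tunnel-b is idle.
    print(pick_lowest_rtt({"tunnel-a": 180.0, "tunnel-b": 35.5}))
```

A single RTT sample is noisy, so a real implementation would probably average several probes before deciding; that refinement is left out here.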
Re: [Linux-HA] mgmtd not starting on opensuse 11 i386 (unresolved symbol)
Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: > Hi, > > On Wed, Jul 30, 2008 at 08:55:53AM +0200, Sebastian Reitenbach wrote: > > General Linux-HA mailing list wrote: > > > On Mon, Jul 28, 2008 at 05:52:10PM -, root wrote: > > > > Hi, > > > > Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: > > > > > Hi, > > > > > > > > > > On Mon, Jul 28, 2008 at 04:41:27PM +0200, Sebastian Reitenbach wrote: > > > > > > Hi, > > > > > > > > > > > > I just upgraded my desktop to opensuse 11.0 i586, and updated the > > box, > > > > then > > > > > > installed the heartbeat rpm's 2.1.3 from download.opensuse.org. > > > > > > > > > > > > I've these rpm's installed right now: > > > > > > pacemaker-heartbeat-0.6.5-8.2 > > > > > > heartbeat-common-2.1.3-23.1 > > > > > > heartbeat-resources-2.1.3-23.1 > > > > > > heartbeat-2.1.3-23.1 > > > > > > pacemaker-pygui-1.4-1.3 > > > > > > > > > > > > I've added these lines to /etc/ha.d/ha.cf to start mgmtd > > automatically: > > > > > > apiauth mgmtd uid=root > > > > > > respawn root/usr/lib/heartbeat/mgmtd -v > > > > > > > > > > > > but mgmtd fails to start, when I try to start it on the commandline, > > then > > > > I > > > > > > see the following output: > > > > > > > > > > > > /usr/lib/heartbeat/mgmtd: symbol lookup > > error: /usr/lib/libpe_status.so.2: > > > > > > undefined symbol: stdscr > > > > > > > > > > > > As far as I researched now, the stdscr symbol is expected to come > > from > > > > > > ncurses? > > > > > > > > > > Looks like a dependency problem. Does the package containing > > > > > mgmtd depend on the ncurses library? Though I don't understand > > > > > why mgmtd needs ncurses. > > > > I found this out, in a thread in some m/l, regarding the error message > > about > > > > the undefined symbol, but maybe this is just wrong. > > > > > > stdscr is an external variable defined in ncurses.h which is > > > included from ./lib/crm/pengine/unpack.h which is part of the > > > code that gets built in libpe_status. 
The pacemaker rpm, which > > > includes that library, does depend on libncurses. Is that the > > > case with the pacemaker you downloaded? > > I've these installed: > > rpm -qa | grep -i ncurs > > ncurses-utils-5.6-83.1 > > libncurses5-5.6-83.1 > > yast2-ncurses-pkg-2.16.14-0.1 > > yast2-ncurses-2.16.27-8.1 > > > > rpm -q --requires pacemaker-heartbeat > > /bin/sh > > /bin/sh > > /sbin/ldconfig > > /sbin/ldconfig > > rpmlib(PayloadFilesHavePrefix) <= 4.0-1 > > rpmlib(CompressedFileNames) <= 3.0.4-1 > > /bin/sh > > /usr/bin/python > > libbz2.so.1 > > libc.so.6 > > libc.so.6(GLIBC_2.0) > > libc.so.6(GLIBC_2.1) > > libc.so.6(GLIBC_2.1.3) > > libc.so.6(GLIBC_2.2) > > libc.so.6(GLIBC_2.3) > > libc.so.6(GLIBC_2.3.4) > > libc.so.6(GLIBC_2.4) > > libccmclient.so.1 > > libcib.so.1 > > libcrmcluster.so.1 > > libcrmcommon.so.2 > > libdl.so.2 > > libgcrypt.so.11 > > libglib-2.0.so.0 > > libgnutls.so.26 > > libgnutls.so.26(GNUTLS_1_4) > > libgpg-error.so.0 > > libhbclient.so.1 > > liblrm.so.0 > > libltdl.so.3 > > libm.so.6 > > libncurses.so.5 > > libpam.so.0 > > libpam.so.0(LIBPAM_1.0) > > libpcre.so.0 > > libpe_rules.so.2 > > libpe_status.so.2 > > libpengine.so.3 > > libplumb.so.1 > > librt.so.1 > > libstonithd.so.0 > > libtransitioner.so.1 > > libxml2.so.2 > > libz.so.1 > > rpmlib(PayloadIsLzma) <= 4.4.2-1 > > > > > > rpm -ql libncurses5-5.6-83.1 > > /lib/libncurses.so.5 > > /lib/libncurses.so.5.6 > > ... > > > > so it does require ncurses, but it is installed. > > > > > > but > > nm /lib/libncurses.so.5.6 > > nm: /lib/libncurses.so.5.6: no symbols > > That's fine, it means that the binary is stripped. If you take a > look at libncurses.a (which is probably only in the development > package), you should see some symbols. 
BTW, you can also try
> objdump with -T:
>
> $ objdump -T libncurses.so.5 | grep stdscr
> 0015a630 g    DO .bss 0008 Base stdscr

Here I have:

objdump -T /lib64/libncurses.so.5 | grep stdscr
002465e8 g    DO .bss 0008 Base stdscr

Meanwhile I observed the problem on an opensuse 10.3 i386 and on an opensuse 11 x86_64 too. It seems there is a general problem with this version.

kind regards
Sebastian
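The manual check above (grep the `objdump -T` output for the symbol) can also be scripted. The following is a minimal sketch that only parses text already produced by `objdump -T`; the function name `defines_symbol` is hypothetical, and the second sample line (an undefined reference in the `*UND*` section) is an assumed example, not output taken from this thread.

```python
def defines_symbol(objdump_output: str, symbol: str) -> bool:
    """Return True if `objdump -T` style output lists `symbol` as defined.

    A line counts when its last field is the symbol name and the line does
    not refer to the undefined section (*UND*), which marks symbols the
    library merely imports from elsewhere.
    """
    for line in objdump_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == symbol and "*UND*" not in fields:
            return True
    return False


if __name__ == "__main__":
    # Line as quoted in the mail above.
    exported = "0015a630 g    DO .bss 0008 Base stdscr"
    # Assumed example of a library that only references the symbol.
    imported = "00000000      DO *UND* 00000000 stdscr"
    print(defines_symbol(exported, "stdscr"))
    print(defines_symbol(imported, "stdscr"))
```

The output to feed in could be captured with something like `subprocess.run(["objdump", "-T", path], capture_output=True, text=True).stdout`, assuming objdump is installed on the box.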
[Linux-HA] problem with pingd updating the state to attrd
Hi,

I'm on SLES10 and use heartbeat-2.1.3. It has worked for weeks, but for some reason, after a reboot of the two node cluster, pingd on one of the hosts has a problem telling attrd that it can ping the ping node:

heartbeat[3890]: 2008/07/29_07:44:51 info: glib: ping heartbeat started.
heartbeat[3890]: 2008/07/29_07:44:53 info: Status update for node 192.168.0.1: status ping
cib[3932]: 2008/07/29_07:45:06 info: log_data_element: readCibXmlFile: [on-disk]
cib[3932]: 2008/07/29_07:45:06 info: log_data_element: readCibXmlFile: [on-disk]
cib[3932]: 2008/07/29_07:45:06 info: log_data_element: readCibXmlFile: [on-disk]
cib[3932]: 2008/07/29_07:45:06 info: log_data_element: readCibXmlFile: [on-disk]
cib[3932]: 2008/07/29_07:45:06 info: log_data_element: readCibXmlFile: [on-disk]
cib[3932]: 2008/07/29_07:45:06 info: log_data_element: readCibXmlFile: [on-disk]
attrd[3935]: 2008/07/29_07:46:35 info: find_hash_entry: Creating hash entry for pingd
attrd[3935]: 2008/07/29_07:46:35 info: attrd_perform_update: Sent delete -22: pingd (null) status
heartbeat[3890]: 2008/07/29_07:49:12 ERROR: MSG: Dumping message with 23 fields

On the boxmaster102 everything is fine, but on the boxmaster101 pingd has the problem shown above.

Any idea how to get pingd to update attrd again?

kind regards
Sebastian
Re: [Linux-HA] mgmtd not starting on opensuse 11 i386 (unresolved symbol)
General Linux-HA mailing list wrote: > On Mon, Jul 28, 2008 at 05:52:10PM -, root wrote: > > Hi, > > Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > > > On Mon, Jul 28, 2008 at 04:41:27PM +0200, Sebastian Reitenbach wrote: > > > > Hi, > > > > > > > > I just upgraded my desktop to opensuse 11.0 i586, and updated the box, > > then > > > > installed the heartbeat rpm's 2.1.3 from download.opensuse.org. > > > > > > > > I've these rpm's installed right now: > > > > pacemaker-heartbeat-0.6.5-8.2 > > > > heartbeat-common-2.1.3-23.1 > > > > heartbeat-resources-2.1.3-23.1 > > > > heartbeat-2.1.3-23.1 > > > > pacemaker-pygui-1.4-1.3 > > > > > > > > I've added these lines to /etc/ha.d/ha.cf to start mgmtd automatically: > > > > apiauth mgmtd uid=root > > > > respawn root/usr/lib/heartbeat/mgmtd -v > > > > > > > > but mgmtd fails to start, when I try to start it on the commandline, then > > I > > > > see the following output: > > > > > > > > /usr/lib/heartbeat/mgmtd: symbol lookup error: /usr/lib/libpe_status.so.2: > > > > undefined symbol: stdscr > > > > > > > > As far as I researched now, the stdscr symbol is expected to come from > > > > ncurses? > > > > > > Looks like a dependency problem. Does the package containing > > > mgmtd depend on the ncurses library? Though I don't understand > > > why mgmtd needs ncurses. > > I found this out, in a thread in some m/l, regarding the error message about > > the undefined symbol, but maybe this is just wrong. > > stdscr is an external variable defined in ncurses.h which is > included from ./lib/crm/pengine/unpack.h which is part of the > code that gets built in libpe_status. The pacemaker rpm, which > includes that library, does depend on libncurses. Is that the > case with the pacemaker you downloaded? 
I've these installed:

rpm -qa | grep -i ncurs
ncurses-utils-5.6-83.1
libncurses5-5.6-83.1
yast2-ncurses-pkg-2.16.14-0.1
yast2-ncurses-2.16.27-8.1

rpm -q --requires pacemaker-heartbeat
/bin/sh
/bin/sh
/sbin/ldconfig
/sbin/ldconfig
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(CompressedFileNames) <= 3.0.4-1
/bin/sh
/usr/bin/python
libbz2.so.1
libc.so.6
libc.so.6(GLIBC_2.0)
libc.so.6(GLIBC_2.1)
libc.so.6(GLIBC_2.1.3)
libc.so.6(GLIBC_2.2)
libc.so.6(GLIBC_2.3)
libc.so.6(GLIBC_2.3.4)
libc.so.6(GLIBC_2.4)
libccmclient.so.1
libcib.so.1
libcrmcluster.so.1
libcrmcommon.so.2
libdl.so.2
libgcrypt.so.11
libglib-2.0.so.0
libgnutls.so.26
libgnutls.so.26(GNUTLS_1_4)
libgpg-error.so.0
libhbclient.so.1
liblrm.so.0
libltdl.so.3
libm.so.6
libncurses.so.5
libpam.so.0
libpam.so.0(LIBPAM_1.0)
libpcre.so.0
libpe_rules.so.2
libpe_status.so.2
libpengine.so.3
libplumb.so.1
librt.so.1
libstonithd.so.0
libtransitioner.so.1
libxml2.so.2
libz.so.1
rpmlib(PayloadIsLzma) <= 4.4.2-1

rpm -ql libncurses5-5.6-83.1
/lib/libncurses.so.5
/lib/libncurses.so.5.6
...

So it does require ncurses, and ncurses is installed. But:

nm /lib/libncurses.so.5.6
nm: /lib/libncurses.so.5.6: no symbols

kind regards
Sebastian
[Linux-HA] mgmtd not starting on opensuse 11 i386 (unresolved symbol)
Hi,

I just upgraded my desktop to opensuse 11.0 i586, updated the box, and then installed the heartbeat 2.1.3 rpms from download.opensuse.org.

I have these rpms installed right now:
pacemaker-heartbeat-0.6.5-8.2
heartbeat-common-2.1.3-23.1
heartbeat-resources-2.1.3-23.1
heartbeat-2.1.3-23.1
pacemaker-pygui-1.4-1.3

I've added these lines to /etc/ha.d/ha.cf to start mgmtd automatically:
apiauth mgmtd uid=root
respawn root /usr/lib/heartbeat/mgmtd -v

But mgmtd fails to start. When I try to start it on the command line, I see the following output:

/usr/lib/heartbeat/mgmtd: symbol lookup error: /usr/lib/libpe_status.so.2: undefined symbol: stdscr

As far as I have researched so far, the stdscr symbol is expected to come from ncurses? Can I link the existing library/binary against ncurses without the need to recompile?

kind regards
Sebastian
Re: [Linux-HA] strange behaviour of resources in a group when one resource is stopped
"Andrew Beekhof" <[EMAIL PROTECTED]> wrote: > On Wed, Feb 27, 2008 at 6:05 PM, Sebastian Reitenbach > <[EMAIL PROTECTED]> wrote: > > Hi, > > > > I have a two node cluster, the resources are divided into two groups. Both > > groups are collocated=false and ordered=false. There are some constraints to > > keep the group on a given host, and some orders, but nothing what could > > explain me why happened what I have seen. > > > > I stopped one resource in group_Master102, AFH_5, so this one stopped, > > with target_role? I clicked in the GUI, and said stop, so whatever the GUI is doing. > > > but > > all other resources in that group were restarted too. I cannot see the > > reason why the other resources in that group were restarted too, > > stopped and restarted or just sent an additional start action? I've seen it stopping and restarting in the GUI, and with crm_mon. For a short time, the resource was in state stopped, and then started again. > > > when I just > > wanted to shutdown only one of them. > > I expected only the AFH_5 resource to be stopped, and the others just stay > > running. When starting a stopped resource in a group, this phenomenon does > > not happen. I've appended the output of hb_report tool. > > > > I'm on sles10sp1 x86_64, running heartbeat 2.1.2. > > Sebastian ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] question regarding orderings in resource groups
Hi,

Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >
> > but what about the other thing I mentioned, is this then a bug?
> > with the three resources in the collocated, unordered group. I've
> > seen the second and third resource stopping, when I shutdown the
> > second, but the first still left running.
>
> Thats the correct behavior (and by design actually).

Thanks for pointing that out.

Sebastian
Re: [Linux-HA] problem with resource groups and colocations
Hi,

General Linux-HA mailing list wrote:
> On Tue, Mar 11, 2008 at 2:32 PM, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> >
> > On Tue, Mar 11, 2008 at 11:56:07AM +0100, Andreas Kurz wrote:
> > > On Tue, Mar 11, 2008 at 11:02 AM, Sebastian Reitenbach
> > > <[EMAIL PROTECTED]> wrote:
> > > > Hi,
> > > >
> > > > I want to achieve the following:
> > > > I have two groups of resources, these shall run on the same host, and
> > > > startup in a given order.
> > > >
> > > > Therefore I created an order and an collocation constraint.
> > > > So group1 starts before group2, and the collocation says, if group1 is not
> > > > able to run on a node, group2 will not start.
> > > >
> > > > However, if all resources in group1 are started, then the resources in
> > > > group2 are started too. But when I then shutdown any single resource in
> > > > group1, then group2 stops working too.
> > > > I am not sure, whether my collocation or order is the reason for the
> > > > observed behavior.
> > >
> > > I have not tried it by myself but there are these "Advisory-Only
> > > Ordering" constraints:
> > >
> > >
> >
> > Good advice. Though even when all of the group1 is stopped,
> > group2 won't stop either.
>
> According to the documentation it should ... and after doing some
> ptests, I see it does ;-). If the complete group1 is stopped, group2
> is stopped to. If only one resource in group1 is stopped, group2 does
> nothing.

Sorry for my late reply, I had no time in between to test. I have to say thanks a lot: this works very well and is exactly what I wanted.

kind regards
Sebastian
[Linux-HA] problem with resource groups and colocations
Hi,

I want to achieve the following: I have two groups of resources that shall run on the same host and start up in a given order.

Therefore I created an order and a collocation constraint. So group1 starts before group2, and the collocation says that if group1 is not able to run on a node, group2 will not start.

When all resources in group1 are started, the resources in group2 are started too, as expected. But when I then shut down any single resource in group1, group2 stops working too. I am not sure whether my collocation or my order constraint is the reason for the observed behavior.

From what I have observed so far, it seems that when I stop just one resource in the group, the group itself is seen as stopped, and group2 then stops too.

My question is: is there a parameter or something similar to keep the group as "started" until all resources in the group, or the group as a whole, are set to be stopped? Is that possible? Otherwise I'd have to use very many single resources, and very many order and collocation constraints.

cheers
Sebastian
Re: [Linux-HA] ERROR: crm_abort: ha_set_tm_time: Triggered assert at iso8601.c:887
Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> On 2008-02-29T08:30:37, Sebastian Reitenbach <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I've seen these messages appearing when I connect the hb_gui to the mgmtd:
> >
> > mgmtd[6819]: 2008/02/29_08:03:51 ERROR: crm_abort: ha_set_tm_time: Triggered
> > assert at iso8601.c:887 : rhs->tm_mday < 0 || lhs->days == rhs->tm_mday
>
> Hey, besides this being an obviously fairly embarrassing bug ;-), did
> anyone actually observe any misbehaviour except the dangerously looking
> error message?

No, the message was just disturbing; I did not notice any other problems related to it. The error also disappeared from the crm_verify -LV output on the 1st of March.

>
> Feedback on this question would greatly help in assessing the urgency
> with which we need to push the update. My guess is that it doesn't
> affect the actual operation of the cluster, except scaring the hell out
> of the admin ...
>

cheers
Sebastian
Re: [Linux-HA] ERROR: crm_abort: ha_set_tm_time: Triggered assert at iso8601.c:887
Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Fri, Feb 29, 2008 at 01:34:38PM +0100, Sebastian Reitenbach wrote:
> > "Damon Estep" <[EMAIL PROTECTED]> wrote:
> > > I am getting it too, after midnight on 2/29 in leap year - looks like a
> > > date bug to me :)
> > that would explain that I haven't seen it before, and hopefully it will be
> > fixed automagically by tomorrow ;)
> >
> > maybe I should create a bug report so that it gets fixed.
>
> Yes, please.

There it is: http://developerbugs.linux-foundation.org/show_bug.cgi?id=1850

cheers
Sebastian

>
> Thanks,
>
> Dejan
RE: [Linux-HA] ERROR: crm_abort: ha_set_tm_time: Triggered assert at iso8601.c:887
"Damon Estep" <[EMAIL PROTECTED]> wrote: > I am getting it too, after midnight on 2/29 in leap year - looks like a > date bug to me :) that would explain that I haven't seen it before, and hopefully it will be fixed automagically by tomorrow ;) maybe I should create a bug report so that it gets fixed. thanks Sebastian > > > -Original Message- > > From: [EMAIL PROTECTED] [mailto:linux-ha- > > [EMAIL PROTECTED] On Behalf Of Sebastian Reitenbach > > Sent: Friday, February 29, 2008 12:31 AM > > To: linux-ha@lists.linux-ha.org > > Subject: [Linux-HA] ERROR: crm_abort: ha_set_tm_time: Triggered assert > > atiso8601.c:887 > > > > Hi, > > > > I've seen these messages appearing when I connect the hb_gui to the > > mgmtd: > > > > mgmtd[6819]: 2008/02/29_08:03:51 ERROR: crm_abort: ha_set_tm_time: > > Triggered > > assert at iso8601.c:887 : rhs->tm_mday < 0 || lhs->days == > rhs->tm_mday > > > > and they are also shown in the output of crm_verify: > > > > crm_verify -LV > > crm_verify[17297]: 2008/02/29_08:24:25 ERROR: crm_abort: > > ha_set_tm_time: > > Triggered assert at iso8601.c:887 : rhs->tm_mday < 0 || lhs->days == > > rhs->tm_mday > > > > I've made sure using ntp, that both cluster nodes have the same time. > > I'm wondering what this message is all about, and how I could clean > > that up? > > To clean up failed resources, I'd take crm_resource, but how do I > clean > > this? I already tried to shutdown both cluster nodes, and then > removing > > all > > vital stuff from /var/lib/heartbeat, e.g. rm crm/* h* delhostcache > > pengine/*, and then restarted heartbeat, and loaded the cluster > > configuration again. Then when connecting the hb_gui again, the error > > message showed up again. 
> > > > I'm on sles10sp1, running heartbeat 2.1.3 > > > > kind regards > > Sebastian > > > > ___ > > Linux-HA mailing list > > Linux-HA@lists.linux-ha.org > > http://lists.linux-ha.org/mailman/listinfo/linux-ha > > See also: http://linux-ha.org/ReportingProblems > ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] ERROR: crm_abort: ha_set_tm_time: Triggered assert at iso8601.c:887
Hi,

I've seen these messages appearing when I connect the hb_gui to the mgmtd:

mgmtd[6819]: 2008/02/29_08:03:51 ERROR: crm_abort: ha_set_tm_time: Triggered assert at iso8601.c:887 : rhs->tm_mday < 0 || lhs->days == rhs->tm_mday

and they are also shown in the output of crm_verify:

crm_verify -LV
crm_verify[17297]: 2008/02/29_08:24:25 ERROR: crm_abort: ha_set_tm_time: Triggered assert at iso8601.c:887 : rhs->tm_mday < 0 || lhs->days == rhs->tm_mday

I've made sure, using ntp, that both cluster nodes have the same time. I'm wondering what this message is all about and how I could clean it up. To clean up failed resources I'd use crm_resource, but how do I clean this? I already tried shutting down both cluster nodes, removing all vital stuff from /var/lib/heartbeat (e.g. rm crm/* h* delhostcache pengine/*), restarting heartbeat, and loading the cluster configuration again. When I then connected the hb_gui again, the error message showed up again.

I'm on sles10sp1, running heartbeat 2.1.3.

kind regards
Sebastian
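The replies in this thread note that the assertion first showed up after midnight on 2/29 in a leap year. The sketch below is not the Heartbeat code in iso8601.c, and the actual cause is not stated in the thread; it only illustrates, under that assumption, one common way a day-of-month check can fail on that date: a day-of-year conversion with a fixed 28-day February yields March 1 where the system calendar says February 29. The function `month_and_day` and its `leap_aware` switch are hypothetical.

```python
import calendar


def month_and_day(year: int, yearday: int, leap_aware: bool = True):
    """Convert a 1-based day of the year into a (month, day) tuple."""
    lengths = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    if leap_aware and calendar.isleap(year):
        lengths[1] = 29  # February gets its extra day in leap years
    remaining = yearday
    for month, length in enumerate(lengths, start=1):
        if remaining <= length:
            return month, remaining
        remaining -= length
    raise ValueError("yearday out of range for this year")


if __name__ == "__main__":
    # 2008-02-29 is day 60 of 2008.
    print(month_and_day(2008, 60))                    # leap-aware conversion
    print(month_and_day(2008, 60, leap_aware=False))  # fixed 28-day February
```

With the two results disagreeing on the day of the month, a consistency check comparing them would trip on exactly one day every four years, which would match the observation that the message went away on the 1st of March.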
Re: [Linux-HA] meta_attributes twice for some resources
Hi,

Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Wed, Feb 27, 2008 at 05:30:43PM +0100, Lars Marowsky-Bree wrote:
> > On 2008-02-27T17:18:32, Sebastian Reitenbach <[EMAIL PROTECTED]> wrote:
> >
> > > Your conclusions are more or less the same, I had.
> > > However, I'll create a bug report later. unfortunately, still no idea how it
> > > happened. We removed the duplicate entries, replacing the CIB
> > > (cibadmin -R -o resources), then everything was working normal again. We try
> > > to fiddle around a bit with the cluster, to try to reproduce the problem.
> > > When we figure out, what caused it, then I'll add this to the bug report
> > > too, but I'm not very optimistic about that yet ;)
> >
> > All commandline tools log how they were invoked. All CIB states are
> > archived in /var/lib/heartbeat/pengine; but yes, this looks as if it was
> > caused by the GUI somehow.
>
> Could be. If that's the case, then it's really a nuisance. Or it
> could have been the cibadmin. I guess that crm_resource wouldn't
> create another meta_attributes section if there's already one
> present.
>

I haven't been able to reproduce the problem yet, but I created a bug report:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=1848

sebastian
[Linux-HA] strange behaviour of resources in a group when one resource is stopped
Hi,

I have a two node cluster whose resources are divided into two groups. Both groups have collocated=false and ordered=false. There are some constraints to keep each group on a given host, and some order constraints, but nothing that could explain what I have seen.

I stopped one resource in group_Master102, AFH_5. That one stopped, but all other resources in that group were restarted too. I cannot see why the other resources in that group were restarted when I only wanted to shut down one of them. I expected only the AFH_5 resource to be stopped and the others to just stay running. When starting a stopped resource in a group, this phenomenon does not happen. I've attached the output of the hb_report tool.

I'm on sles10sp1 x86_64, running heartbeat 2.1.2.

kind regards
Sebastian

report.out2.tar.gz Description: application/compressed
Re: [Linux-HA] meta_attributes twice for some resources
Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: > Hi, > > On Tue, Feb 26, 2008 at 06:21:58PM +0100, Sebastian Reitenbach wrote: > > Hi, > > > > I was wondering today why some of the resources in the cluster behaved > > strangely, e.g. did not reacted on "start/stop/clean up" when clicking in > > the GUI. Then I tried this with crm_resource, and it was whining because the > > target_role matched twice, and it did not know, which one to use. So I took > > a look at the CIB, and found sth. like below, for a bunch of resources: > > > > > > > > > > > value="/pps/sw/bin/PPS/Control-BoxThread"/> > > > name="dropbox_pid_file" value="/var/run/Dropbox/ArchiveFileHandler.pid"/> > > > value="4"/> > > > > > > > > > > > > > > > > > value="stopped"/> > > > > > > > > > > > > note the double meta_attributes. I know, I've no configuration files here, > > as I have no idea, when this happened in the last one or two days. I'm just > > asking, maybe someone has seen sth. like this before, and maybe could share > > the info what might have caused it? > > Really can't say. The one with the longish id was most probably > created by the GUI. The CRM won't touch an attribute if it finds > more than one. Then the id must be specified as well. The only > solution is to drop one of the meta_attributes or > instance_attribute sets. The GUI however can't do that > automatically (though it probably wouldn't even try at this > stage) as only a human can figure out which one should be gone. > I'm not sure what is the benefit of having more than one set of > meta_attributes in a resource. This is not exactly a bug, but I > think that it deserves a bugzilla entry since it leads to a very > confusing and unexpected behaviour. Can you please file one? Your conclusions are more or less the same, I had. However, I'll create a bug report later. unfortunately, still no idea how it happened. 
We removed the duplicate entries by replacing the CIB (cibadmin -R -o resources); after that everything was working normally again. We will fiddle around a bit with the cluster to try to reproduce the problem. When we figure out what caused it, I'll add that to the bug report too, but I'm not very optimistic about that yet ;)

thanks
Sebastian
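Finding the affected resources by eye in a large CIB is tedious, so a small script can list every resource that carries more than one meta_attributes set. This is a sketch under assumptions: the XML fragment is a made-up stand-in (the original snippet was stripped from the archive), the ids `rsc_dup`, `rsc_ok` and friends are hypothetical, and the function name `duplicated_meta_attributes` is not part of any Heartbeat tool. The real input would presumably come from a query such as `cibadmin -Q -o resources`.

```python
import xml.etree.ElementTree as ET

# Hypothetical resources section: rsc_dup has two meta_attributes sets
# that both define target_role, rsc_ok has only one.
SAMPLE_RESOURCES = """
<resources>
  <primitive id="rsc_dup" class="ocf" provider="heartbeat" type="Dummy">
    <meta_attributes id="rsc_dup_meta_a">
      <attributes>
        <nvpair id="rsc_dup_role_a" name="target_role" value="started"/>
      </attributes>
    </meta_attributes>
    <meta_attributes id="rsc_dup_meta_b">
      <attributes>
        <nvpair id="rsc_dup_role_b" name="target_role" value="stopped"/>
      </attributes>
    </meta_attributes>
  </primitive>
  <primitive id="rsc_ok" class="ocf" provider="heartbeat" type="Dummy">
    <meta_attributes id="rsc_ok_meta">
      <attributes>
        <nvpair id="rsc_ok_role" name="target_role" value="started"/>
      </attributes>
    </meta_attributes>
  </primitive>
</resources>
"""


def duplicated_meta_attributes(resources_xml: str) -> dict:
    """Map each element id to its meta_attributes ids when it has several."""
    root = ET.fromstring(resources_xml.strip())
    duplicates = {}
    for element in root.iter():
        # findall with a bare tag name only matches direct children.
        sets = element.findall("meta_attributes")
        if len(sets) > 1:
            duplicates[element.get("id")] = [s.get("id") for s in sets]
    return duplicates


if __name__ == "__main__":
    print(duplicated_meta_attributes(SAMPLE_RESOURCES))
```

Which of the two sets to drop still has to be decided by a human, as noted in the quoted reply; the script only points at the candidates.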
Re: [Linux-HA] meta_attributes twice for some resources
Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA mailing list wrote: > Hi, > > I was wondering today why some of the resources in the cluster behaved > strangely, e.g. did not reacted on "start/stop/clean up" when clicking in > the GUI. Then I tried this with crm_resource, and it was whining because the > target_role matched twice, and it did not know, which one to use. So I took > a look at the CIB, and found sth. like below, for a bunch of resources: > > > > > value="/pps/sw/bin/PPS/Control-BoxThread"/> > name="dropbox_pid_file" value="/var/run/Dropbox/ArchiveFileHandler.pid"/> > value="4"/> > > > > > > > > value="stopped"/> > > > > > > note the double meta_attributes. I know, I've no configuration files here, > as I have no idea, when this happened in the last one or two days. I'm just > asking, maybe someone has seen sth. like this before, and maybe could share > the info what might have caused it? forgot to mention, I'm running heartbeat-2.1.2-28.1, on openSUSE 10.2 i586 Sebastian ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] meta_attributes twice for some resources
Hi,

I was wondering today why some of the resources in the cluster behaved strangely, e.g. did not react to "start/stop/clean up" when clicking in the GUI. Then I tried this with crm_resource, and it complained because target_role matched twice and it did not know which one to use. So I took a look at the CIB and found something like the below for a bunch of resources:

Note the double meta_attributes. I know I have no configuration files to show here, as I have no idea when in the last one or two days this happened. I'm just asking whether someone has seen something like this before and could share what might have caused it.

thanks
Sebastian
Re: [Linux-HA] question regarding orderings in resource groups
Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> On 2008-02-19T15:49:28, Sebastian Reitenbach <[EMAIL PROTECTED]> wrote:
>
> > > Make rsc 'from' run on the same machine as rsc 'to'
> > >
> > > If rsc 'to' cannot run anywhere and 'score' is INFINITY,
> > > then rsc 'from' wont be allowed to run anywhere either
> > > If rsc 'from' cannot run anywhere, then 'to' wont be affected
> > >
> > > -->
> > >
> > > (You can force this to be bidirectional if you set symmetrical to true for the
> > > colocation constraint; I don't think you can set that for groups.)
> >
> > I am aware of that, thanks. But I wanted to use groups, to not need such a
> > lot of constraints.
>
> Yeah, I agree. You'd need N:N-1 constraints to get what you want, which
> probably wouldn't make you happy ;-)
>
> You could all colocate them with another resource (if there is one they
> need to share; perhaps the fs?) This would reduce the number to N
> constraints.
>
> Or, you could use a non-colocated, non-ordered group, and then define a
> rsc_location rule to make them all run on the same node if available.

I haven't tested this yet, because I only have a one-node cluster here right now ;). However, when I try to create a location constraint via the GUI, I can only select the group as a whole, not the individual group members. When I select the group, will the group members then automatically be kept on the same node, whatever happens? That would be just one constraint. If so, I don't really understand what the collocated parameter is good for: setting it to false would not make sense in that case, and setting it to "yes" would be redundant. The collocated parameter of a group would then only make sense when it is set to yes and I have no preference about where the group should run.

> Or, a colocation constraint from that group to the resource you want to
> collocate with. I'm not sure this works. Would reduce the number to 1
> constraint.

Yes, that would be more or less the same as a location constraint for the whole group, as above.

> Groups were meant as a short-hand for the most common case, and now
> people find out other uses for them; we need to find ways how to make
> the groups more powerful, or the constraints (to reduce the need for
> more powerful groups).

But what about the other thing I mentioned, with the three resources in the collocated, unordered group: is that a bug, then? I saw the second and third resources stop when I shut down the second, while the first was left running. Going by your explanation in the other mail, I'd expect the first to be shut down too, which does not happen.

kind regards
Sebastian

> Regards,
>     Lars
>
> --
> Teamlead Kernel, SuSE Labs, Research and Development
> SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
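The three options discussed in this message differ mainly in how many constraints have to be written. A minimal sketch of that arithmetic follows; it reads "N:N-1" as N*(N-1), one directed colocation constraint per ordered pair of resources, which is an assumption about what was meant. The function names are made up for illustration.

```python
def pairwise_colocations(n: int) -> int:
    """Every resource colocated with every other one, in both directions.

    Assumption: "N:N-1" in the mail means N * (N - 1).
    """
    return n * (n - 1)


def colocate_with_shared_resource(n: int) -> int:
    """Each of the N resources is colocated with one shared resource (e.g. the fs)."""
    return n


def single_group_constraint(n: int) -> int:
    """One constraint placed on the group as a whole, whatever its size."""
    return 1


for n in (3, 5, 10):
    print(n,
          pairwise_colocations(n),
          colocate_with_shared_resource(n),
          single_group_constraint(n))
```

For the three-resource group in this thread that is 6 constraints versus 3 versus 1, which is why the group shorthand is attractive.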
Re: [Linux-HA] question regarding orderings in resource groups
Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> On 2008-02-19T12:11:26, Sebastian Reitenbach <[EMAIL PROTECTED]> wrote:
>
> > there ordered is set to false. I have the group running, and when I then
> > e.g. want to stop the resource D2, then D3 stops too. Only when I change
> > collocated to false, then D3 keeps running when I stop D2.
> >
> > Seems to be not working as I understood it. Am I missing anything important
> > here, or maybe just a bug?
>
> This is working as expected, I think. Because the resources are required
> to be collocated, but you stopped one, the others also have to stop.

I understood the collocated parameter of a group to mean that when the resources run, they have to run on the same host, and when they do not run, they simply do not run and do not influence the others. From your explanation, however, stopping any resource in a collocated group should stop all resources in that group, not only the ones listed after the resource I explicitly stopped. Yet when I stopped D2, D3 was stopped too, but D1 kept running.

> See the comment in the DTD:
>
> (You can force this to be bidirectional if you set symmetrical to true for the
> colocation constraint; I don't think you can set that for groups.)

I am aware of that, thanks. But I wanted to use groups so that I would not need so many constraints.

cheers
Sebastian
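The behaviour reported here (stopping D2 also stops D3, while D1 keeps running) is what one would see if each group member were colocated, one-directionally and with score INFINITY, with the member listed before it. That chaining is an assumption used for illustration, not something the thread confirms about the group implementation. A small model of it, with a hypothetical helper name:

```python
def members_that_stop(group, stopped):
    """Return the group members that end up stopped when `stopped` is stopped.

    Assumed model: every member depends on the member listed before it,
    and the dependency is not symmetrical. So a stop propagates to later
    members only, never back to earlier ones.
    """
    result = []
    propagating = False
    for name in group:
        if name == stopped:
            propagating = True
        if propagating:
            result.append(name)
    return result


group = ["D1", "D2", "D3"]
print(members_that_stop(group, "D2"))
print(members_that_stop(group, "D1"))
```

Under this model only stopping the first member takes the whole group down, which matches the observation but not the "all members stop" reading of collocated.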
[Linux-HA] question regarding orderings in resource groups
Hi,

as far as I understand groups, the ordered parameter means that, when set to yes, the resources in the group are started and stopped in the order in which they appear in the CIB. The collocated parameter means that, when set to yes, all resources in a group run on the same cluster node.

I just created the following resource group: there, ordered is set to false. I have the group running, and when I then want to stop, e.g., the resource D2, D3 stops too. Only when I change collocated to false does D3 keep running when I stop D2.

This does not seem to work the way I understood it. Am I missing something important here, or is it just a bug?

I'm on sles10sp1, using heartbeat 2.1.3.

cheers
Sebastian
Re: [Linux-HA] problems with quorumd
Hi,

Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Thu, Feb 07, 2008 at 05:00:14PM +0100, Sebastian Reitenbach wrote:
> > Hi,
> >
> > I have a 4 node cluster, and wanted to setup a quorum server, so that I do
> > not need three running cluster nodes to get quorum. The quorumd IP address
> > is a shared IP on another two node cluster.
> >
> > I've done the following tests, the quorumd from a 2.1.2 version of
> > heartbeat, the cluster nodes had 2.1.3 version:
> >
> > start quorumd
> > start first cluster node -> (node becomes DC, contacting the quorum) cluster
> > gets quorum
> > start second cluster node -> cluster still has quorum
> > stop DC, -> see other node becoming DC, and contacting quorum server,
> > cluster still has quorum
> > kill quorumd, then see RST packets going back to cluster node (the DC tries
> > to contact the quorumd every second) -> cluster still has quorum
> > wait 5 minutes -> cluster still has quorum
> > try to start stop a node, resource, add or remove a resource -> this works,
> > then the cluster recognizes the lost quorum
>
> After any of these actions the cluster loses quorum? Or is it
> just after the node restart?

I added a dummy resource at a time when the quorumd was not reachable. The resource got created. The default target role is stopped, so the Dummy was stopped. Before I was able to make the Dummy active, the cluster recognized that it had lost quorum and refused to make the Dummy active.

> > then restart the quorumd -> see answers going back from quorumd to DC node,
> > but cluster has no quorum again
> > wait 5 minutes -> cluster still has no quorum again
>
> I can recall that somebody else already complained about the same
> issue.

Most likely me, some months ago, fiddling around with 2.1.2 ;)

> > restart heartbeat on one of the cluster nodes -> cluster recognizes the
> > availability of quorumd and gets quorum again
> >
> > Setting a node to standby, does not make the cluster recognize that the
> > quorum got lost, or is available again.
> >
> > I also have seen, when there is a firewall, that drops packets, instead of
> > answering with RST, when the quorumd is down, then the rate when the DC
> > tries to reconnect to the quorumd drops to about once a minute, but that is
> > OK, as I'd guess its waiting for timeouts.
>
> Yes, looks like a TCP/IP property.
>
> > So in my eyes, using a quorumd does more harm than being useful, but maybe I
> > did sth. wrong?
>
> Since it has been working, you probably set it up ok. You should
> open a bugzilla for this. Sorry that I can't offer more help on
> the matter now.
>
> BTW, did you also test a split brain situation where one of the
> nodes can talk to the quorumd?

No, I have decided to run the cluster without quorumd for now. Nevertheless, I'll create a bugzilla entry.

cheers
Sebastian
[Linux-HA] problems with quorumd
Hi,

I have a 4 node cluster and wanted to set up a quorum server, so that I do not need three running cluster nodes to get quorum. The quorumd IP address is a shared IP on another two node cluster.

I've done the following tests, with the quorumd from heartbeat 2.1.2 and the cluster nodes running 2.1.3:

start quorumd
start first cluster node -> node becomes DC and contacts the quorumd, cluster gets quorum
start second cluster node -> cluster still has quorum
stop DC -> the other node becomes DC and contacts the quorum server, cluster still has quorum
kill quorumd -> RST packets go back to the cluster node (the DC tries to contact the quorumd every second), cluster still has quorum
wait 5 minutes -> cluster still has quorum
try to start or stop a node or resource, or add or remove a resource -> this works, then the cluster recognizes the lost quorum
restart the quorumd -> answers go back from the quorumd to the DC node, but the cluster does not regain quorum
wait 5 minutes -> cluster still has no quorum
restart heartbeat on one of the cluster nodes -> cluster recognizes the availability of the quorumd and gets quorum again

Setting a node to standby does not make the cluster recognize that the quorum got lost or is available again.

I have also seen that when a firewall drops the packets instead of answering with RST while the quorumd is down, the rate at which the DC tries to reconnect to the quorumd drops to about once a minute. That is OK, as I'd guess it's waiting for timeouts.

So in my eyes, using a quorumd does more harm than good, but maybe I did sth. wrong?

cheers
Sebastian
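The "three running cluster nodes" figure above follows from a simple majority rule for a 4 node cluster. A minimal sketch of that rule, assuming plain majority voting with one vote per node; how the quorumd itself contributes to the decision is not described in this thread and is not modelled here:

```python
def nodes_needed_for_quorum(total_nodes: int) -> int:
    """Smallest number of running nodes that is a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(running_nodes: int, total_nodes: int) -> bool:
    """True if the running nodes alone form a strict majority."""
    return running_nodes >= nodes_needed_for_quorum(total_nodes)


print(nodes_needed_for_quorum(4))
print(has_quorum(2, 4))
```

With four nodes a 2:2 split leaves neither half with a majority, which is the situation a separate quorum server is meant to arbitrate.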
Re: [Linux-HA] 2.1.3 suse rpm's?
Hi Andrew,

I just downloaded and configured today's heartbeat rpm's from you for sles10 x86_64. All the problems mentioned are fixed, but PAM authentication for logging in to the mgmtd with the GUI still did not work. I had to change /etc/pam.d/hbmgmtd to this:

#%PAM-1.0
[EMAIL PROTECTED] common-auth
[EMAIL PROTECTED] common-account
auth required pam_unix2.so
account required pam_unix2.so

Then it worked fine.

cheers
Sebastian
Re: [Linux-HA] 2.1.3 suse rpm's?
Andrew Beekhof <[EMAIL PROTECTED]> wrote:
>
> On Jan 23, 2008, at 8:31 PM, Sebastian Reitenbach wrote:
>
> > Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >>
> >> On Jan 23, 2008, at 7:21 PM, Sebastian Reitenbach wrote:
> >>
> >>> Hi,
> >>> Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>> it was that package - i've added it as a dependancy
> >>>>
> >>> I just installed on opensuse 10.3 i586, and pacemaker-pygui now requires the
> >>> pyxml package. Nevertheless, to be able to start hb_gui I still need to make
> >>> a symbolic link in /usr/lib/heartbeat-gui/ from _pymgmt.so.0 to _pymgmt.so.
> >>
> >> ok, i'll make sure that link gets created
> >>
> >> thanks for your help getting the kinks worked out - i've had very
> >> little to do with the gui and would much prefer to pretend it doesn't
> >> exist :-)
> >
> > yeah, but it is getting better and better with every release.
>
> true.
> the problem was that IBM pulled all their people off the project at a
> point when the GUI was barely usable.
> the good news is that some of the folks from Novell China and NTT
> Japan are getting involved and contributing some really good patches
> for it.
>
> >>> Nevertheless, hb_gui is not much useful, as the mgmtd is not able to run,
> >>> when starting mgmtd -v the following shows up in the logs:
> >>
> >> it looks like heartbeat doesn't like the user mgmtd is running as
> >>
> >> who are you running that command as?
> > I was root, just ran mgmtd from commandline.
>
> ok, i see the problem... you need to add the following two lines to
> ha.cf
>
> apiauth mgmtd uid=root
> respawn root /usr/lib/heartbeat/mgmtd -v
>
> these used to be implied by the "crm yes" line but, according to the
> logic in heartbeat.c, only when heartbeat is built with the mgmtd
> (which it no longer is)

I added these two lines, and now mgmtd starts.

> i'll add this to the FAQ
>
> >> what does your ha.cf look like?
> >
> > logfacility local0
> > crm yes
> > cluster OpenBSD-Heartbeat
> > udpport 694
> > ucast eth0 ops.ds9
> > ucast eth0 defiant.ds9
>
> > auto_failback on
>
> btw. that line has no meaning in a crm/pacemaker cluster

Thanks for pointing that out.

cheers
Sebastian
Re: [Linux-HA] 2.1.3 suse rpm's?
Andrew Beekhof <[EMAIL PROTECTED]> wrote:
>
> On Jan 23, 2008, at 7:21 PM, Sebastian Reitenbach wrote:
>
> > Hi,
> > Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >>
> >> it was that package - i've added it as a dependancy
> >>
> > I just installed on opensuse 10.3 i586, and pacemaker-pygui now requires the
> > pyxml package. Nevertheless, to be able to start hb_gui I still need to make
> > a symbolic link in /usr/lib/heartbeat-gui/ from _pymgmt.so.0 to _pymgmt.so.
>
> ok, i'll make sure that link gets created
>
> thanks for your help getting the kinks worked out - i've had very
> little to do with the gui and would much prefer to pretend it doesn't
> exist :-)

Yeah, but it is getting better and better with every release.

> > Nevertheless, hb_gui is not much useful, as the mgmtd is not able to run,
> > when starting mgmtd -v the following shows up in the logs:
>
> it looks like heartbeat doesn't like the user mgmtd is running as
>
> who are you running that command as?

I was root and just ran mgmtd from the command line.

> what does your ha.cf look like?

logfacility local0
crm yes
cluster OpenBSD-Heartbeat
udpport 694
ucast eth0 ops.ds9
ucast eth0 defiant.ds9
auto_failback on
node ops
node defiant.ds9
ping 10.0.0.1
use_logd yes

I just copied the config from my OpenBSD box, where this works just fine.

> > Jan 23 19:19:17 ops mgmtd: [29798]: info: G_main_add_SignalHandler: Added signal handler for signal 15
> > Jan 23 19:19:17 ops mgmtd: [29798]: debug: Enabling coredumps
> > Jan 23 19:19:17 ops mgmtd: [29798]: WARN: Core dumps could be lost if multiple dumps occur.
> > Jan 23 19:19:17 ops mgmtd: [29798]: WARN: Consider setting non-default value in /proc/sys/kernel/core_pattern (or equivalent) for maximum supportability
> > Jan 23 19:19:17 ops mgmtd: [29798]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability
> > Jan 23 19:19:17 ops mgmtd: [29798]: info: G_main_add_SignalHandler: Added signal handler for signal 10
> > Jan 23 19:19:17 ops mgmtd: [29798]: info: G_main_add_SignalHandler: Added signal handler for signal 12
> > Jan 23 19:19:17 ops mgmtd: [29798]: ERROR: Cannot sign on with heartbeat
> > Jan 23 19:19:17 ops mgmtd: [29798]: ERROR: REASON:
> > Jan 23 19:19:17 ops mgmtd: [29798]: ERROR: Can't initialize management library.Shutting down.(-1)
> > Jan 23 19:19:17 ops heartbeat: [29647]: WARN: Client [mgmtd] pid 29798 failed authorization [no default client auth]
> > Jan 23 19:19:17 ops heartbeat: [29647]: ERROR: api_process_registration_msg: cannot add client(mgmtd)
> >
> > This also happened on the SLES10.

Tss, this was accidentally sent to me instead of the list.

Sebastian
Re: [Linux-HA] 2.1.3 suse rpm's?
Andrew Beekhof <[EMAIL PROTECTED]> wrote: > > On Jan 22, 2008, at 9:31 AM, Sebastian Reitenbach wrote: > > > Hi, > > > > General Linux-HA mailing list wrote: > >> > >> On Jan 21, 2008, at 5:09 PM, Andrew Beekhof wrote: > >> > >>> > >>> On Jan 21, 2008, at 3:52 PM, matilda matilda wrote: > >>> > >>>>>>> "Sebastian Reitenbach" <[EMAIL PROTECTED]> > >>>>>>> 21.01.2008 15:21 >>> > >>>>> yes, that helped, I installed both rpm's, but now, when I want to > >>>>> start the > >>>>> hb_gui, I get the following error: > >>>>> > >>>>> Traceback (most recent call last): > >>>>> File "/usr/bin/hb_gui", line 35, in ? > >>>>> from pymgmt import * > >>>>> ImportError: No module named pymgmt > >>>>> > >>>>> These files are installed: > >>>>> find /usr/ -name "*pymgmt*" > >>>>> /usr/lib64/heartbeat-gui/pymgmt.pyc > >>>>> /usr/lib64/heartbeat-gui/_pymgmt.so.0 > >>>>> /usr/lib64/heartbeat-gui/_pymgmt.so.0.0.0 > >>>>> /usr/lib64/heartbeat-gui/pymgmt.py > >>>>> I also created a symbolic link from _pymgmt.so.0 to _pymgmt.so, > >>>>> because I > >>>>> found a _pymgmt.so file in the same directory, that was installed > >>>>> on another > >>>>> SLES box with heartbeat 2.1.2, but that did not helped. > >>>>> so I removed the 2.1.3 rpm's and installed the 2.1.2, and the > >>>>> hb_gui is > >>>>> working on that box, so at least no basic python stuff seems to be > >>>>> missing. > >>>>> > >>>>> So there must still sth. missing to get the GUI working again, any > >>>>> more > >>>>> idea? > >>>> > >>>> Hi Sebastian, hi Andrew, hi all, > >>>> > >>>> I looked at /usr/bin/hb_gui of version 2.1.3 packed by Andrew. > >>>> If you look at line 33,34,35 you'll see that the build process > >>>> didn't replace the build environment variables @HA_DATADIR@ > >>>> and @HA_LIBDIR@ by their values. > >>> > >>> ah, well spotted. i'll get them fixed. 
> >> > >> pushing up some new packages now - give them a moment to rebuild > >> (fyi: Fedora x86_64 is currently not able to build due to a build > >> service problem - just grab an i386 src.rpm and do a rpm rebuild) > >> > > Ok, I just tried these rpm's: > > pacemaker-pygui-1.1-2.1 > > pacemaker-heartbeat-0.6.0-15.1 > > heartbeat-common-2.1.3-3.2 > > heartbeat-resources-2.1.3-3.2 > > heartbeat-ldirectord-2.1.3-3.1 > > heartbeat-2.1.3-3.2 > > > > now I get the following error message when I try to start hb_gui: > > hb_gui > > Traceback (most recent call last): > > File "/usr/bin/hb_gui", line 29, in ? > >from xml.parsers.xmlproc.xmldtd import load_dtd_string > > ImportError: No module named xmlproc.xmldtd > > I think you need the pyxml package for this > > Can you confirm that for me? If so I'll add it to the spec file as a > dependancy. I had no time to check today, I had to downgrade to 2.1.2 yesterday, to keep me going. What I can say is that pyxml was not installed. The 2.1.2 is working, so this must be a new dependency then, but I think there were some changes to the GUI, regarding parsing the dtd, so you might be right. I hope I find some time tomorrow to retest. Sebastian ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] 2.1.3 suse rpm's?
Hi,

Andrew Beekhof <[EMAIL PROTECTED]> wrote:
>
> On Jan 22, 2008, at 9:35 AM, Sebastian Reitenbach wrote:
>
> > deinstalling heartbeat-2.1.3-3.2 only works with --noscripts because of the
> > following error:
> > /usr/lib64/heartbeat/heartbeat: error while loading shared libraries:
> > libstonith.so.1: cannot open shared object file: No such file or directory
> > ..failed
> > error: %preun(heartbeat-2.1.3-3.2.x86_64) scriptlet failed, exit status 1
>
> hmmm
>
> I see
> %restart_on_update heartbeat
>
> in the postun section which looks suspicious, but preun looks sane
> enough:
>
> %preun
> %if 0%{?suse_version}
>   %stop_on_removal heartbeat
> %endif
> %if 0%{?fedora_version}
>   /sbin/chkconfig --del heartbeat
> %endif
>
> was heartbeat-common still installed at the time?

First I tried to deinstall all packages together via rpm -e ...; that failed for the heartbeat- package, while the rest was removed. Then I retried deinstalling heartbeat- on its own, but it failed again with the above error. Only rpm -e --noscripts helped.

Sebastian
Re: [Linux-HA] 2.1.3 suse rpm's?
Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA mailing list wrote: > Hi, > > General Linux-HA mailing list wrote: > > > > On Jan 21, 2008, at 5:09 PM, Andrew Beekhof wrote: > > > > > > > > On Jan 21, 2008, at 3:52 PM, matilda matilda wrote: > > > > > >>>>> "Sebastian Reitenbach" <[EMAIL PROTECTED]> > > >>>>> 21.01.2008 15:21 >>> > > >>> yes, that helped, I installed both rpm's, but now, when I want to > > >>> start the > > >>> hb_gui, I get the following error: > > >>> > > >>> Traceback (most recent call last): > > >>> File "/usr/bin/hb_gui", line 35, in ? > > >>> from pymgmt import * > > >>> ImportError: No module named pymgmt > > >>> > > >>> These files are installed: > > >>> find /usr/ -name "*pymgmt*" > > >>> /usr/lib64/heartbeat-gui/pymgmt.pyc > > >>> /usr/lib64/heartbeat-gui/_pymgmt.so.0 > > >>> /usr/lib64/heartbeat-gui/_pymgmt.so.0.0.0 > > >>> /usr/lib64/heartbeat-gui/pymgmt.py > > >>> I also created a symbolic link from _pymgmt.so.0 to _pymgmt.so, > > >>> because I > > >>> found a _pymgmt.so file in the same directory, that was installed > > >>> on another > > >>> SLES box with heartbeat 2.1.2, but that did not helped. > > >>> so I removed the 2.1.3 rpm's and installed the 2.1.2, and the > > >>> hb_gui is > > >>> working on that box, so at least no basic python stuff seems to be > > >>> missing. > > >>> > > >>> So there must still sth. missing to get the GUI working again, any > > >>> more > > >>> idea? > > >> > > >> Hi Sebastian, hi Andrew, hi all, > > >> > > >> I looked at /usr/bin/hb_gui of version 2.1.3 packed by Andrew. > > >> If you look at line 33,34,35 you'll see that the build process > > >> didn't replace the build environment variables @HA_DATADIR@ > > >> and @HA_LIBDIR@ by their values. > > > > > > ah, well spotted. i'll get them fixed. 
> > > > pushing up some new packages now - give them a moment to rebuild > > (fyi: Fedora x86_64 is currently not able to build due to a build > > service problem - just grab an i386 src.rpm and do a rpm rebuild) > > > Ok, I just tried these rpm's: > pacemaker-pygui-1.1-2.1 > pacemaker-heartbeat-0.6.0-15.1 > heartbeat-common-2.1.3-3.2 > heartbeat-resources-2.1.3-3.2 > heartbeat-ldirectord-2.1.3-3.1 > heartbeat-2.1.3-3.2 > > now I get the following error message when I try to start hb_gui: > hb_gui > Traceback (most recent call last): > File "/usr/bin/hb_gui", line 29, in ? > from xml.parsers.xmlproc.xmldtd import load_dtd_string > ImportError: No module named xmlproc.xmldtd > > I am on SLES10SP1 x86_64 deinstalling heartbeat-2.1.3-3.2 only works with --noscripts because of the following error: /usr/lib64/heartbeat/heartbeat: error while loading shared libraries: libstonith.so.1: cannot open shared object file: No such file or directory ..failed error: %preun(heartbeat-2.1.3-3.2.x86_64) scriptlet failed, exit status 1 Sebastian ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] 2.1.3 suse rpm's?
Hi, General Linux-HA mailing list wrote: > > On Jan 21, 2008, at 5:09 PM, Andrew Beekhof wrote: > > > > > On Jan 21, 2008, at 3:52 PM, matilda matilda wrote: > > > >>>>> "Sebastian Reitenbach" <[EMAIL PROTECTED]> > >>>>> 21.01.2008 15:21 >>> > >>> yes, that helped, I installed both rpm's, but now, when I want to > >>> start the > >>> hb_gui, I get the following error: > >>> > >>> Traceback (most recent call last): > >>> File "/usr/bin/hb_gui", line 35, in ? > >>> from pymgmt import * > >>> ImportError: No module named pymgmt > >>> > >>> These files are installed: > >>> find /usr/ -name "*pymgmt*" > >>> /usr/lib64/heartbeat-gui/pymgmt.pyc > >>> /usr/lib64/heartbeat-gui/_pymgmt.so.0 > >>> /usr/lib64/heartbeat-gui/_pymgmt.so.0.0.0 > >>> /usr/lib64/heartbeat-gui/pymgmt.py > >>> I also created a symbolic link from _pymgmt.so.0 to _pymgmt.so, > >>> because I > >>> found a _pymgmt.so file in the same directory, that was installed > >>> on another > >>> SLES box with heartbeat 2.1.2, but that did not helped. > >>> so I removed the 2.1.3 rpm's and installed the 2.1.2, and the > >>> hb_gui is > >>> working on that box, so at least no basic python stuff seems to be > >>> missing. > >>> > >>> So there must still sth. missing to get the GUI working again, any > >>> more > >>> idea? > >> > >> Hi Sebastian, hi Andrew, hi all, > >> > >> I looked at /usr/bin/hb_gui of version 2.1.3 packed by Andrew. > >> If you look at line 33,34,35 you'll see that the build process > >> didn't replace the build environment variables @HA_DATADIR@ > >> and @HA_LIBDIR@ by their values. > > > > ah, well spotted. i'll get them fixed. 
> > pushing up some new packages now - give them a moment to rebuild > (fyi: Fedora x86_64 is currently not able to build due to a build > service problem - just grab an i386 src.rpm and do a rpm rebuild) > Ok, I just tried these rpm's: pacemaker-pygui-1.1-2.1 pacemaker-heartbeat-0.6.0-15.1 heartbeat-common-2.1.3-3.2 heartbeat-resources-2.1.3-3.2 heartbeat-ldirectord-2.1.3-3.1 heartbeat-2.1.3-3.2 now I get the following error message when I try to start hb_gui: hb_gui Traceback (most recent call last): File "/usr/bin/hb_gui", line 29, in ? from xml.parsers.xmlproc.xmldtd import load_dtd_string ImportError: No module named xmlproc.xmldtd I am on SLES10SP1 x86_64 Sebastian ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: Re: [Linux-HA] 2.1.3 suse rpm's?
"matilda matilda" <[EMAIL PROTECTED]> wrote: > >>> "Sebastian Reitenbach" <[EMAIL PROTECTED]> 21.01.2008 15:21 >>> > > yes, that helped, I installed both rpm's, but now, when I want to start the > > hb_gui, I get the following error: > > > >Traceback (most recent call last): > > File "/usr/bin/hb_gui", line 35, in ? > >from pymgmt import * > >ImportError: No module named pymgmt > > > >These files are installed: > >find /usr/ -name "*pymgmt*" > >/usr/lib64/heartbeat-gui/pymgmt.pyc > >/usr/lib64/heartbeat-gui/_pymgmt.so.0 > >/usr/lib64/heartbeat-gui/_pymgmt.so.0.0.0 > >/usr/lib64/heartbeat-gui/pymgmt.py > >I also created a symbolic link from _pymgmt.so.0 to _pymgmt.so, because I > >found a _pymgmt.so file in the same directory, that was installed on another > >SLES box with heartbeat 2.1.2, but that did not helped. > >so I removed the 2.1.3 rpm's and installed the 2.1.2, and the hb_gui is > >working on that box, so at least no basic python stuff seems to be missing. > > > >So there must still sth. missing to get the GUI working again, any more > >idea? > > Hi Sebastian, hi Andrew, hi all, > > I looked at /usr/bin/hb_gui of version 2.1.3 packed by Andrew. > If you look at line 33,34,35 you'll see that the build process > didn't replace the build environment variables @HA_DATADIR@ > and @HA_LIBDIR@ by their values. > Without that python inlcude path the modules necessary for > the rest are not found. That's the reason for the import error. > Version 2.1.2 does have for 32bit: > -8< > sys.path.append("/usr/share/heartbeat-gui") > sys.path.append("/usr/lib/heartbeat-gui") > from pymgmt import * > -8< > > Thanks for that hint, below these lines, I found a lot more @HA_DATADIR@, replacing all of them, and replacing the @HA_LIBDIR@ with /usr/lib64. I also had to create a symbolic link in /usr/lib64/heartbeat-gui, from _pymgmt.so.0 to _pymgmt.so, after doing all that, the GUI started up. 
cheers Sebastian
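The manual fix described in this message amounts to filling in the build placeholders that were left unexpanded in /usr/bin/hb_gui. A minimal sketch of that substitution follows. The template lines are an assumption modelled on the 2.1.2 lines quoted above, /usr/share for @HA_DATADIR@ is inferred from those same lines, and /usr/lib64 for @HA_LIBDIR@ is the value used on the x86_64 box. The helper name is hypothetical.

```python
def fill_placeholders(text, values):
    """Replace each @NAME@ placeholder in `text` with its configured value."""
    for name, value in values.items():
        text = text.replace("@%s@" % name, value)
    return text


# Assumed shape of the unexpanded lines in the packaged script.
template = (
    'sys.path.append("@HA_DATADIR@/heartbeat-gui")\n'
    'sys.path.append("@HA_LIBDIR@/heartbeat-gui")\n'
)

fixed = fill_placeholders(
    template,
    {"HA_DATADIR": "/usr/share", "HA_LIBDIR": "/usr/lib64"},
)
print(fixed)
```

The _pymgmt.so symbolic link mentioned above is a separate step and still has to be created by hand in the heartbeat-gui library directory.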
Re: [Linux-HA] 2.1.3 suse rpm's?
Hi, Andrew Beekhof <[EMAIL PROTECTED]> wrote: > > On Jan 21, 2008, at 12:52 PM, Sebastian Reitenbach wrote: > > > Hi, > > > > Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA > > mailing list wrote: > >> Hi, > >> Andrew Beekhof <[EMAIL PROTECTED]> wrote: > >>> > >>> On Jan 10, 2008, at 6:16 PM, Sebastian Reitenbach wrote: > >>> > >>>> Hi, > >>>> > >>>> on the download area, there are the pointers to the suse build > >>>> service, > >>>> providing rpms for opensuse versions and others. But there is still > >>>> only > >>>> heartbeat-2.1.2- sth. > >>>> Is this intentional or is there sth. wrong? > >>> > >>> they're pretty close to what ended up in 2.1.3 > >>> i'll update them shortly when I do the first pacemaker release > >>> > >> ah, ok, that's fine. > > I just wanted use the rpm packages for heartbeat 2.1.3 on SLES10SP1. > > > > I installed these rpm packets: > > heartbeat-2.1.3-3.1.x86_64.rpm > > heartbeat-ldirectord-2.1.3-3.1.x86_64.rpm > > heartbeat-common-2.1.3-3.1.x86_64.rpm > > heartbeat-resources-2.1.3-3.1.x86_64.rpm > > > > and had to find out, that the heartbeat-gui and the crm stuff, and > > maybe > > more, seems to be missing. Is that intentionally left out, as the > > rpm names > > also changed a bit? I thought the heartbeat 2.1.3 version is the > > last one > > where the crm is still in heartbeat? > > Now that Pacemaker 0.6.0 is out, the built-in CRM is no longer > supported (all bugs will be fixed in Pacemaker). > Thus the Heartbeat packages on the build service are built without the > built-in CRM. > Check the changelog in the Heartbeat package for exactly what is no > longer included. yeah, my fault to not read that file ;) > > One thing I like about .deb is that you can recommend other packages > to install. > This would have alerted to you that something was missing - alas there > is no such mechanism for rpm. > > > Do I have to install the pacemaker-* and openais* rpm's to get the > > functionality back? 
How well is the pacemaker-* stuff tested, is > > that ready > > for production systems, or should I better stay with a heartbeat > > 2.1.2 or > > the versions that install with SLES10SP1? > > If you only wish to use the heartbeat stack, you only need one extra > rpm: pacemaker-heartbeat yeah, just basically the same functionality as with 2.1.2 is wanted. > > It contains essentially the same CRM code that was in 2.1.3 and is no > more/less production ready than what was in 2.1.3. > You can see the testing criteria for releases at: > http://www.clusterlabs.org/mw/Release_Testing > > For the GUI, you'll need the pacemaker-pygui package. > > The list of packages is described at: > http://www.clusterlabs.org/mw/Install#Package_List > > hope that helps yes, that helped, I installed both rpm's, but now, when I want to start the hb_gui, I get the following error: Traceback (most recent call last): File "/usr/bin/hb_gui", line 35, in ? from pymgmt import * ImportError: No module named pymgmt These files are installed: find /usr/ -name "*pymgmt*" /usr/lib64/heartbeat-gui/pymgmt.pyc /usr/lib64/heartbeat-gui/_pymgmt.so.0 /usr/lib64/heartbeat-gui/_pymgmt.so.0.0.0 /usr/lib64/heartbeat-gui/pymgmt.py I also created a symbolic link from _pymgmt.so.0 to _pymgmt.so, because I found a _pymgmt.so file in the same directory, that was installed on another SLES box with heartbeat 2.1.2, but that did not helped. so I removed the 2.1.3 rpm's and installed the 2.1.2, and the hb_gui is working on that box, so at least no basic python stuff seems to be missing. So there must still sth. missing to get the GUI working again, any more idea? cheers Sebastian ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] 2.1.3 suse rpm's?
Hi,

Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Mon, Jan 21, 2008 at 12:52:37PM +0100, Sebastian Reitenbach wrote:
> > Hi,
> >
> > Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA mailing list wrote:
> > > Hi,
> > > Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> > > >
> > > > On Jan 10, 2008, at 6:16 PM, Sebastian Reitenbach wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > on the download area, there are the pointers to the suse build service,
> > > > > providing rpms for opensuse versions and others. But there is still only
> > > > > heartbeat-2.1.2- sth.
> > > > > Is this intentional or is there sth. wrong?
> > > >
> > > > they're pretty close to what ended up in 2.1.3
> > > > i'll update them shortly when I do the first pacemaker release
> > > >
> > > ah, ok, that's fine.
> > I just wanted use the rpm packages for heartbeat 2.1.3 on SLES10SP1.
> >
> > I installed these rpm packets:
> > heartbeat-2.1.3-3.1.x86_64.rpm
> > heartbeat-ldirectord-2.1.3-3.1.x86_64.rpm
> > heartbeat-common-2.1.3-3.1.x86_64.rpm
> > heartbeat-resources-2.1.3-3.1.x86_64.rpm
> >
> > and had to find out, that the heartbeat-gui and the crm stuff, and maybe
> > more, seems to be missing. Is that intentionally left out, as the rpm names
> > also changed a bit?
>
> The gui should be in a separate package. Isn't there such a
> package?

Under http://download.opensuse.org/repositories/server:/ha-clustering/SLES_10/x86_64/ there is only pacemaker-pygui-1.0.0-5.4.x86_64.rpm. I downloaded it and tried to install it via rpm, but libcrm and two other dependencies are missing.

Sebastian
Re: [Linux-HA] 2.1.3 suse rpm's?
Hi,

Sebastian Reitenbach <[EMAIL PROTECTED]>, General Linux-HA mailing list wrote:
> Hi,
> Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >
> > On Jan 10, 2008, at 6:16 PM, Sebastian Reitenbach wrote:
> >
> > > Hi,
> > >
> > > on the download area, there are the pointers to the suse build
> > > service, providing rpms for opensuse versions and others. But
> > > there is still only heartbeat-2.1.2- sth.
> > > Is this intentional or is there sth. wrong?
> >
> > they're pretty close to what ended up in 2.1.3
> > i'll update them shortly when I do the first pacemaker release
> >
> ah, ok, that's fine.

I just wanted to use the rpm packages for heartbeat 2.1.3 on SLES10SP1.

I installed these rpm packages:
heartbeat-2.1.3-3.1.x86_64.rpm
heartbeat-ldirectord-2.1.3-3.1.x86_64.rpm
heartbeat-common-2.1.3-3.1.x86_64.rpm
heartbeat-resources-2.1.3-3.1.x86_64.rpm

and had to find out that the heartbeat-gui and the crm stuff, and maybe more, seem to be missing. Is that intentionally left out, as the rpm names also changed a bit?

I thought heartbeat 2.1.3 was the last version where the crm is still part of heartbeat? Do I have to install the pacemaker-* and openais* rpms to get the functionality back? How well is the pacemaker-* stuff tested? Is it ready for production systems, or should I better stay with heartbeat 2.1.2 or the versions that ship with SLES10SP1?

thanks
Sebastian
Re: [Linux-HA] cl_status listnodes does not honour -n or -p on OpenBSD
Hi,

Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Wed, Jan 16, 2008 at 10:23:17AM +0100, Sebastian Reitenbach wrote:
> > Hi,
> >
> > I wanted to use cl_status to get a list of the cluster nodes, without the
> > ping nodes in the list, on OpenBSD, but unfortunately
> > cl_status listnodes -n
> > shows the ping nodes too. With the -p parameter, all nodes are shown as well.
> > I use heartbeat 2.1.3 on OpenBSD.
> > I tried the same on a SLES10 with heartbeat 2.1.2 installed. There this
> > command is working as documented.
> > Is this generally working with heartbeat 2.1.3, or is it just a problem with
> > OpenBSD?
> > Could anybody test on a different OS with HB 2.1.3 and let me know whether
> > it works?
>
> On openSUSE 10.3 it works without problems. Looks like it's
> OpenBSD specific.

Thanks a lot for testing, then it's my problem. I'll open a bug report and will try to fix it.

Sebastian
[Linux-HA] cl_status listnodes does not honour -n or -p on OpenBSD
Hi,

I wanted to use cl_status to get a list of the cluster nodes, without the ping nodes in the list, on OpenBSD, but unfortunately

cl_status listnodes -n

shows the ping nodes too. With the -p parameter, all nodes are shown as well. I use heartbeat 2.1.3 on OpenBSD. I tried the same on a SLES10 with heartbeat 2.1.2 installed. There this command is working as documented.

Is this generally working with heartbeat 2.1.3, or is it just a problem with OpenBSD? Could anybody test on a different OS with HB 2.1.3 and let me know whether it works?

kind regards
Sebastian
Re: [Linux-HA] 2.1.3 suse rpm's?
Hi,

Andrew Beekhof <[EMAIL PROTECTED]> wrote:
>
> On Jan 10, 2008, at 6:16 PM, Sebastian Reitenbach wrote:
>
> > Hi,
> >
> > on the download area, there are the pointers to the suse build
> > service, providing rpms for opensuse versions and others. But
> > there is still only heartbeat-2.1.2- sth.
> > Is this intentional or is there sth. wrong?
>
> they're pretty close to what ended up in 2.1.3
> i'll update them shortly when I do the first pacemaker release
>
ah, ok, that's fine.

thanks
Sebastian
[Linux-HA] 2.1.3 suse rpm's?
Hi,

in the download area, there are pointers to the suse build service, providing rpms for opensuse versions and others. But there is still only heartbeat-2.1.2- sth. Is this intentional or is there sth. wrong?

Sebastian
Re: [Linux-HA] question regarding quorumd
Hi,

Zhen Huang <[EMAIL PROTECTED]> wrote:
> Hi,
>
> The DC node should try to connect to the quorumd server periodically.
> If not, it should be a bug.

I first observed this behavior on a two node Linux cluster. I just did some more tests with a two node OpenBSD cluster and the quorumd on a Linux box. This is what I observed:

Test 1:
- configure usage of quorumd on the two heartbeat nodes
- start quorumd on the Linux node
- start the first cluster node; it starts communicating with quorumd, gets quorum, and I can start managing resources
- start the second cluster node, and everything is still working well
- stop the quorumd
- the DC still sends packets to the quorumd for about a minute, then stops and never starts again; the other node does not start trying to contact the quorumd either
- then kill one of the cluster nodes; the remaining node tries to contact the quorumd, fails because it is not running, and is left without quorum

Test 2:
- configure usage of quorumd on the two heartbeat nodes
- do NOT start quorumd on the Linux node
- start the first cluster node and see it fail to contact quorumd; it starts up the cluster without quorum (it only sends one packet to the quorumd, receives an RST packet, and seems to never try again)
- start the second cluster node; this seems to trigger the DC to retry contacting the quorumd (again, only one packet, then nothing more)
- both cluster nodes then together decide that the cluster runs without quorum. Shouldn't the two cluster nodes be enough to acquire quorum?
- start the quorumd on the Linux box
- wait forever and see that the cluster nodes do not try to contact the quorumd again; therefore the cluster keeps thinking it has no quorum at all

As said, last week I initially observed this on a two node Linux test cluster with a third node running a quorumd, so it does not seem to be OS related.
kind regards Sebastian > > Sebastian Reitenbach wrote: > > Hi, > > > > Andrew Beekhof <[EMAIL PROTECTED]> wrote: > >> On Nov 13, 2007, at 11:13 AM, Sebastian Reitenbach wrote: > >> > >>> Hi, > >>> > >>> Andrew Beekhof <[EMAIL PROTECTED]> wrote: > >>>> On Nov 9, 2007, at 4:34 PM, Sebastian Reitenbach wrote: > >>>> > >>>>> Hi, > >>>>> > >>>>> I did some tests with a two node cluster and a third one running a > >>>>> quorumd. > >>>>> > >>>>> I started the quorumd, and then the two cluster nodes. > >>>>> The one that became DC, started to communicate with the remote > >>>>> quorumd. > >>>> The CRM (and thus the "DC") doesn't know anything about quorumd > >>>> I believe this is purely the domain of the CCM and I've no idea how > >>>> that works :-) > >>>> > >>>> We just consume membership data from it... > >>>> > >>>> So anyway, my point is that the fact that a node is the DC is > >>>> irrelevant when it comes to quorumd. > >>> but somehow the cluster knows, as only the DC is communicating with > >>> the > >>> external quorumd. > >> I think that its just a co-incidence that it happens to be the DC... > >> at least I hope it is. > > I thought I read somewhere, that the DC is the one in charge of > > communicating with the remote quorumd, but I may be wrong here. > > > >>> I just do not understand, why the cluster does not retry > >>> to re-contact the quorumd after it lost connection to it. This was > >>> what I > >>> assumed, after a disconnect to the remote quorumd, the cluster nodes > >>> should > >>> try to contact it, and when the contact is there again, use it again. > >> I agree - but I've never seen that code. You'll have to contact alan > >> or file a bug for him. > > Alan, in case you think this is a bug, I'll go create a bug report for > it. > > Please let me know. > > > >>>>> I killed the DC, saw the other becoming DC, and start communicating > >>>>> to the remote quorumd, all fine, cluster still with quorum. 
> >>>>> Then I killed the quorumd itself, the DC recognized, and started to > >>>>> stop > >>>>> all resource, because of the quorum_policy, as it lost quorum. > >>>>> > >>>>> Then I restarted the quorumd again, but the DC, still without > >>>>> quorum, > >>>>> did not tried to communicate to the
Re: [Linux-HA] question regarding quorumd
Hi,

Zhen Huang <[EMAIL PROTECTED]> wrote:
> Hi,
>
> The DC node should try to connect to the quorumd server periodically.
> If not, it should be a bug.

Thanks for clarifying. I'll retest later today when I'm back at home, and if I can reproduce it, I'll open a bugzilla entry.

kind regards
Sebastian

>
> Alan Robertson <[EMAIL PROTECTED]>
> 11/14/2007 03:13 AM
>
> To
> Sebastian Reitenbach <[EMAIL PROTECTED]>
> cc
> linux-ha@lists.linux-ha.org, Zhen Huang/China/[EMAIL PROTECTED]
> Subject
> Re: [Linux-HA] question regarding quorumd
>
> Sebastian Reitenbach wrote:
> > Hi,
> >
> > Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >> On Nov 13, 2007, at 11:13 AM, Sebastian Reitenbach wrote:
> >>
> >>> Hi,
> >>>
> >>> Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> >>>> On Nov 9, 2007, at 4:34 PM, Sebastian Reitenbach wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I did some tests with a two node cluster and a third one running a
> >>>>> quorumd.
> >>>>>
> >>>>> I started the quorumd, and then the two cluster nodes.
> >>>>> The one that became DC, started to communicate with the remote
> >>>>> quorumd.
> >>>> The CRM (and thus the "DC") doesn't know anything about quorumd
> >>>> I believe this is purely the domain of the CCM and I've no idea how
> >>>> that works :-)
> >>>>
> >>>> We just consume membership data from it...
> >>>>
> >>>> So anyway, my point is that the fact that a node is the DC is
> >>>> irrelevant when it comes to quorumd.
> >>> but somehow the cluster knows, as only the DC is communicating with
> >>> the external quorumd.
> >> I think that its just a co-incidence that it happens to be the DC...
> >> at least I hope it is.
> > I thought I read somewhere, that the DC is the one in charge of
> > communicating with the remote quorumd, but I may be wrong here.
> >
> >>> I just do not understand, why the cluster does not retry
> >>> to re-contact the quorumd after it lost connection to it. This was
> >>> what I assumed, after a disconnect to the remote quorumd, the cluster
> >>> nodes should try to contact it, and when the contact is there again,
> >>> use it again.
> >> I agree - but I've never seen that code. You'll have to contact alan
> >> or file a bug for him.
> > Alan, in case you think this is a bug, I'll go create a bug report for it.
> > Please let me know.
> >
> >>>>> I killed the DC, saw the other becoming DC, and start communicating
> >>>>> to the remote quorumd, all fine, cluster still with quorum.
> >>>>> Then I killed the quorumd itself, the DC recognized, and started to
> >>>>> stop all resources, because of the quorum_policy, as it lost quorum.
> >>>>>
> >>>>> Then I restarted the quorumd again, but the DC, still without
> >>>>> quorum, did not try to communicate with the quorumd again.
> >>>>> I'd expect the still living DC to try to contact the quorumd, in
> >>>>> case it comes back.
> >>>>>
> >>>>> If there is a good reason, why the DC is not trying to reconnect to
> >>>>> the remote quorumd I'd really like to get enlightened from someone
> >>>>> who knows.
>
> It should be trying to reconnect. It _does_ communicate w/quorumd from
> a single machine/cluster. I think that it's coincidence that it's the
> DC. Huang Zhen wrote the code. I've CCed him. I'm at the LISA
> conference this week - if HZ doesn't get back to you by next Monday,
> I'll look into it.
Re: [Linux-HA] Xen & HA-clustering
On Thursday 15 November 2007 03:28:57 sadegh wrote:
> Hi All,
> How I can add xen to an HA-Cluster?

I use some SAN devices presented to my cluster nodes, and have the Xen instances configured to live on them. Then you only need to add a Xen resource to your cluster.

> what is your idea about changing failover mechanism from stop/restart to
> live-migration?

I played a bit with live migration in linux-ha. It works in general, but has some issues. Stop/start takes about 45 seconds, migration takes about 30 seconds. Migration does not work in case of a failover, so it would only be useful during a maintenance window. In my eyes, not much is gained with live migration.

You might want to try the updated Xen resource script for linux-ha:
http://developerbugs.linux-foundation.org//show_bug.cgi?id=1778
It will allow you to monitor services within the Xen domU, and it allows some simple memory management when you start/stop a domU. Comments and test reports are welcome.

> very appreciate to have answer from you!
> Best Regards
> Sadegh Hooshmand

Kind regards
Sebastian
Re: [Linux-HA] ECCN classification for Linux-HA Heartbeat
Hi,

General Linux-HA mailing list wrote:
> On 2007-11-13T14:18:50, "Henriques, Tiago" <[EMAIL PROTECTED]> wrote:
>
> > We are using Linux-HA Heartbeat in one of our products, and are now in
> > the process of collecting the information needed to export it to other
> > countries.
> >
> > In order to do this, can you tell me whether any citizens of the United
> > States of America or people living in the U.S.A. have contributed to the
> > Linux-HA Heartbeat software?
>
> Yes, heavily. US businesses, too.
>
> > Can you also tell me what the U.S. Export Control Classification Number
> > (ECCN) for Linux-HA Heartbeat is, and whether a license exception may be
> > used for it?
>
> No idea. My very limited understanding is that this is merely a
> component and requires an aggregate ECCN.
>
> Heartbeat itself does not appear to be subject to any special export
> restrictions from the US, as it doesn't use nor provide encryption (just
> digital signatures).

Communication between the cluster and quorumd requires the use of X509 certificates. I don't know whether that matters for you.

> I'd recommend that asking a lawyer is the best path forward.

Sebastian
Re: [Linux-HA] migration/fence after fail-count > X
Hi,

Andrew Beekhof <[EMAIL PROTECTED]> wrote:
>
> On Nov 13, 2007, at 1:02 PM, Sebastian Reitenbach wrote:
>
> > Hi,
> >
> > I read in the v2 FAQ the following:
> >
> > What happens when monitor detects the resource down?
> > The node will try to restart the resource, but if this fails, it
> > will fail over to an other node.
> > A feature that allows failover after N failures in a given period of
> > time is planned.
> >
> > Is that feature still planned?
>
> thats how it works already - sort of.
> there is a layer of indirection with resource-failcount-stickiness,
> but basically once failcount hits a threshold - the resource moves.
>
> knowing what to set resource-failcount-stickiness to can be tricky.
> one of the easiest, i can turn my brain off, ways is:
> 1) to start the cluster and make sure everything is running
> 2) figure out the current score (see conversations regarding the
> getscores.sh script that has been posted here)

Ah, I need to look for that.

> 3) divide said score by X and add 1
>
> > Could it also be instead of failover, fence the node X when
> > failcount > X?
>
> no, at least not yet anyway
>
> interesting idea though

I think that would be a viable option for resources that could get damaged or cause confusion when started multiple times in a cluster, e.g. Xen domUs, non cluster aware filesystems, IP addresses...

> > Or is that working already, and the FAQ is not updated?
> > At least when I see this:
> > http://www.linux-ha.org/v2/faq/forced_failover
> > It seems to work already, but only in combination with moving a
> > resource to another location, but not to be used to fence a node
> > after a critical fail-count is reached.
> > I've seen the fail_count utility, and tried to find examples on the
> > webpage, but that search was not too exhaustive.
> >
> > Also, can the fail-count of different resources be summed up to make a
> > decision in combination with fencing? E.g. Resources A, B, C...
> > The failcount of A=3, + B=4 = SUM=7 > 6, then fence the node where
> > that limit is reached.
>
> as above. not at the moment
>
Thanks for the input. I'll open enhancement requests in the bugzilla later today for the two things that are not possible yet.

kind regards
Sebastian
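The recipe in step 3 above ("divide said score by X and add 1") can be tried out with a few lines of Python. This is only a sketch of the arithmetic: it assumes that each failure lowers the resource's score on its current node by a fixed penalty and that the resource moves once the score drops below zero. The function names and the example score of 300 are made up for illustration and are not part of heartbeat.

```python
def failure_stickiness(score, max_failures):
    """Per-failure penalty following "divide said score by X and add 1".

    Assumption: the penalty is applied once per failure, so it is
    returned as a negative number.
    """
    if score <= 0 or max_failures <= 0:
        raise ValueError("score and max_failures must be positive")
    return -(score // max_failures + 1)


def score_after_failures(score, penalty, failures):
    """Remaining score after a number of failures (assumed linear model)."""
    return score + penalty * failures


# Hypothetical example: current score 300, move after X = 3 failures.
penalty = failure_stickiness(300, 3)
for failures in range(4):
    print(failures, score_after_failures(300, penalty, failures))
```

With these example numbers the score stays positive for the first two failures and only turns negative on the third, which is the threshold behaviour described in the reply.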
[Linux-HA] migration/fence after fail-count > X
Hi,

I read the following in the v2 FAQ:

  What happens when monitor detects the resource down?
  The node will try to restart the resource, but if this fails, it will fail over to an other node.
  A feature that allows failover after N failures in a given period of time is planned.

Is that feature still planned? Could it also, instead of failing over, fence the node when failcount > X?

Or is that working already, and the FAQ is not updated? At least when I see this:
http://www.linux-ha.org/v2/faq/forced_failover
it seems to work already, but only in combination with moving a resource to another location, and not to fence a node after a critical fail-count is reached. I've seen the fail_count utility, and tried to find examples on the webpage, but that search was not too exhaustive.

Also, can the fail-counts of different resources be summed up to make a decision in combination with fencing? E.g. resources A, B, C... With the failcount of A=3 and B=4, the sum is 7 > 6, so fence the node where that limit is reached.

Kind regards
Sebastian
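The summed fail-count rule asked about above can be written down in a few lines of Python. This only illustrates the arithmetic of the proposal. It is not an existing heartbeat feature, and the function name and the limit of 6 are simply taken from the example for illustration.

```python
def should_fence(failcounts, limit):
    """Return True when the summed fail-counts on a node exceed the limit.

    failcounts maps a resource name to its fail-count on that node.
    Hypothetical policy, not something heartbeat implements.
    """
    return sum(failcounts.values()) > limit


# Example from the question: A=3, B=4, limit 6.
print(should_fence({"A": 3, "B": 4}, 6))
print(should_fence({"A": 3, "B": 3}, 6))
```

The first call crosses the limit (7 > 6), the second does not (6 is not greater than 6), so the comparison is strict, as written in the question.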
Re: [Linux-HA] log warnings, but when I check no error seems to be there
Andrew Beekhof <[EMAIL PROTECTED]> wrote:
>
> On Nov 13, 2007, at 10:36 AM, Sebastian Reitenbach wrote:
>
> > Hi,
> >
> > I see a lot of these messages in my logfile:
> >
> > pengine[12757]: 2007/11/13_10:27:02 WARN: process_pe_message: Transition 7687: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-warn-8072.bz2
> > pengine[12757]: 2007/11/13_10:27:02 info: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
> >
> > but when I check crm_verify -L then nothing shows up, I also did a:
> > bzcat /var/lib/heartbeat/pengine/pe-warn-8072.bz2 | crm_verify -p
> >
> > this command also produced no output.
> >
> > I am in a two node cluster, where one node is stopped, maybe that is
> > the reason?
> > What else could I do to figure out what the cluster thinks that a
> > problem is.
>
> some warnings can only be determined when doing a full simulation (ie.
> like ptest does)
> unfortunately crm_verify doesn't always have the status section and so
> can't do a full simulation.
>
> though when called with -L it would... i'll fix that for the next
> version
>
Ah, that's great, thank you.

Sebastian
Re: [Linux-HA] question regarding quorumd
Hi, Andrew Beekhof <[EMAIL PROTECTED]> wrote: > > On Nov 13, 2007, at 11:13 AM, Sebastian Reitenbach wrote: > > > Hi, > > > > Andrew Beekhof <[EMAIL PROTECTED]> wrote: > >> > >> On Nov 9, 2007, at 4:34 PM, Sebastian Reitenbach wrote: > >> > >>> Hi, > >>> > >>> I did some tests with a two node cluster and a third one running a > >>> quorumd. > >>> > >>> I started the quorumd, and then the two cluster nodes. > >>> The one that became DC, started to communicate with the remote > >>> quorumd. > >> > >> The CRM (and thus the "DC") doesn't know anything about quorumd > >> I believe this is purely the domain of the CCM and I've no idea how > >> that works :-) > >> > >> We just consume membership data from it... > >> > >> So anyway, my point is that the fact that a node is the DC is > >> irrelevant when it comes to quorumd. > > but somehow the cluster knows, as only the DC is communicating with > > the > > external quorumd. > > I think that its just a co-incidence that it happens to be the DC... > at least I hope it is. I thought I read somewhere, that the DC is the one in charge of communicating with the remote quorumd, but I may be wrong here. > > > I just do not understand, why the cluster does not retry > > to re-contact the quorumd after it lost connection to it. This was > > what I > > assumed, after a disconnect to the remote quorumd, the cluster nodes > > should > > try to contact it, and when the contact is there again, use it again. > > I agree - but I've never seen that code. You'll have to contact alan > or file a bug for him. Alan, in case you think this is a bug, I'll go create a bug report for it. Please let me know. > > >>> I killed the DC, saw the other becoming DC, and start communicating > >>> to the remote quorumd, all fine, cluster still with quorum. > >>> Then I killed the quorumd itself, the DC recognized, and started to > >>> stop > >>> all resource, because of the quorum_policy, as it lost quorum. 
> >>>
> >>> Then I restarted the quorumd again, but the DC, still without
> >>> quorum, did not try to communicate with the quorumd again.
> >>> I'd expect the still living DC to try to contact the quorumd, in
> >>> case it comes back.
> >>>
> >>> If there is a good reason, why the DC is not trying to reconnect to
> >>> the remote quorumd I'd really like to get enlightened from someone
> >>> who knows.
> >>>

kind regards
Sebastian
Re: [Linux-HA] question regarding quorumd
Hi, Andrew Beekhof <[EMAIL PROTECTED]> wrote: > > On Nov 9, 2007, at 4:34 PM, Sebastian Reitenbach wrote: > > > Hi, > > > > I did some tests with a two node cluster and a third one running a > > quorumd. > > > > I started the quorumd, and then the two cluster nodes. > > The one that became DC, started to communicate with the remote > > quorumd. > > The CRM (and thus the "DC") doesn't know anything about quorumd > I believe this is purely the domain of the CCM and I've no idea how > that works :-) > > We just consume membership data from it... > > So anyway, my point is that the fact that a node is the DC is > irrelevant when it comes to quorumd. but somehow the cluster knows, as only the DC is communicating with the external quorumd. I just do not understand, why the cluster does not retry to re-contact the quorumd after it lost connection to it. This was what I assumed, after a disconnect to the remote quorumd, the cluster nodes should try to contact it, and when the contact is there again, use it again. kind regards Sebastian > > > > > I killed the DC, saw the other becoming DC, and start communicating > > to the remote quorumd, all fine, cluster still with quorum. > > Then I killed the quorumd itself, the DC recognized, and started to > > stop > > all resource, because of the quorum_policy, as it lost quorum. > > > > Then I restarted the quorumd again, but the DC, still without quorum, > > did not tried to communicate to the quorumd again. > > I'd expect the still living DC to try to contact the quorumd, in > > case it > > comes back. > > > > If there is a good reason, why the DC is not trying to reconnect to > > the > > remote quorumd I'd really like to get enlightened from someone who > > knows. 
> >
> > kind regards
> > Sebastian
[Linux-HA] log warnings, but when I check no error seems to be there
Hi,

I see a lot of these messages in my logfile:

pengine[12757]: 2007/11/13_10:27:02 WARN: process_pe_message: Transition 7687: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-warn-8072.bz2
pengine[12757]: 2007/11/13_10:27:02 info: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.

But when I run crm_verify -L, nothing shows up. I also ran:

bzcat /var/lib/heartbeat/pengine/pe-warn-8072.bz2 | crm_verify -p

This command also produced no output.

I am in a two node cluster where one node is stopped, maybe that is the reason? What else could I do to figure out what the cluster considers to be a problem?

I am using heartbeat 2.1.2-4.1 on opensuse 10.2 x86_64.

kind regards
Sebastian
Re: [Linux-HA] pingd removed transient attr from node attributes after short network outage, and did not recreate it
Hi,

> This is what happened: Due to a membership issue which has been
> only recently resolved, the crmd/cib combo would jointly leave
> the cluster. Other cib clients were supposed to follow in order
> to be started again by the master process and then connect to the
> new cib instance. But attrd doesn't have such a feature. There's
> now a bugzilla for that:
>
> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1776
>
> In the meantime, you could try with a newer heartbeat version.

Thanks, I'll retest when the next interim version is out.

> Thanks for the report.
>
No problem, I only wanted to know whether I am right, or the cluster ;)

kind regards
Sebastian
[Linux-HA] pingd removed transient attr from node attributes after short network outage, and did not recreate it
Hi,

I did some more tests with my two node cluster, regarding pingd.

I started the two node cluster. Both nodes came up, and the resources were distributed as the location constraints define. The location of the Xen resources depends on the pingd attributes. Then, on the only ping node, I flushed the state tables and only allowed pings from the host ppsdb101. I saw the Xen resources moving, everything great. I changed the firewall on the ping node to only allow pings from the ppsnfs101 host. All four Xen resources moved over to the ppsnfs101 host, as expected.

At 16:17 I disabled both switch ports to which the nodes are connected. A real-life use case would be:

1. non-redundant network layout
2. no stonith, or stonith over the network (e.g. ilo or ssh)
3. someone removes power from the switch where both nodes are connected

Then I waited about 10 seconds and enabled both ports again. RSTP took some more seconds to reconverge. After that, both nodes could communicate with each other again and the pings reached the ping node again, but the transient attributes that pingd sets on the nodes were both gone.

Before I disabled the ports, I issued

cibadmin -Q -o status | grep ping

and the two lines, one for each host, showed up. After disconnecting and reconnecting both hosts, rerunning the cibadmin command showed me that both attribute lines were gone. I waited for about 5-10 minutes but they did not come back. I did that several times, with one or the other node, or both, being able to ping the ping node before disabling the switch ports.

I expected the transient pingd attributes that the nodes had

A) not to disappear, but only to get reset to 0
B) in case it is ok that they disappeared, to come back when the nodes are receiving echo replies from the ping node again.

But maybe I am still missing sth. or misunderstood. Who is right, me or the cluster?

The output of hb_report is attached.

kind regards
Sebastian
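The check done above with cibadmin and grep can also be scripted, which makes it easier to see per node whether the pingd attribute is present at all or merely set to 0. The following is a minimal sketch that parses a status section passed in as a string. The sample XML is an assumed, simplified shape of the output of "cibadmin -Q -o status", not a capture from a real cluster, and the function name is made up.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified status section; real CIB output contains more
# elements and attributes than shown here.
SAMPLE_STATUS = """
<status>
  <node_state id="node-a" uname="ppsdb101">
    <transient_attributes id="node-a">
      <instance_attributes id="status-node-a">
        <attributes>
          <nvpair id="status-node-a-pingd" name="pingd" value="100"/>
        </attributes>
      </instance_attributes>
    </transient_attributes>
  </node_state>
  <node_state id="node-b" uname="ppsnfs101"/>
</status>
"""


def pingd_values(status_xml, attr_name="pingd"):
    """Map each node name to its pingd value, or None if the attribute is missing."""
    root = ET.fromstring(status_xml.strip())
    result = {}
    for node in root.iter("node_state"):
        name = node.get("uname") or node.get("id")
        result[name] = None
        for nvpair in node.iter("nvpair"):
            if nvpair.get("name") == attr_name:
                result[name] = int(nvpair.get("value"))
    return result


print(pingd_values(SAMPLE_STATUS))
```

A value of None corresponds to the situation described above, where the attribute line has disappeared, as opposed to an attribute that is still there with value 0.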
Re: [Linux-HA] problem with locations depending on pingd
Hi,

> > Looks like heartbeat didn't notice the ping node went away.
> > If that doesn't happen, then the score wouldn't change.
> >
> > Are you sure you made the right change?
> 100% sure, I tested it several times. Started the ping node with allowing
> pings from say node A, but not node B, made sure with manual ping. Then
> started the cluster, and I saw all resources starting on A. Then
> reconfiguring the firewall on the ping node to answer pings from A and B,
> no need to check that it works, I just saw some of the resources
> migrating... Up to that point everything was as I expected. Then I could
> reconfigure the firewall on the ping node to not answer pings from either A
> or B anymore, but the value of pingd in the node attributes was not reset to
> 0. This is what I observed. Well, while writing this, I did not fire up
> tcpdump to see whether the answers really stopped, maybe the ping node kept
> track of some states? But I manually pinged the ping node from the cluster
> node that I disabled, and I did not get an answer.
>
Dumb user error on my side. After starting tcpdump on the ping node and reconfiguring the firewall, I saw it was as I suspected: the firewall was too smart for me ;) After flushing the state tables, the firewall stopped answering the pings, and the attribute got reset, so everything works as expected now.

sorry for the noise

thanks
Sebastian
Re: [Linux-HA] problem with locations depending on pingd
Hi, Andrew Beekhof <[EMAIL PROTECTED]> wrote: > > On Nov 9, 2007, at 1:44 PM, Sebastian Reitenbach wrote: > > > Hi, > > > > I changed the resources to look like this: > > > > > > > >> operation="not_defined"/> > >> operation="lte" > > value="0"/> > > > > > >> operation="eq" > > value="ppsnfs101"/> > > > > > > > > > > It seems to work well on startup, but I still have the same problem > > that the > > attribute that the pingd sets is not reset to 0 when pingd stops > > receiving > > ping answers from the ping node. > > Looks like heartbeat didn't notice the ping node went away. > If that doesn't happen, then the score wouldn't change. > > Are you sure you made the right change? 100% sure, I tested it several times. Started the ping node with allowing pings from say node A, but not node B, made sure with manual ping. Then started the cluster, and I saw all resources starting on A. Then reconfiguring the firwall on the ping node to answer pings from A and B, no need to check that it works, I just saw some of the resources migrating... Up to that point everything was as I expected. Then I could reconfigure the firewall on the ping node to not answer pings from either A or B anymore, but the value of pingd in the node attributes was not reset to 0. This is what I observed. Well, while writing this, I did not fired up tcpdump to see whether the answers really stopped, maybe the ping node kept track of some states? But I manually pinged the ping node from the cluster node that I disabled, and I did not got an answer. Sebastian Sebastian > > > > > I created a bugzilla entry, with a hb_report appended: > > http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi? > > id=1770 > > > > kind regards > > Sebastian > > > > > > Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA > > mailing list wrote: > >> Hi Dejan, > >> > >> thank you very much for your helpful hints, I got it mostly > >> working. 
I > >> initially generated the constraints via the GUI, and did not > >> recognized > > the > >> subtle differences.I changed them manually to look like what you > > suggested, > >> in your first example. I have to admit, I did not tried yet the - > >> INFINITY > >> example you gave, where the resources will refuse to work on a node > > without > >> connectivity. Because I think it would not work, when I see my > > observations: > >> > >> In the beginning, after cluster startup, node > >> 262387d6-3ba0-4001-95c6-f394d1ba243f > >> is not able to ping, node 15854123-86ef-46bb-bf95-79c99fb62f46 is > >> able to > >> ping > >> the defined ping node. > >> cibadmin -Q -o status | grep ping > >> >> provider="heartbeat"> > >> >> provider="heartbeat"> > >>>> name="pingd" value="0"/> > >> >> provider="heartbeat"> > >> >> provider="heartbeat"> > >>>> name="pingd" value="100"/> > >> > >> then, all four resources are on host 15854123-86ef-46bb- > >> bf95-79c99fb62f46, > >> so everything as I expected. > >> then, I changed the firewall to not answer pings from > >> 15854123-86ef-46bb-bf95-79c99fb62f46 > >> but instead answer pings from: 262387d6-3ba0-4001-95c6- > >> f394d1ba243f, then > >> it took some seconds, and the output changed to: > >> > >> cibadmin -Q -o status | grep ping > >>>> name="pingd" value="100"/> > >> >> provider="heartbeat"> > >> >> provider="heartbeat"> > >>>> name="pingd" value="100"/> > >> >> provider="heartbeat"> > >> >> provider="heartbeat"> > >> > >> and two of the resources went over to the node > >> 262387d6-3ba0-4001-95c6-f394d1ba243f. > >> > >> but also after some more minutes, the output of cibadmin -Q -o > >> status | > > grep > >> ping > >> did not changed again. Id expected it to look like this: > >>>> n
[Linux-HA] question regarding quorumd
Hi,

I did some tests with a two node cluster and a third machine running a quorumd. I started the quorumd, and then the two cluster nodes. The one that became DC started to communicate with the remote quorumd. I killed the DC, saw the other node become DC and start communicating with the remote quorumd; all fine, the cluster still had quorum.

Then I killed the quorumd itself. The DC noticed and started to stop all resources, as dictated by the quorum_policy, because it had lost quorum. Then I restarted the quorumd, but the DC, still without quorum, did not try to communicate with the quorumd again.

I'd expect the surviving DC to keep trying to contact the quorumd in case it comes back. If there is a good reason why the DC does not try to reconnect to the remote quorumd, I'd really like to be enlightened by someone who knows.

kind regards
Sebastian
Re: [Linux-HA] question regarding to quorumd
Hi,

Sebastian Reitenbach <[EMAIL PROTECTED]>, General Linux-HA mailing list wrote:
> Hi,
>
> here: http://www.linux-ha.org/QuorumServerGuide
> I read that in /etc/ha.d/quorumd.conf the version has to be:
>   the version of the protocol between the quorum server and its clients
>   (2_0_8 is the only version supported now)
>
> Is this still true for newer versions of heartbeat too? E.g. I use heartbeat
> 2.1.2, but maybe the quorum protocol version is still the same?

I think I can answer the question myself: I found the file /usr/lib64/heartbeat/plugins/quorumd/2_0_8.so, so I assume it is still version 2_0_8.

Sebastian
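The same check can be scripted. The Python sketch below lists the plugin files in the directory named above and strips the .so suffix; it rests on the assumption made in this message, not confirmed anywhere in the thread, that the plugin file name mirrors the protocol version. The function name is hypothetical and the directory may differ on other distributions.

```python
from pathlib import Path

# Directory taken from the message above; it may differ on other
# distributions or architectures.
DEFAULT_PLUGIN_DIR = "/usr/lib64/heartbeat/plugins/quorumd"


def quorumd_plugin_versions(plugin_dir=DEFAULT_PLUGIN_DIR):
    """Return the sorted names of the *.so plugins, without the suffix."""
    directory = Path(plugin_dir)
    if not directory.is_dir():
        return []
    return sorted(p.stem for p in directory.glob("*.so"))


if __name__ == "__main__":
    print(quorumd_plugin_versions())
```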
Re: [Linux-HA] problem with locations depending on pingd
Hi, I changed the resources to look like this: It seems to work well on startup, but I still have the same problem that the attribute that the pingd sets is not reset to 0 when pingd stops receiving ping answers from the ping node. I created a bugzilla entry, with a hb_report appended: http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1770 kind regards Sebastian Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA mailing list wrote: > Hi Dejan, > > thank you very much for your helpful hints, I got it mostly working. I > initially generated the constraints via the GUI, and did not recognized the > subtle differences.I changed them manually to look like what you suggested, > in your first example. I have to admit, I did not tried yet the -INFINITY > example you gave, where the resources will refuse to work on a node without > connectivity. Because I think it would not work, when I see my observations: > > In the beginning, after cluster startup, node > 262387d6-3ba0-4001-95c6-f394d1ba243f > is not able to ping, node 15854123-86ef-46bb-bf95-79c99fb62f46 is able to > ping > the defined ping node. > cibadmin -Q -o status | grep ping > provider="heartbeat"> > provider="heartbeat"> > name="pingd" value="0"/> > provider="heartbeat"> > provider="heartbeat"> > name="pingd" value="100"/> > > then, all four resources are on host 15854123-86ef-46bb-bf95-79c99fb62f46, > so everything as I expected. > then, I changed the firewall to not answer pings from > 15854123-86ef-46bb-bf95-79c99fb62f46 > but instead answer pings from: 262387d6-3ba0-4001-95c6-f394d1ba243f, then > it took some seconds, and the output changed to: > > cibadmin -Q -o status | grep ping > name="pingd" value="100"/> > provider="heartbeat"> > provider="heartbeat"> > name="pingd" value="100"/> > provider="heartbeat"> > provider="heartbeat"> > > and two of the resources went over to the node > 262387d6-3ba0-4001-95c6-f394d1ba243f. 
> > but also after some more minutes, the output of cibadmin -Q -o status | grep > ping > did not changed again. Id expected it to look like this: > name="pingd" value="100"/> > provider="heartbeat"> > provider="heartbeat"> > name="pingd" value="0"/> > provider="heartbeat"> > provider="heartbeat"> > and that the two resources from 15854123-86ef-46bb-bf95-79c99fb62f46 would > migrate to > node 262387d6-3ba0-4001-95c6-f394d1ba243f > > My assumption is, that the -INFINITY example would only work, when the value > for the id > status-15854123-86ef-46bb-bf95-79c99fb62f46-pingd would be resetted to 0 at > some > point, but it is not. Therefore I did not tried. > > > below are my constraints, the ping clone resource, and an exemplary Xen > resource. > > > to="MGMT_DB" action="start" symmetrical="false" score="0"/> > to="NFS_MH" action="start" symmetrical="false" score="0"/> > to="NFS_SW" action="start" symmetrical="false" score="0"/> > to="NFS_SW" action="start" symmetrical="false" score="0"/> > to="NFS_MH" action="start" symmetrical="false" score="0"/> > to="NFS_SW" action="start" symmetrical="false" score="0"/> > > > id="e248586f-284b-4d6e-86a1-86ac54cecb3d" operation="defined"/> > > > > > id="ccd4c85c-7b30-48c5-806e-d37a42e3db5b" operation="defined"/> > > > > > id="ff209e83-ac2e-4dad-901b-f6496c652f3b" operation="defined"/> > > > > > id="4349f298-2f36-4bfa-9318-ed9863ab32bb" operation="defined"/> > > > > > > > > > > value="started"/> > value="2"/> > name="clone_node_max" value="1"/> > name="globally_unique" value="false"/&
[Linux-HA] question regarding to quorumd
Hi,

here: http://www.linux-ha.org/QuorumServerGuide
I read that in /etc/ha.d/quorumd.conf the version has to be:

  the version of the protocol between the quorum server and its clients
  (2_0_8 is the only version supported now)

Is this still true for newer versions of heartbeat too? E.g. I use heartbeat 2.1.2, but maybe the quorum protocol version is still the same?

kind regards
Sebastian
Re: [Linux-HA] observations after some fencing tests in a two node
Sebastian Reitenbach <[EMAIL PROTECTED]>, General Linux-HA mailing list wrote:
> Hi,
> Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > On Wed, Nov 07, 2007 at 04:43:32PM +0100, Sebastian Reitenbach wrote:
> > > Hi all,
> > >
> > > I did some fencing tests in a two node cluster, here are some details
> > > of my setup:
> > >
> > > - use stonith external/ilo for fencing (ssh to ilo board and issue a
> > >   reset command)
> > > - both nodes are connected via two bridged ethernet interfaces to two
> > >   redundant switches. The ilo boards are connected to each of the
> > >   switches.
> > >
> > > My first observation:
> > > - when removing the network cables from the node that is the DC at the
> > >   moment, it took at least three minutes until it decided to stonith
> > >   the other node and to start up the resources that ran on the node
> > >   without network connectivity
> > > - when removing the network cables from the node that is not the DC,
> > >   then it was a matter of e.g. 20 seconds, then this node fenced the
> > >   DC, and then became DC
> >
> > This definitely deserves a set of logs, etc (is your hb_report
> > operational? :).
> humm, yes, with the latest patches (:
> ok, I'll reproduce the problem and create a report.

I seem to be unable to reproduce the problem. On the day it happened, there must have been some orphaned actions or resources in the way that have since disappeared.

Sebastian
Re: [Linux-HA] problem with locations depending on pingd
Hi Dejan,

thank you very much for your helpful hints, I got it mostly working. I initially generated the constraints via the GUI and did not notice the subtle differences. I changed them manually to look like what you suggested in your first example. I have to admit I have not yet tried the -INFINITY example you gave, where the resources refuse to run on a node without connectivity, because I think it would not work, given my observations:

In the beginning, after cluster startup, node 262387d6-3ba0-4001-95c6-f394d1ba243f is not able to ping, and node 15854123-86ef-46bb-bf95-79c99fb62f46 is able to ping the defined ping node.

cibadmin -Q -o status | grep ping

Then all four resources are on host 15854123-86ef-46bb-bf95-79c99fb62f46, so everything is as I expected.

Then I changed the firewall to not answer pings from 15854123-86ef-46bb-bf95-79c99fb62f46 but instead answer pings from 262387d6-3ba0-4001-95c6-f394d1ba243f. It took some seconds, and the output changed to:

cibadmin -Q -o status | grep ping

and two of the resources went over to node 262387d6-3ba0-4001-95c6-f394d1ba243f.

But even after some more minutes, the output of cibadmin -Q -o status | grep ping did not change again. I'd expected it to look like this:

and that the two remaining resources on 15854123-86ef-46bb-bf95-79c99fb62f46 would migrate to node 262387d6-3ba0-4001-95c6-f394d1ba243f.

My assumption is that the -INFINITY example would only work if the value for the id status-15854123-86ef-46bb-bf95-79c99fb62f46-pingd were reset to 0 at some point, but it is not. Therefore I did not try it.

Below are my constraints, the ping clone resource, and an exemplary Xen resource.
kind regards Sebastian Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: > Hi, > > On Wed, Nov 07, 2007 at 06:31:54PM +0100, Sebastian Reitenbach wrote: > > Hi, > > > > I tried to follow http://www.linux-ha.org/pingd, the section > > "Quickstart - Only Run my_resource on Nodes with Access to at Least One Ping > > Node" > > > > therefore I have created the following pingd resources: > > > > > > > > because all the clones will be equal. > > > > > > > > value="started"/> > > > value="2"/> > > > name="clone_node_max" value="1"/> > > > > > > > > > > > > > value="/tmp/PING.pid"/> > > > value="root"/> > > > name="host_list" value="192.168.102.199"/> > > > value="pingd"/> > > add these two > > > > > > > > > > > > > > > > > and here is my location constraint (entered via hb_gui, thererfore is a > > value there): > > > > > > > > > id="4349f298-2f36-4bfa-9318-ed9863ab32bb" operation="defined" value="af"/> > > > > Looks somewhat strange. There are quite a few better examples on > the page you quoted: > > > >attribute="pingd" operation="defined"/> > > > > or, perhaps better: > > > > > > > > > The latter will have a score of -INFINITY for all nodes which > don't have an attribute or it's value is zero thus preventing the > resource from running there. > > > The 192.168.102.199 is just an openbsd host, pingable from both cluster > > nodes. The NFS_MH resource is a Xen domU. > > On startup of the two cluster nodes, the NFS_MH node went to node1. > > Then I reconfigured the firewall of the ping node to only answer > > pings from node2. > > In the cluster itself, nothing happened, but I expected the resource to > > relocate to the node with connectivity. I still must do sth. wrong I think, > > any hints? > >
Re: [Linux-HA] Heartbeat lrmd is core dumping
General Linux-HA mailing list wrote:
> Along these lines, do you have to take the entire cluster down to do an
> upgrade of heartbeat (2.0.8 to 2.1.2), or can you take one node down,
> upgrade it, bring it back in the cluster, take down the other, etc?

The second way should work as you described.

Sebastian
Re: [Linux-HA] observations after some fencing tests in a two node
Hi, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: > Hi, > > On Wed, Nov 07, 2007 at 04:43:32PM +0100, Sebastian Reitenbach wrote: > > Hi all, > > > > I did some fencing tests in a two node cluster, here are some details of my > > setup: > > > > - use stonith external/ilo for fencing (ssh to ilo board and issue a reset > > command) > > - both nodes are connected via two bridged ethernet interfaces to two > > redundant switches. The ilo boards are connected to the each of the > > switches. > > > > My first observation: > > - when removing the network cables from the node that is the DC at the > > moment, it took at least three minutes, until it decided to stonith the > > other node and to startup the resources that ran on the node without network > > connectivity > > - when removing the network cables from the node that is not the DC, then it > > was a matter of e.g. 20 seconds, then this node fenced the DC, and then > > became DC > > This definitely deserves a set of logs, etc (is your hb_report > operational? :). humm, yes, with the latest patches (: ok, I'll reproduce the problem and create a report. > > > Why is there such a difference? The first one takes too long in my eyes to > > detect the outage, but I hope there are timeout values that I can tweak. For > > which ones shall I take a look? > > deadtime in ha.cf. > > > Also I recognized the following line in the logfile from the DC in the first > > case: > > tengine: ... info: extract_event: Stonith/shutdown of not matched > > This line shows up immediately after the DC detects that the other node is > > unreachable. From then it takes at least two minutes until the DC decides to > > fence the other node. > > Looks like a kind of misunderstanding between the CRM and > stonithd. Again, a report would hopefully reveal what's going on. > If you could turn debug on, that'd be great. A bugzilla is > fine too. I'll do, with above logs attached. 
> > > The second thing I observed: > > My stonith is working via ssh to the ilo board to the node that shall be > > fenced. When I remove the ethernet cables from one node, stonith will fail > > to kill the other node. > > > > take case two from above, remove the cables from the node that is not the > > DC, where I observed the following: > > The DC needs about some minutes to decide to fence the other node, because > > of the above observed behaviour. Meanwhile the non DC node without network > > cables tried to fence the DC, that failed, and the node was in a unclean > > state, until the DC fenced it in the end. > > Luckily the stonith of the DC failed, then assume instead of ssh as stonith > > resource, use a stonith devied connected to e.g. serial port. > > In that case, the non DC node were able to fence the DC, and then become DC > > itself, starting all resources, mounting all filesystems, ... > > Meanwhile the DC is restarted, and either heartbeat is not started > > automatically, then the cluster is unusable, because the one node that is DC > > has no network. Or when heartbeat is started automatically, it cannot > > communicate to the second node, and will assume this one is dead, > > and will insist on reseting it. Which would result in a yo-yo > machinery. Not entirely useful. This kind of lack of > communication is obviously detrimental, and that in spite of the > stonith configured. Right now don't see a solution to this issue. > Apart from pingd. > > > and start > > all its resources, so that e.g. filesystems could be mounted on both nodes. > > > > I don't have a hardware fencing device to test my theory, but could that > > happen or not? Could the usage of some ping nodes, combined with a pingd or > > an external quorumd help to solve the dilemma? > > A pingd resource with appropriate constraints would help, i.e. > something like "don't run resources if the pingd attribute is > zero". 
I am already fiddling around with ping, but can't seem to get it to work; see the other thread: "problem with locations depending on pingd"

> > Well, I am running heartbeat 2.1.2-15 on sles10sp1, any hints and
> > comments are appreciated.
>
> Thanks,
>
> Dejan

thank you,
Sebastian
Re: [Linux-HA] problem with locations depending on pingd
Hi, Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA mailing list wrote: > Hi, > > I tried to follow http://www.linux-ha.org/pingd, the section > "Quickstart - Only Run my_resource on Nodes with Access to at Least One Ping > Node" > > therefore I have created the following pingd resources: > > > > > value="started"/> > value="2"/> > name="clone_node_max" value="1"/> > > > > > > value="/tmp/PING.pid"/> > value="root"/> > name="host_list" value="192.168.102.199"/> > value="pingd"/> > > > > > > > and here is my location constraint (entered via hb_gui, thererfore is a > value there): > > > > id="4349f298-2f36-4bfa-9318-ed9863ab32bb" operation="defined" value="af"/> > > > > The 192.168.102.199 is just an openbsd host, pingable from both cluster > nodes. The NFS_MH resource is a Xen domU. > On startup of the two cluster nodes, the NFS_MH node went to node1. > Then I reconfigured the firewall of the ping node to only answer > pings from node2. > In the cluster itself, nothing happened, but I expected the resource to > relocate to the node with connectivity. I still must do sth. wrong I think, > any hints? I am still fiddling around to get the location based on ping connectivity working, I changed the location score from 100 to INFINITY. 
When the pingd resource is started, I see the following in /var/log/messages:

Nov 8 11:48:17 ppsnfs101 pingd: [16543]: info: do_node_walk: Requesting the list of configured nodes
Nov 8 11:48:18 ppsnfs101 pingd: [16543]: info: send_update: 1 active ping nodes
Nov 8 11:48:18 ppsnfs101 pingd: [16543]: info: main: Starting pingd
Nov 8 11:55:24 ppsnfs101 pingd: [21205]: info: Invoked: /usr/lib64/heartbeat/pingd -a pingd -d 1s -h 192.168.102.199

and this shows up when I take a look at the process list:

 6498 ?      SL   0:00 heartbeat: write: ping 192.168.102.199
 6499 ?      SL   0:00 heartbeat: read: ping 192.168.102.199
16543 ?      S    0:00 /usr/lib64/heartbeat/pingd -D -p /tmp/PING.pid -a pingd -d 1s -h 192.168.102.199
21709 pts/0  S+   0:00 grep ping

So the ping node was reachable from both cluster nodes on startup, and the resources started up on one of the hosts. I then changed the firewall rules on the ping node to only answer pings from the node on which the resource is not running, but nothing happened: no new ping-related output in /var/log/messages or in the ha-log files, on either host. I expected a note in one of the logfiles that ping is not working anymore, and the resource to migrate to the host with connectivity.

I also put both cluster nodes in standby and changed the firewall on the ping node to only answer pings from one cluster node. Then I started both nodes, and I saw the resource starting on the cluster node which had no connectivity.

kind regards
Sebastian
Re: [Linux-HA] observations after some fencing tests in a two node
"matilda matilda" <[EMAIL PROTECTED]> wrote: > >>> "Sebastian Reitenbach" <[EMAIL PROTECTED]> 07.11.2007 16:43 >>> > > >The second thing I observed: > >My stonith is working via ssh to the ilo board to the node that shall be > >fenced. When I remove the ethernet cables from one node, stonith will fail > >to kill the other node. > > Hi Sebastian, > > the answers to your questions will be interesting. :-) > > One additional question by me. > How did you set up the stonith device? External stonith plugin? > Where does this stonith resource run? One stonith resource for > both nodes or one for each? I have a cloned stonith resource, it is external/ssh, so every node can fence any other node via its ilo board. http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1649 I created it some time ago, the version there is not the latest one, I just stumbled over a bug today, while testing, that I fixed, but not yet uploaded again. Sebastian ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] problem with locations depending on pingd
Hi,

I tried to follow http://www.linux-ha.org/pingd, the section "Quickstart - Only Run my_resource on Nodes with Access to at Least One Ping Node".

Therefore I have created the following pingd resources:

and here is my location constraint (entered via hb_gui, therefore there is a value in it):

The 192.168.102.199 is just an OpenBSD host, pingable from both cluster nodes. The NFS_MH resource is a Xen domU. On startup of the two cluster nodes, the NFS_MH resource went to node1. Then I reconfigured the firewall of the ping node to only answer pings from node2. In the cluster itself nothing happened, but I expected the resource to relocate to the node with connectivity. I must still be doing something wrong, I think; any hints?

kind regards
Sebastian
[Linux-HA] observations after some fencing tests in a two node
Hi all,

I did some fencing tests in a two node cluster. Here are some details of my setup:

- use stonith external/ilo for fencing (ssh to the ilo board and issue a reset command)
- both nodes are connected via two bridged ethernet interfaces to two redundant switches. The ilo boards are connected to each of the switches.

My first observation:

- when removing the network cables from the node that is currently the DC, it took at least three minutes until it decided to stonith the other node and to start up the resources that ran on the node without network connectivity
- when removing the network cables from the node that is not the DC, it was a matter of e.g. 20 seconds, then this node fenced the DC and became DC itself

Why is there such a difference? The first case takes too long, in my eyes, to detect the outage, but I hope there are timeout values that I can tweak. Which ones should I look at?

I also noticed the following line in the logfile of the DC in the first case:

tengine: ... info: extract_event: Stonith/shutdown of not matched

This line shows up immediately after the DC detects that the other node is unreachable. From then on it takes at least two minutes until the DC decides to fence the other node.

The second thing I observed: my stonith works via ssh to the ilo board of the node that shall be fenced. When I remove the ethernet cables from one node, that node's stonith will fail to kill the other node.

Take case two from above and remove the cables from the node that is not the DC. There I observed the following: the DC needs some minutes to decide to fence the other node, because of the behaviour described above. Meanwhile the non-DC node without network cables tried to fence the DC; that failed, and the node was in an unclean state until the DC finally fenced it.

Luckily the stonith of the DC failed. But now assume that, instead of ssh, the stonith resource uses a stonith device connected to e.g. a serial port. In that case the non-DC node would be able to fence the DC and then become DC itself, starting all resources, mounting all filesystems, and so on. Meanwhile the old DC is restarted. Either heartbeat is not started automatically, and then the cluster is unusable because the one node that is DC has no network; or heartbeat is started automatically, cannot communicate with the second node, assumes that one is dead, and starts all its resources, so that e.g. filesystems could be mounted on both nodes.

I don't have a hardware fencing device to test my theory, but could that happen or not? Could the use of some ping nodes, combined with pingd or an external quorumd, help to solve the dilemma?

I am running heartbeat 2.1.2-15 on sles10sp1; any hints and comments are appreciated.

kind regards
Sebastian
Re: [Linux-HA] best practices monitoring services in Xen instances
Hi Andrew,

"Andrew Beekhof" <[EMAIL PROTECTED]> wrote:
> On 11/5/07, Sebastian Reitenbach <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > to remove complexity from my cluster, I am experimenting with Xen.
> > Starting and stopping the Xen resources via heartbeat works well already.
> > I am a bit concerned about the services in the virtual machines, how is
> > the best approach to monitor their availability?
>
> what you're talking about is basically having the crm manage resources
> on non-cluster nodes.
>
> we've kicked around some ideas for implementing this in the past but
> its never really bubbled to the top of anyone's todo list.
>
> there's not really any "best practices" for this as its not really
> being done a whole lot (from what I hear anyway). depending on how
> complex the relationships between the resources inside the Xen guests
> are, i'd go with option 1 (if they're complex) or 2 (if not)

Thank you for your comments. I more or less have to check that the services do not get killed by the OOM killer. E.g. when I have three domUs running and want to start a fourth one but have no free memory available, I have to shrink the memory of the already running domUs via Xen's mem-set. When I do that, it can happen that the OOM killer in a domU kills the very services that the domU is intended to provide. Unfortunately, heartbeat has nothing to detect that yet.

I am just tweaking the Xen resource script. I added a parameter, OCF_RESKEY_monitor_scripts, naming scripts that the Xen resource script runs when the monitor action for the domU is called. These custom scripts test the services assigned to the domU; if one fails, the whole domU is restarted via heartbeat, which hopefully gets the internal service restarted too.

Sebastian

> > I have some solutions, but would like to know what corresponds to best
> > practice:
> >
> > - install heartbeat in the virtual domains too, then monitor the
> >   resources within the xen instance, but I think this is
> >   counterproductive as I wanted to remove complexity from the cluster
> >   due to having less resources.
> >
> > - monitor the services in the virtual domains using SNMP, or custom
> >   scripts, and in case sth. fails, crm_resource stop and start it again.
> >   Well, custom scripts sounds a bit error prone.
> >
> > - I don't know whether xen has the ability, but does the privileged
> >   domain has the ability to query a given domU for the state of a
> >   process, and in case the state is just not the wanted one, restart
> >   the domU.
> >
> > I think the last one, would be the best, but I have no idea, whether xen
> > can do that at all. I played around with OpenVZ for a short while, that
> > at least could do that. Any other ideas, comments, rants are very
> > welcome.
> >
> > kind regards
> > Sebastian
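The monitor hook described above boils down to "run every listed script, report failure as soon as one fails". The real Xen resource agent is a shell script and the actual patch is not shown in this thread, so the following Python sketch only illustrates the logic. It assumes the parameter holds a whitespace-separated list of script paths; the function name and the timeout are hypothetical. The return codes 0 and 1 are the standard OCF values for success and generic error.

```python
import shlex
import subprocess

OCF_SUCCESS = 0      # standard OCF return code for success
OCF_ERR_GENERIC = 1  # standard OCF return code for a generic failure


def run_monitor_scripts(monitor_scripts, timeout=30):
    """Run each script in the whitespace-separated list; fail on the first error."""
    for script in shlex.split(monitor_scripts or ""):
        try:
            result = subprocess.run([script], timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            # Missing script, not executable, or hung: treat as a failed check.
            return OCF_ERR_GENERIC
        if result.returncode != 0:
            return OCF_ERR_GENERIC
    return OCF_SUCCESS


if __name__ == "__main__":
    import os
    import sys
    # Mirrors how an agent would read the parameter from its environment.
    sys.exit(run_monitor_scripts(os.environ.get("OCF_RESKEY_monitor_scripts", "")))
```

Stopping at the first failure keeps the monitor action short, which matters because a slow monitor can itself run into the operation timeout and be treated as a failure.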
[Linux-HA] best practices monitoring services in Xen instances
Hi,

to remove complexity from my cluster, I am experimenting with Xen. Starting and stopping the Xen resources via heartbeat already works well. I am a bit concerned about the services in the virtual machines: what is the best approach to monitor their availability?

I have some solutions, but would like to know what corresponds to best practice:

- install heartbeat in the virtual domains too, then monitor the resources within the Xen instance. I think this is counterproductive, as I wanted to remove complexity from the cluster by having fewer resources.

- monitor the services in the virtual domains using SNMP or custom scripts, and in case something fails, use crm_resource to stop and restart the domU. Well, custom scripts sound a bit error prone.

- I don't know whether Xen has the ability, but can the privileged domain query a given domU for the state of a process and, in case the state is not the wanted one, restart the domU?

I think the last one would be the best, but I have no idea whether Xen can do that at all. I played around with OpenVZ for a short while; that at least could do it. Any other ideas, comments, rants are very welcome.

kind regards
Sebastian
Re: [Linux-HA] hb_gui failing to authenticate... although it has worked in the past
John Gardner <[EMAIL PROTECTED]> wrote:
> > Well, I don't know how to fix your GUI problem, but what about using
> > cibadmin -Q -o resources, to get all configured resources, edit the
> > output, add a new IP address resource, and add it cibadmin -U to update
> > it?
> >
> > Sebastian
>
> Sebastian, now I have another problem (lack of knowledge on my part!)
>
> When I type:
>
> cibadmin -Q -o resources
>
> I get:
>
> id="virtual_ip">
> value="192.168.1.74"/>
>
> How would I add another virtual ip address using cibadmin?
>
> Presumably I'd use the modify (-M) switch? How do I generate the unique
> nvpair id?

Take something like this:

and update the CIB with

cibadmin -U -o resources -x resources.xml

where resources.xml is a file containing the above example. I am not perfectly sure whether -U, -R, or -M is the correct parameter.

Sebastian
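On the question of how to generate a unique nvpair id: as far as I know the id only has to be unique within the CIB, so any unique string will do, and the ids written by hb_gui elsewhere in this archive look like random UUIDs. A minimal Python sketch follows. The element layout (primitive, instance_attributes, attributes, nvpair) is an assumption about the heartbeat 2.x CIB and should be compared with your own cibadmin -Q output; the resource id and the IP address are placeholders, and the helper names are hypothetical.

```python
import uuid
import xml.etree.ElementTree as ET


def new_id():
    """Return a random UUID string, similar in shape to the ids hb_gui writes."""
    return str(uuid.uuid4())


def ipaddr_primitive(resource_id, ip):
    """Build an IPaddr primitive element; the element layout is an assumption."""
    primitive = ET.Element("primitive", {
        "id": resource_id,
        "class": "ocf",
        "type": "IPaddr",
        "provider": "heartbeat",
    })
    inst = ET.SubElement(primitive, "instance_attributes", {"id": new_id()})
    attrs = ET.SubElement(inst, "attributes")
    ET.SubElement(attrs, "nvpair", {"id": new_id(), "name": "ip", "value": ip})
    return primitive


if __name__ == "__main__":
    # "virtual_ip_2" and 192.168.1.75 are placeholders, not taken from the thread.
    element = ipaddr_primitive("virtual_ip_2", "192.168.1.75")
    print(ET.tostring(element, encoding="unicode"))
```

The printed fragment could be saved to a file such as resources.xml and reviewed by hand before it is passed to cibadmin.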
Re: [Linux-HA] hb_gui failing to authenticate... although it has worked in the past
Hi,

General Linux-HA mailing list wrote:
> Dejan Muhamedagic wrote:
> > Hi,
> >
> > On Wed, Oct 31, 2007 at 01:44:14PM +, John Gardner wrote:
> >> I'm running CentOS 4.5 and I've had heartbeat up and running
> >> successfully for a number of months, and configured it initially with
> >> hb_gui, but I've never really used it since then... Anyway, 4 months on
> >> I powered up hb_gui to make some changes and it won't connect to the
> >> heartbeat server, it gives this error:
> >>
> >> mgmtd: [23909]: ERROR: on_listen receive login msg failed
> >
> > The first message on the connection was not a proper login
> > message. This used to happen in the past also because of
> > different client and server versions.
> >
> >> Why would it suddenly stop working?
> >
> > Are you absolutely sure that nothing changed in the meantime?
> > Things don't just stop working. Typically there is a reason.
>
> Yeah, I agree. Things don't stop working... but this has. OK, that's
> not true. hb_gui hasn't stopped working, the GUI still appears and
> heartbeat is still operating fine, the only problem is that hb_gui will
> now no longer connect to heartbeat.
>
> > Which heartbeat version do you run?
>
> I'm using 2.0.8 which is packaged for CentOS. I'm using version 2.0.8
> of hb_gui and 2.0.8 of heartbeat.
>
> I've installed hb_gui on three separate servers, two access the
> heartbeat server via VPN and the third is on the same subnet as the
> heartbeat server. All three connect using:
>
> Server(:port) 192.168.1.65:694
> Username heartbeat
> Password xxx
>
> The 'heartbeat' user is definitely in the haclient group (see /etc/group
> below)
>
> haclient:x:90:heartbeat
>
> But every time I try to connect from either hb_gui client I get:
>
> Failed in the authentication
> User Name or Password may be wrong.
> or the user doesn't belong to the haclient group
>
> on the GUI and the following in the log:
>
> Nov 5 10:26:56 server01 mgmtd: [25556]: ERROR: on_listen receive login msg failed
> Nov 5 10:32:10 server01 mgmtd: [25556]: ERROR: on_listen receive login msg failed
>
> I'm at a bit of a loss what to do next really, I need to add another
> Virtual IP today, but I can't connect :-(

Well, I don't know how to fix your GUI problem, but what about using cibadmin -Q -o resources to get all configured resources, editing the output to add a new IP address resource, and then using cibadmin -U to update the CIB with it?

Sebastian
Re: [Linux-HA] Patch to fix long interface names in IPaddr.
Hi,

General Linux-HA mailing list wrote:
>
> On Nov 2, 2007, at 7:22 AM, Sean Reifschneider wrote:
>
> > If you have a long interface name, such as "vlan1000", ifconfig cuts off
> > alias names so that it shows "vlan1000:" instead of "vlan1000:0". This is
> > on probably pretty much all Linux, but specifically we were using Debian
> > Etch. I presume that there would be a similar problem for >9 aliases on an
> > interface named something like "vlan999" or >99 aliases on an interface
> > named "vlan99".
> >
> > The behavior is that "start" works, but stop tries to remove the alias from
> > "vlan1000:", which fails, leaving the IP up on the passive machine if you
> > have gracefully failed over (and STONITH doesn't kill the previously-active
> > node).
> >
> > I tracked this down via the logs, and Scott Kleihege used his awk-fu to work
> > up the following patch. I'm not sure if you'll want to include this as it
> > relies on the "iproute2" program "ip" to be installed,
>
> yeah, that's going to be problematic as IPaddr needs to work on non-linux
> systems.

At least this shouldn't harm OpenBSD, because the alias notation there differs from Linux: the interface:N notation is not used.

> IPaddr2 has always been linux specific though...

Sebastian
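The patch itself is not part of this archive. Purely as an illustration of the idea of asking iproute2 instead of ifconfig, the sketch below looks up the alias label of an address in one-line "ip -o addr show" style output. The function name find_alias_label is hypothetical, and the exact field layout is an assumption that should be checked against the iproute2 version in use.

```sh
# Hypothetical helper: print the label (e.g. "vlan1000:0") that carries
# the given IPv4 address. Reads "ip -o -f inet addr show" style lines on
# stdin, assumed to look like:
#   2: vlan1000    inet 10.0.0.5/24 brd 10.0.0.255 scope global vlan1000:0
find_alias_label() {
    awk -v ip="$1" '
        $3 == "inet" {
            split($4, a, "/")              # "10.0.0.5/24" -> a[1] = "10.0.0.5"
            if (a[1] != ip) next
            dev = $2
            label = dev                    # fall back to the bare device name
            for (i = 5; i <= NF; i++) {
                f = $i
                sub(/\\.*$/, "", f)        # drop the "\" that -o prints instead of a newline
                if (index(f, dev ":") == 1) label = f
            }
            print label
            exit
        }
    '
}

# Usage (commented out, Linux with iproute2 only):
#   ip -o -f inet addr show | find_alias_label 10.0.0.5
```

Because the label is read as a whole field, it is not cut off the way the ifconfig column is for long interface names.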
Re: [Linux-HA] hb_gui failing to authenticate... although it has worked in the past
Hi,

John Gardner <[EMAIL PROTECTED]> wrote:
> Sebastian Reitenbach wrote:
> > Hi,
> >
> > General Linux-HA mailing list wrote:
> >> I've checked by connecting to the server using hb_gui on the same subnet
> >> so I know it's not firewall related. It has just inexplicably stopped
> >> working!
> >>
> >> Does any one have any other ideas as to why the hb_gui connection has
> >> stopped working? Is there any other way to set up a virtual ip without
> >> using hb_gui?
> > I had a similar problem on openSUSE 10.2. Check /etc/pam.d/hbmgmtd, and
> > change pam_unix.so to pam_unix2.so. That fixed the problem for me on
> > openSUSE, although I just checked on SLES10 SP1, where I only have
> > pam_unix.so and it is working.
> >
> > Sebastian
>
> Thanks Sebastian
>
> Will pam_unix and pam_unix2 coexist on the same box? On CentOS pam_unix
> is installed from a package, pam_unix2 seems to be only available as
> source... if I build it, presumably I can keep the other?

Unfortunately I have no idea. I haven't tried to use both on the same machine, but as far as I can see there is no reason why not.

Sebastian
Re: [Linux-HA] hb_gui failing to authenticate... although it has worked in the past
Hi,

General Linux-HA mailing list wrote:
>
> I've checked by connecting to the server using hb_gui on the same subnet
> so I know it's not firewall related. It has just inexplicably stopped
> working!
>
> Does any one have any other ideas as to why the hb_gui connection has
> stopped working? Is there any other way to set up a virtual ip without
> using hb_gui?

I had a similar problem on openSUSE 10.2. Check /etc/pam.d/hbmgmtd, and change pam_unix.so to pam_unix2.so. That fixed the problem for me on openSUSE, although I just checked on SLES10 SP1, where I only have pam_unix.so and it is working.

Sebastian
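The edit described above can be sketched as a small filter. This is only an illustration: the function name use_pam_unix2 is made up, and whether pam_unix2.so is available at all depends on the distribution, as the follow-up in this thread shows.

```sh
# Hypothetical helper: read a PAM service file on stdin and write a copy
# with pam_unix.so replaced by pam_unix2.so. Lines that already use
# pam_unix2.so are left alone, so running it twice is harmless.
use_pam_unix2() {
    sed 's/pam_unix\.so/pam_unix2.so/g'
}

# Usage (commented out, keeps a backup of the original file):
#   cp /etc/pam.d/hbmgmtd /etc/pam.d/hbmgmtd.orig
#   use_pam_unix2 < /etc/pam.d/hbmgmtd.orig > /etc/pam.d/hbmgmtd
```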
[Linux-HA] Xen memory allocation in the cluster
Hi,

when a virtual Xen machine migrates from one node to another, the virtual hosts remaining on the original node could potentially be given more memory, while on the new node the virtual hosts already there have to give up some of their memory to make room for the new one. As far as I know, Xen is not able to handle this automatically, but it can be set manually via the xm mem-set command. I am looking for an OCF resource script that would raise/lower the memory of the virtual nodes automagically. Does something like this exist already?

kind regards
Sebastian
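No such resource script is named in this thread. As a minimal sketch of the rebalancing step only, assuming an even split of a fixed memory budget is acceptable, the function below prints the xm mem-set commands instead of running them. The name plan_mem_set, the domain names, and the budget are all hypothetical.

```sh
# Hypothetical helper: split a memory budget (in MB) evenly across the
# given domains and print the xm commands instead of running them.
plan_mem_set() {
    total_mb=$1
    shift
    [ "$#" -gt 0 ] || return 1        # nothing to do without domains
    share=$((total_mb / $#))          # integer division, the remainder stays unused
    for dom in "$@"; do
        echo "xm mem-set $dom $share"
    done
}

# Usage (commented out): review the plan, then pipe it to sh on the Xen host.
#   plan_mem_set 6144 vm1 vm2 vm3
```

Printing the commands first keeps the sketch safe to try; a real agent would also have to decide when to run it, for example after a migration has completed.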
[Linux-HA] Xen live resource migration
Hi all,

is there a good reason why the allow_migrate parameter of the Xen resource script is not included in the meta-data output? I had to read the Xen resource script to stumble across that parameter, and it took me another while to figure out where to specify it. If there is no really good reason to keep the parameter out of the meta-data, I'd create a patch to add it there.

kind regards
Sebastian
Re: [Linux-HA] GUI fails to authenticate hacluster user
Hi,

> > in /etc/pam.d the hbmgmtd file is there:
> > cat hbmgmtd
> > auth required pam_unix.so
> > account required pam_unix.so
> >
> > another GUI instance, still running since yesterday, was still working
> > fine.
> >
> > any idea what could have caused this behaviour?
> >
> > kind regards
> > Sebastian
>
> use pam_unix2.so

Thanks a lot, that helped.

kind regards
Sebastian
[Linux-HA] GUI fails to authenticate hacluster user
Hi,

system: opensuse 10.2, i586
hbversion: 2.1.2

I just tried to use the GUI on one of our clusters and log in as the hacluster user. The login did not work. In the logs I can see the following messages:

Aug 24 14:18:21 ogo2 mgmtd: pam_unix(hbmgmtd:auth): authentication failure; logname= uid=0 euid=0 tty= ruser= rhost= user=hacluster
Aug 24 14:18:23 ogo2 mgmtd: [10035]: ERROR: on_listen pam auth failed

The password is correct; I tried to su hacluster using the same password.

in /etc/pam.d the hbmgmtd file is there:
cat hbmgmtd
auth required pam_unix.so
account required pam_unix.so

Another GUI instance, still running since yesterday, was still working fine.

Any idea what could have caused this behaviour?

kind regards
Sebastian
Re: [Linux-HA] cluster chokes after upgrade from 2.0.8 to 2.1.2
Hi, crmd[19331]: 2007/08/23_20:21:48 WARN: msg_to_op(1173): failed to get the value of field lrm_opstatus from a ha_msg crmd[19331]: 2007/08/23_20:21:48 info: msg_to_op: Message follows: crmd[19331]: 2007/08/23_20:21:48 info: MSG: Dumping message with 13 fields crmd[19331]: 2007/08/23_20:21:48 info: MSG[0] : [lrm_t=op] crmd[19331]: 2007/08/23_20:21:48 info: MSG[1] : [lrm_rid=IP_SysLog] crmd[19331]: 2007/08/23_20:21:48 info: MSG[2] : [lrm_op=monitor] crmd[19331]: 2007/08/23_20:21:48 info: MSG[3] : [lrm_timeout=5] crmd[19331]: 2007/08/23_20:21:48 info: MSG[4] : [lrm_interval=1] crmd[19331]: 2007/08/23_20:21:48 info: MSG[5] : [lrm_delay=3] crmd[19331]: 2007/08/23_20:21:48 info: MSG[6] : [lrm_targetrc=-2] crmd[19331]: 2007/08/23_20:21:48 info: MSG[7] : [lrm_app=crmd] crmd[19331]: 2007/08/23_20:21:48 info: MSG[8] : [lrm_userdata=38:81:cf036593-e41b-4560-8215-be1aaf753b91] crmd[19331]: 2007/08/23_20:21:48 info: MSG[9] : [(2)lrm_param=0x80a6d8(373 461)] crmd[19331]: 2007/08/23_20:21:48 info: MSG: Dumping message with 15 fields crmd[19331]: 2007/08/23_20:21:48 info: MSG[0] : [target_role=started] crmd[19331]: 2007/08/23_20:21:48 info: MSG[1] : [CRM_meta_interval=1] crmd[19331]: 2007/08/23_20:21:48 info: MSG[2] : [ip=192.168.102.39] crmd[19331]: 2007/08/23_20:21:48 info: MSG[3] : [CRM_meta_prereq=fencing] crmd[19331]: 2007/08/23_20:21:48 info: MSG[4] : [CRM_meta_start_delay=3] crmd[19331]: 2007/08/23_20:21:48 info: MSG[5] : [CRM_meta_role=Started] crmd[19331]: 2007/08/23_20:21:48 info: MSG[6] : [cidr_netmask=23] crmd[19331]: 2007/08/23_20:21:48 info: MSG[7] : [CRM_meta_id=873df73f-b63a-4645-91d9-7921eec339a1] crmd[19331]: 2007/08/23_20:21:48 info: MSG[8] : [broadcast=192.168.103.255] crmd[19331]: 2007/08/23_20:21:48 info: MSG[9] : [CRM_meta_timeout=5] crmd[19331]: 2007/08/23_20:21:48 info: MSG[10] : [CRM_meta_on_fail=fence] crmd[19331]: 2007/08/23_20:21:48 info: MSG[11] : [crm_feature_set=2.0] crmd[19331]: 2007/08/23_20:21:48 info: MSG[12] : [CRM_meta_disabled=false] 
crmd[19331]: 2007/08/23_20:21:48 info: MSG[13] : [CRM_meta_name=monitor] crmd[19331]: 2007/08/23_20:21:48 info: MSG[14] : [nic=bridge0] crmd[19331]: 2007/08/23_20:21:48 info: MSG[10] : [lrm_callid=41] crmd[19331]: 2007/08/23_20:21:48 info: MSG[11] : [lrm_app=crmd] crmd[19331]: 2007/08/23_20:21:48 info: MSG[12] : [lrm_callid=41] also when I add another node to the cluster, I see above messages, and the cluster starts a short time later stonithing itself. Sebastian Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA mailing list wrote: > > > Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: > > On Thu, Aug 23, 2007 at 09:43:09AM +0200, Sebastian Reitenbach wrote: > > > Hi, > > > > > > Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA > > > mailing list wrote: > > > > Hi list, > > > > > > > > after upgrading a two node cluster from 2.0.8 to 2.1.2, running on > SLES > > > 10, > > > > x86_64, I see every 17 seconds the following line in the logs: > > > > > > > > lrmd[3651]: 2007/08/22_18:28:30 notice: Not currently connected. > > > > > > > > should I worry about that note? > > > > > > after recreating the whole configuration via the GUI point n' click > orgy, > > > this notice disappeared. Also the problem described below is gone, the > > > cluster seems to behave just fine now. > > > > > > > > > > > This happens when one node is stopped. Adding the second node to the > > > > cluster, then the IPaddr resources start to going crazy. It seems that > are > > > > always the last IP addresses that are configured in the resources > cib.xml > > > at > > > > the end that fail. Some of the IPaddr resources have no problem. The > > > > configuration worked for weeks with heartbeat 2.0.8. > > > > > > > > heartbeat spams about 5MB/minute into the logfiles, therefore I do not > > > want > > > > to append them here (: > > > > > > In case anybody is interested in logfiles/configuration old and new one, > I > > > can open a bugzilla entry. > > > > Yes. 
If it's not too much trouble. Do you still have your old > > configuration saved? Did you try to find differences between the > > old and the new one with crm_diff? The whole thing seems to be > > quite strange. > crm_diff produces a looong output, the resoure ID's are different, > nevertheless, I created a bugreport: > > http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1694 > > I am not perfectly sure about whether the log really is from the time of a > problem, but If I see t
Re: [Linux-HA] cluster chokes after upgrade from 2.0.8 to 2.1.2
Dejan Muhamedagic <[EMAIL PROTECTED]> wrote: > On Thu, Aug 23, 2007 at 09:43:09AM +0200, Sebastian Reitenbach wrote: > > Hi, > > > > Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA > > mailing list wrote: > > > Hi list, > > > > > > after upgrading a two node cluster from 2.0.8 to 2.1.2, running on SLES > > 10, > > > x86_64, I see every 17 seconds the following line in the logs: > > > > > > lrmd[3651]: 2007/08/22_18:28:30 notice: Not currently connected. > > > > > > should I worry about that note? > > > > after recreating the whole configuration via the GUI point n' click orgy, > > this notice disappeared. Also the problem described below is gone, the > > cluster seems to behave just fine now. > > > > > > > > This happens when one node is stopped. Adding the second node to the > > > cluster, then the IPaddr resources start to going crazy. It seems that are > > > always the last IP addresses that are configured in the resources cib.xml > > at > > > the end that fail. Some of the IPaddr resources have no problem. The > > > configuration worked for weeks with heartbeat 2.0.8. > > > > > > heartbeat spams about 5MB/minute into the logfiles, therefore I do not > > want > > > to append them here (: > > > > In case anybody is interested in logfiles/configuration old and new one, I > > can open a bugzilla entry. > > Yes. If it's not too much trouble. Do you still have your old > configuration saved? Did you try to find differences between the > old and the new one with crm_diff? The whole thing seems to be > quite strange. crm_diff produces a looong output, the resoure ID's are different, nevertheless, I created a bugreport: http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1694 I am not perfectly sure about whether the log really is from the time of a problem, but If I see the problem again, I'll readd some new logs. 
Sebastian
Re: [Linux-HA] cluster chokes after upgrade from 2.0.8 to 2.1.2
Hi,

Sebastian Reitenbach <[EMAIL PROTECTED]>,General Linux-HA mailing list wrote:
> Hi list,
>
> after upgrading a two node cluster from 2.0.8 to 2.1.2, running on SLES 10,
> x86_64, I see every 17 seconds the following line in the logs:
>
> lrmd[3651]: 2007/08/22_18:28:30 notice: Not currently connected.
>
> should I worry about that note?

After recreating the whole configuration via the GUI point n' click orgy, this notice disappeared. The problem described below is gone as well; the cluster seems to behave just fine now.

> This happens when one node is stopped. Adding the second node to the
> cluster, then the IPaddr resources start to going crazy. It seems that are
> always the last IP addresses that are configured in the resources cib.xml at
> the end that fail. Some of the IPaddr resources have no problem. The
> configuration worked for weeks with heartbeat 2.0.8.
>
> heartbeat spams about 5MB/minute into the logfiles, therefore I do not want
> to append them here (:

In case anybody is interested in the logfiles and in the old and new configuration, I can open a bugzilla entry.

kind regards
Sebastian
[Linux-HA] cluster chokes after upgrade from 2.0.8 to 2.1.2
Hi list,

after upgrading a two node cluster from 2.0.8 to 2.1.2, running on SLES 10, x86_64, I see the following line in the logs every 17 seconds:

lrmd[3651]: 2007/08/22_18:28:30 notice: Not currently connected.

Should I worry about that note?

This happens when one node is stopped. When I add the second node to the cluster, the IPaddr resources start going crazy. It seems that it is always the last IP addresses configured at the end of the resources section in cib.xml that fail. Some of the IPaddr resources have no problem. The configuration worked for weeks with heartbeat 2.0.8.

heartbeat spams about 5MB/minute into the logfiles, therefore I do not want to append them here (:

Is this known to somebody else?

kind regards
Sebastian
[Linux-HA] heartbeat debug rpm packages for suse 10.2?
Hi, I am running heartbeat-2.1.0 on openSUSE 10.2, downloaded from here: http://download.opensuse.org/repositories/server:/ha-clustering/openSUSE_10.2/x86_64/ In /var/log/messages, I saw the lrmd crashing, and restarting: Aug 1 11:29:37 srv5 tengine: [10166]: info: process_graph_event: Detected action resource_PUB_IPS_monitor_0 from a different transition: 190 vs. 195 Aug 1 11:29:37 srv5 tengine: [10166]: info: match_graph_event: Action resource_PD_NFS_monitor_0 (15) confirmed on srv5 Aug 1 11:29:37 srv5 cib: [12892]: info: write_cib_contents: Wrote version 0.122.21866 of the CIB to disk (digest: 8e6e22d91098f6d62a3b5fb6dc1965c2) Aug 1 11:29:37 srv5 heartbeat: [9640]: WARN: Exiting /usr/lib64/heartbeat/lrmd -r process 9840 killed by signal 11 [SIGSEGV - Segmentation violation]. Aug 1 11:29:37 srv5 heartbeat: [9640]: ERROR: Exiting /usr/lib64/heartbeat/lrmd -r process 9840 dumped core Aug 1 11:29:37 srv5 heartbeat: [9640]: ERROR: Respawning client "/usr/lib64/heartbeat/lrmd -r": Aug 1 11:29:37 srv5 heartbeat: [9640]: info: Starting child client "/usr/lib64/heartbeat/lrmd -r" (0,0) Aug 1 11:29:37 srv5 heartbeat: [12935]: info: Starting "/usr/lib64/heartbeat/lrmd -r" as uid 0 gid 0 (pid 12935) unfortunatley the lrmd does not contain debugging symbols: srv5:/ # gdb -c core GNU gdb 6.5 Copyright (C) 2006 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "x86_64-suse-linux". Core was generated by `/usr/lib64/heartbeat/lrmd -r'. Program terminated with signal 11, Segmentation fault. #0 0x0040801a in ?? () (gdb) symbol-file /usr/lib64/heartbeat/lrmd Reading symbols from /usr/lib64/heartbeat/lrmd...(no debugging symbols found)...done. Using host libthread_db library "/lib64/libthread_db.so.1". 
(gdb) bt
#0 0x0040801a in g_str_equal ()
#1 0x0040835c in g_str_equal ()
#2 0x2b301fc80e29 in ?? ()
#3 0x006acd58 in ?? ()
#4 0x in ?? ()

Are there debug RPMs available that include debugging symbols?

kind regards
Sebastian
[Linux-HA] two node cluster preventing split brain?
Hi list,

I am running heartbeat-2.1.0 on openSUSE 10.2. I use ssh to the iLO of the servers to stonith them, and stonith generally works well. I configured the two iLO IP addresses as ping nodes in /etc/ha.d/ha.cf. I have a clone set pingd, a clone set stonith, and a clone set suicide defined, each with max_clones=2 and max_clone_node=1. Stonith is enabled in the cluster.

Now my test scenario:
- I remove the cables from the iLO boards
- heartbeat correctly detects both as down
- then I remove the network cables from one host

Here the split brain situation happens. Stonith cannot work, as the iLO boards are not reachable by any host.

Here are my questions: Can I define constraints in the cluster, or operations on the suicide resource or the pingd resource, so that the current DC stays alive and the other node suicides itself? How is suicide intended to work? Unfortunately it is not a script that I can just read.

kind regards
Sebastian

autojoin any
crm true
deadtime 15
initdead 30
keepalive 2
ping 192.168.0.96 192.168.0.97
node srv4
node srv5
mcast bond0 224.0.0.1 700 1 0
cluster MyCluster
Re: [Linux-HA] Failover for multiple xDSL/FW
Hi,

> > Hi to All!
> >
> > If I have this example configuration:
> >
> > ROUTER1--- FW1
> > - LAN/Client
> > ROUTER2---FW2
> >
> > ROUTER1 = 80.0.0.0/29
> > ROUTER2 = 90.0.0.0/29
> >
> > FW = Linux
> > FW1 (LAN) = 192.168.0.253
> > FW2 (LAN) = 192.168.0.252
> >
> > GW Client LAN = 192.168.0.254 (HA)
> >
> > can I use LinuxHA for this solution?

You could, but if you only want to do NAT from inside out and port redirection from outside to internal servers, and want to have two different static routes, I do this with the OpenBSD pf firewall and carp. A failover takes only about a second, the connection states (tcp, udp, whatever) are synchronized between the two nodes, and if you want to use it as an IPsec VPN endpoint, IPsec flows and associations are synchronized too. So in case of a failover, nobody would notice a broken connection. LinuxHA would take much more time to fail over. If you need dynamic routing, OpenBSD comes with OpenBGPd and OpenOSPFd.

But LinuxHA should work for that too, with a somewhat slower failover and without the synchronized firewall and IPsec states.

kind regards
Sebastian
Re: [Linux-HA] Finding out which host is active...
Hi,

General Linux-HA mailing list wrote:
> Hi,
>
> I have a two node cluster failing an IP back and forth.
> Is there an easy way to determine which host is currently holding the ip
> address?

Just run the crm_mon tool.

kind regards
Sebastian
Re: [Linux-HA] HeartBeat doesn't see my process is down.
Hi,

General Linux-HA mailing list wrote:
>
> Hello,
>
> I configured Heartbeat 2.0.8-2.el4 to start my web server Apache (in
> haresources) on Linux when it starts.
> The heartbeat configuration runs well in case of hardware crashes but if
> the web server goes down only, Heartbeat doesn't see Apache is down and doesn't
> send a message to the second server to start its Apache.
> I check the script /etc/init.d/httpd status and it returns code 1.
> So I don't understand where is the problem ?

Do you have a monitor action defined for the resource?

http://www.linux-ha.org/ClusterInformationBase/Actions

Sebastian
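To interpret the exit code mentioned in the question, the sketch below maps the status codes defined for LSB init scripts to short descriptions. The function name is made up for illustration. The script reporting a dead service does not help by itself: something has to call it regularly, which is what a monitor action is for.

```sh
# Hypothetical helper: describe the exit code of an LSB init script
# "status" call, following the codes defined in the LSB specification.
lsb_status_meaning() {
    case "$1" in
        0) echo "running" ;;
        1) echo "dead, but pid file exists" ;;
        2) echo "dead, but lock file exists" ;;
        3) echo "not running" ;;
        *) echo "unknown" ;;
    esac
}

# Usage (commented out, needs the init script on the host):
#   /etc/init.d/httpd status >/dev/null 2>&1
#   lsb_status_meaning $?
```

A return code of 1, as reported in the question, would therefore mean the init script itself already considers Apache dead.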
Re: [Linux-HA] Help Getting Started
Hi,

> It appears that I already have a cib.xml in /var/lib/heartbeat/crm/ from
> the initial set up.
>
> ignore_dtd="false" ccm_transition="8" num_peers="2"
> cib_feature_revision="1.3" dc_uuid="b0bd581b-950c-4fa9-ad25-b1f288b03123"
> epoch="7" num_updates="47" cib-last-written="Thu Jul 19 19:49:34 2007">
>
> type="normal"/>
> type="normal"/>

This is the initial cib.xml, created from the information configured in /etc/ha.d/ha.cf, therefore only the nodes are in there.

> Presumably, if I use the GUI, it will add things to this default file?

Yes, exactly, it will add the resources, constraints... there.

kind regards
Sebastian
Re: [Linux-HA] Help Getting Started
Hi,

> So, can anyone tell me how I can proceed? I guess that the next step is

Do you have the GUI installed too? Then use that to create the first resources. This will create an initial cib.xml file in /var/lib/heartbeat/crm.

> to create a cib.xml file which specifies the virtual ip address? Also
> in the ha.cf sometimes it shows 'crm on' and other times 'crm yes'...
> which is correct?

Both are correct; true or 1 would mean the same.

Sebastian
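As a small illustration of the answer above, the sketch below checks whether an ha.cf read from stdin enables the CRM, treating on, yes, true and 1 as equivalent. The function name crm_enabled is hypothetical, the parsing is deliberately simple, and heartbeat's own parser may accept further spellings.

```sh
# Hypothetical helper: succeed if the ha.cf on stdin enables the CRM.
# Only the spellings named in the reply (on, yes, true, 1) count as
# enabled; case is ignored.
crm_enabled() {
    awk '
        $1 == "crm" { value = tolower($2) }   # the last crm line wins in this sketch
        END { exit !(value == "on" || value == "yes" || value == "true" || value == "1") }
    '
}

# Usage (commented out):
#   crm_enabled < /etc/ha.d/ha.cf && echo "CRM is enabled"
```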
Re: [Linux-HA] active active failover NFS server?
Hi,

> > Thanks a lot, there I found a link to an active-active NFS HA tutorial at
> > http://chilli.linuxmds.com/~mschilli/NFS/active-active-nfs.html,

After fiddling around with the files there, I got it working on the command line, running

/etc/ha.d/resource.d/nfs servername start

or

/etc/ha.d/resource.d/nfs servername stop

to mount and umount the filesystems and export them. When I configure the resource to be managed by heartbeat, the script is called with no parameters on start. The script seems to be written for an older version, 1.x, of heartbeat. Is it only in the wrong place, or is it no longer compatible and in need of more tweaking? I am using heartbeat-2.1.0 on opensuse 10.2.

kind regards
Sebastian
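One possible explanation, offered as an assumption since the thread does not resolve it: heartbeat 1.x style scripts receive their parameters on the command line, while OCF resource agents get only the action as the first argument and read their parameters from OCF_RESKEY_* environment variables. The sketch below shows that calling convention for an export/unexport pair. The parameter names directory and clientspec, the function names, and the DRY_RUN switch are invented for illustration.

```sh
# Sketch of an OCF-style entry point for exporting one directory.
# Parameter names (directory, clientspec) are hypothetical; DRY_RUN=1
# prints the exportfs commands instead of running them.
OCF_ERR_ARGS=2
OCF_ERR_UNIMPLEMENTED=3

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

nfs_export_agent() {
    action=$1
    dir=${OCF_RESKEY_directory:-}
    client=${OCF_RESKEY_clientspec:-*}      # default: export to all clients
    [ -n "$dir" ] || return $OCF_ERR_ARGS
    case "$action" in
        start) run exportfs -o rw "$client:$dir" ;;
        stop)  run exportfs -u "$client:$dir" ;;
        *)     return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}

# Usage (commented out):
#   OCF_RESKEY_directory=/export/home DRY_RUN=1 nfs_export_agent start
```

A complete agent would also need at least monitor and meta-data actions; they are left out to keep the calling convention visible.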
Re: [Linux-HA] how many resources does linux-ha can handle?
Hi,

> > > I don't want to add my resources as cib file here, because it is more than
> > > 20 pages printed out :)
> > Well you could attach the CIB (bzip2 is your friend). Without it no one can help
> > here. So we can see which resource failed and maybe where the problem in
> > your configuration is.
>
> I think the hint with the timings is good, and I'll first try to change the
> script. If that doesn't help, then I'll post it here.

Just for the record: I changed the IPaddr script to handle a group of IP addresses. That reduced the number of resources a lot. Additionally I had to add an order constraint for the IP resources to make it work smoothly.

kind regards
Sebastian
Re: [Linux-HA] heartbeat-gui documentation?
Hi,

> Or at least tell me what username & password to give it to logon? Using
> my server's root account doesn't seem to work

It is the password of the user that runs the heartbeat daemon, on Linux usually hacluster. You have to assign a password to that user in the system.

kind regards
Sebastian
[Linux-HA] alias IP addresses on OpenBSD
Hi,

I found this snippet in the IPaddr script:

find_interface_bsd() {
    #$IFCONFIG $IFCONFIG_A_OPT | grep "inet.*[: ]$OCF_RESKEY_ip "
    $IFCONFIG | grep "$ipaddr" -B20 | grep "UP," | tail -n 1 | cut -d ":" -f 1
}

#
# Find out which alias serves the given IP address
# The argument is an IP address, and its output
# is an aliased interface name (e.g., "eth0:0").
#

To see the aliases of an interface in the ifconfig output, you have to add the parameter -A to ifconfig. But even with -A, running the command

ifconfig -A | grep 213.239.221.55 -B20 | grep "UP," | tail -n 1 | cut -d ":" -f 1

still gives me:

fxp0

because there are no fxp0:0 or eth0:0 alias interfaces on OpenBSD. But when I look at the delete_interface function in the same script, it looks like it will work.

Here is an example output of ifconfig -A:

other interfaces ...
fxp0: flags=8843 mtu 1500
        lladdr 00:02:b3:88:e6:41
        groups: egress
        media: Ethernet autoselect (100baseTX full-duplex)
        status: active
        inet 21.39.21.41 netmask 0xffe0 broadcast 21.39.21.63
        inet6 fe80::202:b3ff:fe88:e641%fxp0 prefixlen 64 scopeid 0x1
        inet 21.39.21.55 netmask 0xff00 broadcast 21.39.21.255
other interfaces...

The line

inet 21.39.21.55 netmask 0xff00 broadcast 21.39.21.255

defines the alias IP address. This line does not show up without the -A parameter.

Well, I only ran the command manually on the command line, not from within linux-ha, but as far as I can see, the script would not find the ip address of the interface.

kind regards
Sebastian
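As an alternative to the grep -B20 pipeline quoted above, the lookup can be sketched with awk by remembering the most recent interface header line and printing it when an inet line carries exactly the wanted address. This is only a sketch, not the IPaddr script's code. The function name find_interface_for_ip is hypothetical, and the input is assumed to look like the ifconfig -A output shown in the message.

```sh
# Hypothetical helper: print the interface that carries the given IPv4
# address. Reads ifconfig-style output on stdin.
find_interface_for_ip() {
    awk -v ip="$1" '
        /^[^ \t]/ { iface = $1; sub(/:$/, "", iface) }   # header line, e.g. "fxp0: flags=..."
        $1 == "inet" && $2 == ip { print iface; exit }   # exact match on the address field
    '
}

# Usage (commented out, on OpenBSD -A is needed to list alias addresses):
#   ifconfig -A | find_interface_for_ip 21.39.21.55
```

Comparing the whole address field avoids the substring matches a plain grep can produce, for example 21.39.21.5 matching 21.39.21.55.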
Re: [Linux-HA] how many resources does linux-ha can handle?
Hi,

> From my experience I recommend not to use the GUI. It never did the job
> for me (especially not creating a configuration).

Yeah, the GUI is a bit...

> From your description I assume you have timing problems. Keep in mind that
> cluster node startups really generate a load on the HA system. Each resource
> is probed (basically it runs a 'monitor' operation on each resource on each
> cluster node).
>
> So if you have 2 nodes with 40 resources a node startup ---> 80 monitor actions
> initiated ---> 80 responses ---> 80 changes in the CIB --> 80 redistributions
> (not to mention the engine calculating your failover rules for all the resources).

That's a good hint, thanks.

> Did you write Resource Agents on your own? Or do you use only the standard
> HA RA?

I only use the standard resource agents in that cluster; only for stonith do I use self-written scripts that kill the other node via ssh to the iLO board.

> Are you using clones?

I have about 9 clone sets, and the same number of groups, each group containing 9 or 10 resources. Maybe I can try changing the IPaddr script so that I can give it a list of IP addresses and a list of devices. Then each group would consist of only 2 or fewer resources. As far as I can see, that could reduce the load on the server and maybe fix the timing problems.

> You see ... attaching the CIB as attachment would help ;-)
>
> > I don't want to add my resources as cib file here, because it is more than
> > 20 pages printed out :)
> Well you could attach the CIB (bzip2 is your friend). Without it no one can help
> here. So we can see which resource failed and maybe where the problem in
> your configuration is.

I think the hint about the timings is good, and I'll first try to change the script. If that doesn't help, then I'll post it here.

thanks
Sebastian
Re: [Linux-HA] active active failover NFS server?
Hi,

"Sebastian Reitenbach" <[EMAIL PROTECTED]> wrote:
> Hi,
>
> > > You can start here: http://linux-ha.org/HaNFS
>
> Thanks a lot, there I found a link to an active-active NFS HA tutorial at
> http://chilli.linuxmds.com/~mschilli/NFS/active-active-nfs.html,
> unfortunately I do not get an IP for the hostname, therefore I only found it
> in Google Cache. They use exportfs [-u] to add or remove mount points on the
> nfs servers.
> There is a HA-NFS.tar mentioned, to be downloaded from the same site,
> anybody has this somewhere else available?

Never mind, I got that file; the server was reachable again.

kind regards
Sebastian
[Linux-HA] how many resources does linux-ha can handle?
Hi list,

I am trying to set up a two node cluster with lots of cloned services: ldap, dns, squid, cups, tftp, and active-active nfs,... Each of the two nodes is a member of nine VLANs. For each service, a group of 9 virtual addresses is configured. Every resource is monitored, and in case it fails, the node should be fenced.

Up to about 40 or 50 resources, everything works as expected. Beyond that, when suspending or reactivating the cluster, some resources start to fail and the GUI becomes so unresponsive that I have to restart it. When I add more resources, everything gets wilder, to the point that when I suspend or rejoin a node via the GUI, the GUI freezes, and then crm_mon too is unable to connect to the cluster, on any node, so that heartbeat has to be restarted.

I don't want to add my resources as cib file here, because it is more than 20 pages printed out :)

How many resources can a linux-ha cluster manage? Would it help to tweak some timings, and if so, which ones? Or would it help to reduce the load if I e.g. change the IPaddr resource to manage a group of aliases for each VLAN?

Any experiences and hints appreciated.

kind regards
Sebastian
Re: [Linux-HA] active active failover NFS server?
Hi,
> You can start here: http://linux-ha.org/HaNFS
Thanks a lot, there I found a link to an active-active NFS HA tutorial at http://chilli.linuxmds.com/~mschilli/NFS/active-active-nfs.html, but unfortunately the hostname does not resolve to an IP for me, so I only found it in the Google cache. They use exportfs [-u] to add or remove exported directories on the NFS servers. A HA-NFS.tar is mentioned, to be downloaded from the same site; does anybody have it available somewhere else? kind regards
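As an illustration of the exportfs [-u] approach mentioned above, here is a minimal Python sketch that builds the two command lines. This is not the tutorial's script; the client spec, path, and export options are hypothetical examples, and the default is a dry run so nothing is changed on the machine.

```python
import subprocess


def exportfs_command(client, path, options="rw,sync", unexport=False):
    """Build the exportfs argument list for one client:/path pair."""
    target = f"{client}:{path}"
    if unexport:
        # exportfs -u removes one export from the active export table.
        return ["exportfs", "-u", target]
    # exportfs -o adds an export with the given options
    # without editing /etc/exports.
    return ["exportfs", "-o", options, target]


def apply(command, dry_run=True):
    """Return the command as a string; execute it only if dry_run is False."""
    if not dry_run:
        subprocess.run(command, check=True)  # needs root and nfs-utils
    return " ".join(command)


# Hypothetical client network and directory, shown as a dry run.
print(apply(exportfs_command("192.168.1.0/24", "/export/home")))
print(apply(exportfs_command("192.168.1.0/24", "/export/home", unexport=True)))
```

In a failover setup, the node taking over would run the first form after mounting the filesystem, and the node giving up the directory would run the second form before unmounting it.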
[Linux-HA] active active failover NFS server?
Hi list, I am going to build an active-active NFS server, where one node exports a public directory and the other the home directories. If one node fails, both directories should be exported by the remaining server. I have shared storage on a SAN, connected to both servers, and I use the Filesystem OCF script to mount/umount the partitions (ext3, because ocfs2 doesn't have ACLs and I could not get GFS2 to work). Therefore I cannot run an NFS server clone, because I cannot umount a partition while the NFS server is still running on it and the shared IP moves between the nodes. The only thing I see for managing the nfsserver is the LSB script, but with the LSB script only one NFS server can be started or stopped. So I have to configure two NFS resources using the LSB script, so that both can live on different servers. But when I manually tell one NFS resource to move to another server, both NFS resources become unavailable for a short time. I also saw problems when a dead node comes back into the cluster: again both NFS server resources were unavailable for a short time. Another option would be to write an OCF script (I haven't found one) to manage the nfsserver. In the manual page of rpc.mountd I have seen that it is possible to specify an exports file and the port. But I don't know what other problems I might run into, or whether it is possible at all to run two nfsservers in parallel. Does anybody have an idea? kind regards Sebastian
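One way to picture the OCF script idea above is an agent that leaves the NFS daemons running and only exports or unexports one directory, so that each directory can move between nodes on its own. The Python sketch below shows just the action dispatch and the OCF exit codes. It is a sketch under assumptions, not an existing agent: the parameter names `directory` and `clientspec`, the export options, and the use of /var/lib/nfs/etab to detect active exports are all assumptions, and a real agent would also need the `meta-data` and `validate-all` actions.

```python
import subprocess

# Exit codes defined by the OCF resource agent API.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_ERR_UNIMPLEMENTED = 3
OCF_ERR_CONFIGURED = 6
OCF_NOT_RUNNING = 7

ETAB = "/var/lib/nfs/etab"  # assumption: nfs-utils lists active exports here


def is_exported(etab_text, directory):
    """True if directory is the path field of any line in the etab text."""
    for line in etab_text.splitlines():
        fields = line.split()
        if fields and fields[0] == directory:
            return True
    return False


def run_action(action, directory, client, run, read_etab):
    """Dispatch one OCF action. run and read_etab are injected for testing."""
    if not directory:
        return OCF_ERR_CONFIGURED
    target = f"{client}:{directory}"
    exported = is_exported(read_etab(), directory)
    if action == "monitor":
        return OCF_SUCCESS if exported else OCF_NOT_RUNNING
    if action == "start":
        if exported:
            return OCF_SUCCESS  # starting a running resource must succeed
        ok = run(["exportfs", "-o", "rw,sync", target]) == 0
        return OCF_SUCCESS if ok else OCF_ERR_GENERIC
    if action == "stop":
        if not exported:
            return OCF_SUCCESS  # stopping a stopped resource must succeed
        ok = run(["exportfs", "-u", target]) == 0
        return OCF_SUCCESS if ok else OCF_ERR_GENERIC
    return OCF_ERR_UNIMPLEMENTED


def main(argv, env):
    """Entry point: action in argv[1], parameters in OCF_RESKEY_* variables."""
    def read_etab():
        try:
            with open(ETAB) as handle:
                return handle.read()
        except OSError:
            return ""  # no etab file: treat as nothing exported

    def run(command):
        return subprocess.run(command).returncode

    action = argv[1] if len(argv) > 1 else ""
    return run_action(action,
                      env.get("OCF_RESKEY_directory", ""),      # hypothetical name
                      env.get("OCF_RESKEY_clientspec", "*"),    # hypothetical name
                      run, read_etab)
```

An executable wrapper would end with `sys.exit(main(sys.argv, os.environ))`. With two such resources, each grouped with its Filesystem resource and its IP address, moving the home directories would not have to touch the public export. Whether the NFS daemons and clients cope well with that is exactly the open question in the post.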