Dear friends, contacts and colleagues, present and past,
The tragedy of the earthquake, tsunami and nuclear emergency in northern
Japan has had a particular impact on my family: Akiko, my wife, comes
from Fukushima, and her family, friends and relatives are all directly
affected. I have never r
> Last updated: Fri Dec 7 10:18:29 2007
> Current DC: NodeA (2a7021a1-ab44-403d-80a4-5ff9b4e24fcc)
> 2 Nodes configured.
> 1 Resources configured.
>
> Node: NodeA (2a7021a1-ab44-403d-80a4-5ff9b4e24fcc): standby
> Node: NodeB (296a344e-4ca5-4aae-be0b-7fc4473a7e05): online
>
> First question: Are you interested in my patch (just 2-3 lines)?
Most likely ;) Although I'm not completely sure how case is handled
elsewhere. We might be case-sensitive on purpose! (although I can't see
a good reason to do this for host names). Maybe send to the -dev list?
> Second: Any idea
FWIW, the "Exploring HASF" document does NOT use private containers. It
uses shared containers, so you will need to adjust some things, in
particular the type of CSM container (shared -> private), and use the
evms_failover RA instead of the evmsSCC RA.
Yan
Chris wrote:
> Hi Andrew,
>
Andrew Beekhof wrote:
>
> On Nov 21, 2007, at 10:11 AM, Christian Zemella wrote:
>
>> Hi All,
>> Anybody out there managed to have EVMS container resources
>> properly failing over in a 2 node Heartbeat 2 cluster running on SLES
>> 10 SP1 ?
>
> I believe so... have you read the documentat
Alan Robertson wrote:
> Yan Fitterer wrote:
>> Not always. The case I have encountered (live) doesn't relate to HB
>> component failure per se, but is nevertheless destructive.
>>
>> With an eDirectory load (and other database-backed software with large
>>
Dominik Klein wrote:
> I wrote my own little RA to start a custom binary. Very basic RA up to now.
>
> I start my binary with
> nohup $binfile $cmdline_options >> $logfile 2>> $errorlogfile &
>
> Works ok actually, the logfiles are filled as expected, but I also see
> some of the output in the Li
Can you provide some sample xml output for the problem situation(s)? Do
you have new entries with new unique IDs, or multiple IP attributes for
the one resource, or ???
More detail would help... including version of HB, xml samples, etc...
Yan
Szasz Tamas wrote:
> Hi,
>
> When I use the cibadmi
Andrew Beekhof wrote:
>
> On Nov 7, 2007, at 3:12 PM, Yan Fitterer wrote:
>
>> My 2c... Although my experience is rather limited, I have encountered
>> one real-life situation where ssh would not have worked. (split brain
>> created by putting firewall in "c
Andrew Beekhof wrote:
>
> On Nov 7, 2007, at 9:16 AM, matilda matilda wrote:
>
> Alan Robertson <[EMAIL PROTECTED]> 07.11.2007 05:51 >>>
Actually there are plenty of people using it today.
I'd much prefer they had a real device, but they are aware of the risks
and seem hap
Not always. The case I have encountered (live) doesn't relate to HB
component failure per se, but is nevertheless destructive.
With an eDirectory load (and other database-backed software with large
or lazily flushed write buffers would be similarly affected, IMHO), a
hard reset of a node has a hig
Welisson wrote:
> Hi all,
>
> I am with the same problem, in relation to heartbeat, as it follows below in
> the e-mail.
> I tested it by hand, I increased the value of deadtime, and nothing resolved it.
> I would like to know, if this could be some problem in relation to kernel,
> because I am usin
Why ?
i.e. why does the value have to be negative?
The "failover score" is calculated with abs(failure stickiness), so some
other calculation must rely on that value being negative, but the
details escape me.
Thanks
Yan
PS Apologies if this has been answered somewhere - I couldn't find it.
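For reference, these defaults live in the crm_config section of the CIB. A sketch of the relevant nvpair follows (the negative value shown is arbitrary, and the exact nesting varies a little between 2.0.x releases, so compare against your own `cibadmin -Q` output):

```xml
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <attributes>
      <!-- illustrative value only -->
      <nvpair id="opt-failure-stickiness"
              name="default-resource-failure-stickiness" value="-100"/>
    </attributes>
  </cluster_property_set>
</crm_config>
```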
>> there is no 'cib' process.
>
> actually there is :-)
Oops - Thanks for the correction Andrew!
>
>> If I understand things right, the crmd
>> process handles all core CIB maintenance operations.
>
> nope, all done by the CIB process
Makes sense
I'd like to reorder the primitives in that group:
Resource Group: web
resource_web_ip (heartbeat::ocf:IPaddr2)
resource_web_ip_27 (heartbeat::ocf:IPaddr2)
resource_web_fs_ww (heartbeat::ocf:Filesystem)
resource_web_fs_cache (heartbeat::ocf:Filesystem)
resource_web
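A sketch of what the reordered group element might look like before loading it back with cibadmin (instance attributes and operations are omitted for brevity, and the replace flag should be checked with `cibadmin -h` for your version):

```xml
<group id="web">
  <!-- primitives start in document order and stop in reverse -->
  <primitive id="resource_web_fs_ww"    class="ocf" provider="heartbeat" type="Filesystem"/>
  <primitive id="resource_web_fs_cache" class="ocf" provider="heartbeat" type="Filesystem"/>
  <primitive id="resource_web_ip"       class="ocf" provider="heartbeat" type="IPaddr2"/>
  <primitive id="resource_web_ip_27"    class="ocf" provider="heartbeat" type="IPaddr2"/>
</group>
```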
Junko IKEDA wrote:
once again something about SplitBrain...
During SplitBrain, I wrecked the resource on both nodes.
The fail count was increased at this time.
But after recovering from SplitBrain, the fail count returned to zero on
both!
Is this due to the restart of crmd or pengine/tengine?
Mo
Hi Johan,
Thanks for reporting this bug. Comments below:
It assumes that you use an ip-address for the n4u.server.interfaces
parameter.
Yes, with hindsight, that was an unfortunate assumption...
It only supports one single interface
You're correct here, I need to cater for that as well,
heartbeat-2.1.2-3.el5.centos
heartbeat-gui-2.1.2-3.el5.centos
mmm I don't use centos, but maybe somebody else on the list can confirm
if that GUI package is known to work?
There are still a number of things only possible with the command line
tools, so you will probably end up learning the
Whenever this happens, I can't get the heartbeat service to restart.
Something (I believe it was mgmtd last time) won't quit without a
`kill -9`. I find the easiest solution is to force a reboot and let it
start over. When it comes back up, it functions properly again.
HB version? Platform? Source
DRBD operates below the filesystem (Distributed Replicated Block
Device), effectively replicating each "disk" block between two hosts.
So, yes, AFAIK, the filesystems will be truly identical (assuming
up-to-date sync, of course...)
Yan
Stefan Lasiewski wrote:
>> I don't know this Replicator, does
NFS failover requires inodes to be kept in sync between the two servers.
I.e., both filesystems _must_ be _IDENTICAL_ down to the very last bit...
See:
http://lists.community.tummy.com/pipermail/linux-ha/2007-April/024652.html
and other messages in that thread, where the issue was discussed.
S
> to see that even if just one node remains, my services keep running anyway.
>
> Sander
>
>
>> -Oorspronkelijk bericht-
>> Van: [EMAIL PROTECTED] [mailto:linux-ha-
>> [EMAIL PROTECTED] Namens Yan Fitterer
>> Verzonden: woensdag 27 juni 2007 14:43
STONITH.
Second node fails: 3rd node takes over resources, but only after
verified power off (or restart) of 2nd node.
Actually - same thing for 1st node.
Challenge: Ensure that you don't lose network AND stonith at the same time.
Yan
Sander van Vugt wrote:
> Hi list,
>
>
>
> Trying to thi
> All the other apps where referenced by their full /path/to/executable.
> Why wasn't FSCK? Secondly, without a path shouldn't it have looked in
> $PATH or is $PATH blank for the user that Filesystem runs as? If the
> latter is the case, why would 'fsck' work then? Is this worthy of a bug
> being
I don't *think* this is the cause of your failed start, but if you want
to be sure, change contents of /etc/HOSTNAME (where the hostname is
recorded) to lowercase, and restart!
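A minimal sketch of the lowercase conversion (the node name here is a placeholder; on a real system you would read and write /etc/HOSTNAME as root):

```shell
#!/bin/sh
# Sketch: lowercase a node name with tr. "NodeA" is a placeholder.
NODE="NodeA"
LOWER=$(printf '%s' "$NODE" | tr 'A-Z' 'a-z')
echo "$LOWER"   # prints "nodea"
# then, as root:
#   echo "$LOWER" > /etc/HOSTNAME   # and restart heartbeat
```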
Sander van Vugt wrote:
> Hi List,
>
> I have a strange problem, would appreciate help.
>
> My two node names are both
You may want to use pingd:
http://linux-ha.org/pingd
I'm assuming you have redundant paths between your nodes, of course. If
not, time to upgrade your setup.
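pingd is usually run from ha.cf. A sketch of the relevant fragment follows (the ping target address is hypothetical, and the pingd flags should be checked against your version); a location rule on the pingd attribute then keeps resources on nodes that still have connectivity:

```
# ha.cf fragment (10.0.0.254 is an illustrative ping target)
ping 10.0.0.254
respawn hacluster /usr/lib/heartbeat/pingd -m 100 -d 5s
```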
Dominik Klein wrote:
Hi
I would like to manage xen domUs on drbddisk backend. This works fine
for now.
Now Heartbeat should swi
I know various people on the list were waiting for this, so in case news
didn't make it out to everybody...
http://download.novell.com/Download?buildid=2FNtOnmkx-w~
At the moment, the link is in the "new releases" on download.novell.com
Yan
There is a slight catch with this, in so far as using the --meta switch
updates the attribute in question in a different node in the cib xml.
BUT, I think (at least for some versions, the issue may be fixed in the
latest code) that there are cases where one can end up with the same
attribute in bo
Personally, my preferred method is to make all resources unmanaged (is_managed
= false), then stop hb on all nodes, install upgrade, restart heartbeat, leave
to settle, then put resources back to managed state.
I think it's possible to do a rolling upgrade, one node at a time, but I prefer
to d
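The "unmanage everything first" step above can be scripted. This sketch only prints the crm_resource calls (the resource ids are placeholders), so the list can be reviewed before actually running anything:

```shell
#!/bin/sh
# Sketch: emit the commands that would set is_managed=false on each
# resource before an upgrade. Resource ids here are placeholders.
for res in web_ip web_fs; do
    echo "crm_resource --meta -p is_managed -r $res -v false"
done
```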
After the reboot, is the new node name resolvable by all nodes?
(/etc/hosts, or DNS) ?
As well, you'll need to update ha.cf with the new node name...
Yan
Jaime Medrano wrote:
> Hi.
>
> I'm having problems when a one of the nodes changes its hostname (uname
> -n).
>
> In first place, if the hos
I'm not quite up-to-date on the IBM cards, but I've heard that the
latest models have dropped the clustering support (i.e. the self-fencing
functionality).
Just wanted to pass on the warning - but DO check more closely, I may be
wrong in your case. (the 6M doesn't sound recent...)
Yan
George H w
Where do the heartbeat 2.0.7 binaries come from? What packaging? What
distribution?
It's starting to sound (to me at least ;) like some broken heartbeat
package.
Yan
PS - it is not normal for the CIB to be rewritten every few seconds. It
should be rewritten when something in the cluster state
None of the existing RAs seem to deal with routes (apart from ensuring
the correct network routes are inserted for the VIPs).
So the correct way to implement this functionality would be to write a
new OCF RA to manipulate routes, and add a new resource with it on top
of the IP resource.
Yan
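One way to sketch such a route RA is below. The parameter names (OCF_RESKEY_dest, OCF_RESKEY_gw) and default values are invented for illustration; exit codes follow the OCF convention (0 = running, 7 = not running, 3 = unimplemented):

```shell
#!/bin/sh
# Sketch of an OCF-style resource agent managing a static route.
# Parameter names and defaults below are hypothetical.
: "${OCF_RESKEY_dest:=192.168.100.0/24}"
: "${OCF_RESKEY_gw:=10.0.0.1}"

route_start() { ip route replace "$OCF_RESKEY_dest" via "$OCF_RESKEY_gw"; }

route_stop() {
    # Removing an absent route is not an error for "stop"
    ip route del "$OCF_RESKEY_dest" 2>/dev/null
    return 0
}

route_monitor() {
    # OCF convention: 0 = running, 7 = not running
    if ip route show "$OCF_RESKEY_dest" 2>/dev/null | grep -q "$OCF_RESKEY_gw"; then
        return 0
    else
        return 7
    fi
}

dispatch() {
    case "${1:-}" in
        start)   route_start ;;
        stop)    route_stop ;;
        monitor) route_monitor ;;
        *)       echo "usage: $0 {start|stop|monitor}"; return 3 ;;
    esac
}

# A real RA would finish with:  dispatch "$1"; exit $?
```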
a matilda wrote:
> Hi Yan,
>
> PLEASE: Where is SLES 10 SP1. I'm waiting for that, but can't find it.
>
> Best regards
> Andreas Mock
>
>
>>>> Yan Fitterer <[EMAIL PROTECTED]> 04.06.2007 15:23 >>>
> I _think_ that 2.0.7 had a bu
Actually, just seen that the SP1 hasn't _quite_ made it out the door
yet. Any time soon, though! Sorry for the confusion.
Yan Fitterer wrote:
> I _think_ that 2.0.7 had a bug on that that was fixed later (2.0.8?).
> Novell has just released SP1, with a vastly improved Heartbeat pac
I _think_ that 2.0.7 had a bug on that that was fixed later (2.0.8?).
Novell has just released SP1, with a vastly improved Heartbeat package.
Try reproducing the issue with the upgraded version.
matilda matilda wrote:
> Hi all,
>
> after reading almost all stuff on linux-ha.org, digging around in
sorry - typo below. not ndsdstat, but netstat -ntlp
Yan Fitterer wrote:
> ha.cf + output of ifconfig + output of ndsdstat -ntlp from both nodes
> please?
>
> Following:
> http://linux-ha.org/ReportingProblems
> when reporting problems is a "Good Idea (tm)"...
>
I think you misunderstand how X works. The "X server" actually runs on
your Windows workstation. With xming, you run the xming software on your
Windows workstation, then launch hb_gui on the Linux server, but the
graphical display appears on your Windows desktop.
You do NOT need the Linux box to b
ha.cf + output of ifconfig + output of ndsdstat -ntlp from both nodes
please?
Following:
http://linux-ha.org/ReportingProblems
when reporting problems is a "Good Idea (tm)"...
Nick Peterson wrote:
> Hi there,
>
> I've got heartbeat 2.0.8 on two servers connected with public IPs.
> When I start H
ess the issue of logs or overly brief problem
> descriptions but would be better than nothing
>
> On 6/1/07, Yan Fitterer <[EMAIL PROTECTED]> wrote:
>> Andrew, this is such a common issue (people not giving us version...),
>> is there any way we could include the hb ve
Andrew, this is such a common issue (people not giving us version...),
is there any way we could include the hb version in the cib? We already
have "cib_feature_revision", but maybe we should have
"heartbeat_version" as well?
Feedback anyone?
Yan
Andrew Beekhof wrote:
logs? version?
come o
man crm_failcount
fail count will not be reset automatically, has to be done manually. As
well, crm_verify and (if any "failed start" or "failed op" appear)
crm_resource -C
It is not possible for the cluster to "forget" about failure counts, as
otherwise a failing resource could be forever bounci
>>> On 19/05/2007 at 07:16, in message
<[EMAIL PROTECTED]>, "Jiann-Ming Su"
<[EMAIL PROTECTED]> wrote:
> On 5/16/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>
>> For a resource named group_1
>> crm_resource --meta -p target_role -r group_1 -v started
>>
>
> That didn't work... 2.0.7 doesn't
crm_standby
>>> On Fri, May 11, 2007 at 12:10 AM, in message
<[EMAIL PROTECTED]>, "Dan Gahlinger"
<[EMAIL PROTECTED]> wrote:
> Yeah,
> on the primary try "heartbeat stop" (in most cases /etc/init.d/heartbeat
> stop)
>
> that'll do it.
>
> Dan.
>
> On 5/10/07, Howard Yuan <[EMAIL PROTECTED]> wr
Kay,
Sorry I couldn't answer this earlier...
When I run the pe input file on my system here (2.0.7-1.2), the ptest utility
moves all resources to the second node,
as expected. None are left on the first, AFAICS. See output attached. So either
your version of HB comes to a different
conclusion -
>>> On Thu, May 10, 2007 at 3:38 PM, in message
<[EMAIL PROTECTED]>, "Andrew Beekhof"
<[EMAIL PROTECTED]> wrote:
> On 5/10/07, Yan Fitterer <[EMAIL PROTECTED]> wrote:
>> Last I heard, reload was an optional OCF action that was not quite fully
>
>
>
> Then I set
> default-resource-stickiness = "100"
> default-resource-failure-stickiness = "0"
>
> Now the resources stay on dl360g3-1 when i start up dl360g3-2, but when I
> make
> the a LSB resource fail on dl360g3-1, Heartbeat suddenly move
Last I heard, reload was an optional OCF action that was not quite fully
implemented in Heartbeat (someone please correct me here if need be!)
The benefit would be that configuration changes could be picked up without
having to interrupt a service. Not sure how it's supposed to be invoked from
I do not believe this is possible. 1 resource == 1 monitor AFAIK.
What you _could_ do, is build the cleverness in the OCF RA to do two types of
checks, and mix them on a temporal basis where appropriate (i.e. have a single
RA run at 10s intervals, but only do (and report on) some of the checks
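The counter logic for mixing the two check depths can be sketched as below. The counter file path and the two check functions are placeholders (here they just print which level ran; a real RA would return OCF exit codes):

```shell
#!/bin/sh
# Sketch: a single monitor action that runs a cheap check most of the
# time and a deeper check every 6th invocation.
COUNT_FILE="${TMPDIR:-/tmp}/myra_monitor_count"

cheap_check() { echo cheap; }   # e.g. just verify the process exists
deep_check()  { echo deep; }    # e.g. run a real request against the service

monitor() {
    n=$(cat "$COUNT_FILE" 2>/dev/null || echo 0)
    n=$((n + 1))
    echo "$n" > "$COUNT_FILE"
    if [ $((n % 6)) -eq 0 ]; then
        deep_check
    else
        cheap_check
    fi
}
```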
up the verbosity.
>
> Does this have anything to do with the "stickiness stuff"?? I have
> default-resource-stickiness = "100"
> default-resource-failure-stickiness = "-INFINITY"
Both those will play a role (in particular the -failure- one (in this case, it
Dave Dykstra wrote:
> On Mon, May 07, 2007 at 11:15:49AM +0200, Dejan Muhamedagic wrote:
>> On Fri, May 04, 2007 at 02:01:24PM -0500, Dave Dykstra wrote:
> ...
>>> And of course /etc/init.d scripts don't support a 'status' parameter.
>>> You really need to write your own /etc/ha.d/resources scrip
Haven't looked at too much detail (lots of resources / constraints in
your cib...), but I would approach the problem differently:
Make groups out of related IP / filesystem / service stacks.
Then use the colocation constraints between services (across groups) to
force things to move together (if
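As a sketch, a colocation constraint between two such groups might look like this in a 2.0.x CIB (the group ids are hypothetical; note that the attribute names changed across releases, some versions use from/to and later ones rsc/with_rsc, so check against your DTD):

```xml
<constraints>
  <rsc_colocation id="web_with_db" from="group_web" to="group_db" score="INFINITY"/>
</constraints>
```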
Martijn Grendelman wrote:
> Alan Robertson schreef:
>> Martijn Grendelman wrote:
>>> Hi,
>>>
>>> I am trying to build a 2-node cluster serving DRBD+NFS, among other
>>> things. It has been operational on Debian Sarge, with Heartbeat 1.2, but
>>> recently, both machines were upgraded to Debian Etc
The xml below works for me in a 4-node cluster (you need one clone
resource per node - fragment shows clone for first node):
Jure Pečar wrote:
> Hi all,
>
> I'm trying to
One reason I like to order the resources, is that when one has more
resources than can be displayed by crm_mon in a screenful (at least
whilst remaining readable...), it is useful to be able to put things
like STONITH resources at the bottom.
AFAIK the easiest way to achieve resource ordering is to d
What return code should an OCF RA return on a monitor operation when the
service is "running but broken" (for ex. process present, but services not
available)?
If the RA returns OCF_NOT_RUNNING, then will hb do a "stop" before any "start"
when going from "unmanaged" to "managed" for example?
W
If you want to put the node in standby (this will affect all resources
running on that node), check the crm_standby command.
If you want to affect specific resource(s), you will need to use the
crm_resource command.
After putting a node in standby, you need to take it off standby.
Similarly, after
Jose Jerez wrote:
Hello,
I'm trying to set up a cluster system with two machines and a shared
storage (all SLES10 & Heartbeat 2.0.7)
For the shared storage there is an ISCSI target available to both
machines and the management of this common device is done through
EVMS. So far I managed to se
re building.
Not sure about that. I still have about 18 resources that should be treated
symmetrically, and only 1 that doesn't.
In that case, which should win?
>
>
> On 4/20/07, Yan Fitterer <[EMAIL PROTECTED]> wrote:
>> >>> On Fri, Apr 20, 2007 a
>>> On Fri, Apr 20, 2007 at 3:42 PM, in message
<[EMAIL PROTECTED]>, "Andrew Beekhof"
<[EMAIL PROTECTED]> wrote:
> On 4/20/07, Yan Fitterer <[EMAIL PROTECTED]> wrote:
>> >> In the attached pe-warn, why is resource R_audit being started on
>> In the attached pe-warn, why is resource R_audit being started on
>> idm01 when there is an INFINITY constraint with uname eq idm04?
>>
>> BTW - idm04 is in standby at the moment. That should hardly matter. I
>> expect the resource to be "cannot run anywhere".
>>
>> I really hope it's not a
In the attached pe-warn, why is resource R_audit being started on idm01 when
there is an INFINITY constraint with uname eq idm04?
BTW - idm04 is in standby at the moment. That should hardly matter. I expect
the resource to be "cannot run anywhere".
I really hope it's not a typo, but I have read
SUSE-built 2.0.8 packages will be out with SP1 soon, but in the
meantime, if you _really_ can't wait...
http://ftp.suse.com/pub/people/lmb/heartbeat/v2.0/
Those should not have dependency issues...
Yan
Thomas Åkerblom (HF/EBC) wrote:
> Hi.
> I'm running heartbeat 2.0.7 on SLES 10.
> When I try t
Oh - and I forgot... UI was _much_ nicer last time I looked. And nowhere
near as buggy as the HB GUI.
Yan Fitterer wrote:
> NCS has better integration with EVMS, and has data-network heartbeat. It
> does not therefore require STONITH.
>
> It has had much more testing than HB for la
NCS has better integration with EVMS, and has data-network heartbeat. It
does not therefore require STONITH.
It has had much more testing than HB for large clusters as well. 20+
node clusters are not uncommon.
Yan
Sander van Vugt wrote:
> Hi,
>
> Just like to know your opinion about the followi
I can't see this working. AFAIK, heartbeat 2.x does not support the
protocols of the 1.x series.
It sounds like you'll have to setup your 2.x system as a new cluster,
then put together a good transition process.
Yan
Patrick Begou wrote:
> I am migrating my HA cluster. At this step I have:
>
> 1
Manual manipulation of cib through /var filesystem is explicitly
discouraged.
Use the cibadmin tool. Heartbeat will synchronize the cib between all
nodes automatically for you.
Yan
PS:
man cibadmin
cibadmin -h
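For example, a new resource can be described in a standalone XML file and loaded with something like `cibadmin -C -o resources -x new-ip.xml` (check `cibadmin -h` for the exact flags in your version); heartbeat then replicates it to all nodes. The resource below is purely illustrative:

```xml
<primitive id="resource_ip_1" class="ocf" provider="heartbeat" type="IPaddr2">
  <instance_attributes id="resource_ip_1_attrs">
    <attributes>
      <nvpair id="resource_ip_1_ip" name="ip" value="192.168.1.10"/>
    </attributes>
  </instance_attributes>
</primitive>
```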
Bernd Schubert wrote:
> Hi,
>
> I think there's a race condition on initializing an
Bernhard Limbach wrote:
> Hello,
>
> For 2.0.8 unidirectional colocations were announced.
>
> Could anybody give me a hint how to configure those ? Is the idea to
> configure a non-infinity score or is there another way ?
AFAIK, nothing special in configuration. They simply imply that for:
A -
>> The UI interface is sometimes not an option because:
>> - the cluster runs on linux without installed X
>
> running x11 on a cluster is not a requirement. running x11 on the
> sysadmin's workstation is:
>
> display/x11/keyboard/mouse/user <== haclient.py <== TCP/IP ==> cluster
>
>> - ther
We rely on the nfs server's init script as an RA. I don't think we
include a heartbeat-specific NFS Resource Agent...
If the Slackware nfs server init script has the correct (for Heartbeat)
behavior, you can use that as an RA.
See:
http://linux-ha.org/LSBResourceAgent
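If the init script is LSB-compliant, it can be referenced directly as a class="lsb" primitive. A sketch follows (the script name "nfs" is an assumption; use whatever actually lives in /etc/init.d on your system):

```xml
<primitive id="resource_nfs" class="lsb" type="nfs">
  <operations>
    <op id="resource_nfs_mon" name="monitor" interval="30s" timeout="30s"/>
  </operations>
</primitive>
```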
Alex Orlov wrote:
> HI!
>
You could try acpi=off and noapic on the kernel boot line.
I've seen all sorts of weirdness where ACPI and apic are involved.
At least, that's simple and easy to test :)
Yan
Dejan Muhamedagic wrote:
> On Tue, Mar 27, 2007 at 09:31:16AM +0200, Patrick Begou wrote:
>> Dejan Muhamedagic wrote:
>>
Peter Clapham wrote:
Max Hofer wrote:
On Tuesday 20 March 2007 15:46, Alan Robertson wrote:
Max Hofer wrote:
I have following questions about STONITH:
* can IPMI (ipmitool/openipmi) used for STONITH? (IPMI:
http://en.wikipedia.org/wiki/IPMI most Dell Servers have such
a device in