Re: [Linux-HA] BadThingsHappen with v2.0.5.

2007-04-19 Thread Alan Robertson
Andrew Beekhof wrote:
> On 4/19/07, Peter Kruse <[EMAIL PROTECTED]> wrote:
>> Andrew Beekhof wrote:
>> > then i'm afraid your use of the "dont fence nodes on startup" option
>> > has come back to haunt you
>> >
>> > beosrv-c-1 came up but was not able to find beosrv-c-2 (even though it
>> > _was_ running) and because of that option beosrv-c-1 just pretended
>> > beosrv-c-2 wasn't running and happily started activating resources.
>> >
>> > remember how we said that option wasn't a good idea :-)
>>
>> Hm, I don't understand, beosrv-c-2 fenced beosrv-c-1 in order
>> to take over.  Now you say, that as soon as beosrv-c-1 came back
>> up again, it should fence beosrv-c-2, because it "thought" it
>> was not there, but it was there?  How can this happen?
> 
> usually an enduring communications failure (be it physical or in our
> software) but i'm no expert regarding the membership and
> communications layers
> 
> But I see a lot of messages like:
> Apr 19 09:49:47 beosrv-c-1 heartbeat: [4495]: WARN: Rexmit of seq
> 3553687 requested. 141 is max.
> 
> so _something_ isn't right.
> 
> probably worthy of a bug report.

There have been some bugs in this code in the last year or so.  I've
forgotten what they were, unfortunately.

A hint is the string "ERROR:".  We don't use that word lightly.  If you
get an ERROR: from one of our pieces of code, the chances are 99% that
it shouldn't _ever_ happen.  Getting it hundreds of times like you did
is a really bad sign.

Apr 19 09:48:27 beosrv-c-2 heartbeat: [10763]: ERROR: Message hist queue
is filling up (200 messages in queue)
Apr 19 09:48:27 beosrv-c-2 heartbeat: [10763]: ERROR: Message hist queue
is filling up (200 messages in queue)

What this message normally means is that you have a half-duplex
communication failure.  That is, one node can transmit but not receive,
or vice versa...
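
Because an ERROR: line is supposed to be rare, a quick tally by host, daemon, and severity shows how far from normal a log is. The following is a minimal Python sketch, assuming each entry sits on a single line in the shape seen in the excerpts above (the archive has wrapped them). The name `tally_levels` and the regular expression are illustrative only and are not part of heartbeat.

```python
import re
from collections import Counter

# Assumed line shape: "Apr 19 09:48:27 host daemon: [pid]: LEVEL: text"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+ \d\d:\d\d:\d\d (\S+) (\S+): \[\d+\]: (\w+): "
)

def tally_levels(lines):
    """Count log lines per (host, daemon, level); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[m.groups()] += 1
    return counts

# Entries taken from the excerpts in this thread, unwrapped to one line each.
sample = [
    "Apr 19 09:48:27 beosrv-c-2 heartbeat: [10763]: ERROR: Message hist queue is filling up (200 messages in queue)",
    "Apr 19 09:48:31 beosrv-c-2 heartbeat: [10763]: WARN: 1 lost packet(s) for [beosrv-c-1] [17:19]",
    "Apr 19 09:48:32 beosrv-c-2 heartbeat: [10763]: ERROR: Message hist queue is filling up (200 messages in queue)",
]
counts = tally_levels(sample)
for key, n in sorted(counts.items()):
    print(key, n)
```

Run against the full syslog (for example `tally_levels(open(path))`), any non-zero ERROR count for heartbeat is worth a closer look.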

Are both systems version 2.0.5?  [I'm guessing not]

Is there a chance that you installed a 2.0.5 pre-release?  Because there
was a bug fix which went in just as 2.0.5 was coming out.

And this fix: http://hg.linux-ha.org/dev/rev/6b8bdf5332c3 which could
have affected you.  How long was this node down?  It looks to me like
either it had been down a very long time, or a very short time.

Which is it?

If it was a very short time, then we have fixed the problem I believe...


This particular sequence of messages is interesting...
Apr 19 09:48:31 beosrv-c-2 heartbeat: [10763]: WARN: 1 lost packet(s)
for [beosrv-c-1] [17:19]
Apr 19 09:48:31 beosrv-c-2 cib: [10790]: info:
mask(callbacks.c:cib_client_status_callback): Status update: Client
beosrv-c-1/cib now has status [join]
Apr 19 09:48:32 beosrv-c-2 heartbeat: [10763]: info: No pkts missing
from beosrv-c-1!
Apr 19 09:48:32 beosrv-c-2 heartbeat: [10763]: ERROR: Message hist queue
is filling up (200 messages in queue)

Here is what these messages mean:

We received message 17 and 19 from beosrv-c-1.  We didn't receive
message 18 from beosrv-c-1.

The code would then ask beosrv-c-1 to retransmit packet 18.

The CIB received a message from the CIB on beosrv-c-1, indicating that
the CIB process on beosrv-c-1 is now running.

Beosrv-c-1 retransmitted packet 18.

We received packet 18, and now no packets are missing.
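
The walk-through above amounts to gap detection on sequence numbers. Here is a minimal sketch of that idea in Python. It is not heartbeat's code; `find_gaps` is a hypothetical helper, and it assumes the receiver simply remembers which sequence numbers it has seen.

```python
def find_gaps(seen):
    """Return the sequence numbers missing between min(seen) and max(seen)."""
    if not seen:
        return []
    return [s for s in range(min(seen), max(seen) + 1) if s not in seen]

seen = {17, 19}                 # packets 17 and 19 arrived, 18 did not
to_request = find_gaps(seen)    # these would be asked for again
print("request retransmit of:", to_request)

seen.add(18)                    # the retransmitted packet arrives
print("still missing:", find_gaps(seen))
```

The first call corresponds to the "1 lost packet(s)" warning, the second to "No pkts missing".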

The "Message hist queue is filling up" message means we have sent 200
packets without receiving a flow-control ack from someone.  If there
are only two nodes, that would mean beosrv-c-2.

HOWEVER, we can definitely send and receive packets to and from both
machines as witnessed by the "lost packet" followed by the "No pkts
missing" sequence.  This cannot have happened if we had a half-duplex
comm failure.
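
The bookkeeping behind the "Message hist queue is filling up" message can be pictured as a bounded history of sent-but-unacknowledged packets. The sketch below is only a model of that description, assuming cumulative acks and a limit of 200 as in the log line; the class `HistQueue` and its methods are hypothetical names, not heartbeat's implementation.

```python
from collections import deque

class HistQueue:
    """Model of a sender's history of packets not yet acknowledged."""

    def __init__(self, limit=200):
        self.limit = limit
        self.pending = deque()   # sequence numbers sent but not yet acked
        self.next_seq = 1

    def send(self):
        """Record one outgoing packet and return its sequence number."""
        seq = self.next_seq
        self.next_seq += 1
        self.pending.append(seq)
        return seq

    def ack(self, upto):
        """Cumulative ack: drop every pending packet numbered <= upto."""
        while self.pending and self.pending[0] <= upto:
            self.pending.popleft()

    def filling_up(self):
        """True once the number of unacked packets reaches the limit."""
        return len(self.pending) >= self.limit

q = HistQueue()
for _ in range(200):
    q.send()
print(q.filling_up())   # no ack has arrived, so the history is at its limit
q.ack(150)
print(len(q.pending), q.filling_up())
```

In this model the condition only clears when acks arrive, which is why the message normally points at one-way communication.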

I know we fixed a couple of bugs in this area, but I'm not sure when the
last one was fixed.  I looked at bugzilla, and if a bugzilla had been
made for every fix, then I don't see an obvious fix which was made after
2.0.5.





-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] BadThingsHappen with v2.0.5.

2007-04-19 Thread Alan Robertson
Peter Kruse wrote:
> Andrew Beekhof wrote:
>> then i'm afraid your use of the "dont fence nodes on startup" option
>> has come back to haunt you
>>
>> beosrv-c-1 came up but was not able to find beosrv-c-2 (even though it
>> _was_ running) and because of that option beosrv-c-1 just pretended
>> beosrv-c-2 wasn't running and happily started activating resources.
>>
>> remember how we said that option wasn't a good idea :-)
> 
> Hm, I don't understand, beosrv-c-2 fenced beosrv-c-1 in order
> to take over.  Now you say, that as soon as beosrv-c-1 came back
> up again, it should fence beosrv-c-2, because it "thought" it
> was not there, but it was there?  How can this happen?

It's a bug :-D.


-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-HA] BadThingsHappen with v2.0.5.

2007-04-19 Thread Andrew Beekhof

On 4/19/07, Peter Kruse <[EMAIL PROTECTED]> wrote:
> Andrew Beekhof wrote:
> > then i'm afraid your use of the "dont fence nodes on startup" option
> > has come back to haunt you
> >
> > beosrv-c-1 came up but was not able to find beosrv-c-2 (even though it
> > _was_ running) and because of that option beosrv-c-1 just pretended
> > beosrv-c-2 wasn't running and happily started activating resources.
> >
> > remember how we said that option wasn't a good idea :-)
>
> Hm, I don't understand, beosrv-c-2 fenced beosrv-c-1 in order
> to take over.  Now you say, that as soon as beosrv-c-1 came back
> up again, it should fence beosrv-c-2, because it "thought" it
> was not there, but it was there?  How can this happen?

usually an enduring communications failure (be it physical or in our
software) but i'm no expert regarding the membership and
communications layers

But I see a lot of messages like:
Apr 19 09:49:47 beosrv-c-1 heartbeat: [4495]: WARN: Rexmit of seq
3553687 requested. 141 is max.

so _something_ isn't right.

probably worthy of a bug report.

> Peter



Re: [Linux-HA] BadThingsHappen with v2.0.5.

2007-04-19 Thread Peter Kruse

Andrew Beekhof wrote:
> then i'm afraid your use of the "dont fence nodes on startup" option
> has come back to haunt you
>
> beosrv-c-1 came up but was not able to find beosrv-c-2 (even though it
> _was_ running) and because of that option beosrv-c-1 just pretended
> beosrv-c-2 wasn't running and happily started activating resources.
>
> remember how we said that option wasn't a good idea :-)


Hm, I don't understand, beosrv-c-2 fenced beosrv-c-1 in order
to take over.  Now you say, that as soon as beosrv-c-1 came back
up again, it should fence beosrv-c-2, because it "thought" it
was not there, but it was there?  How can this happen?

Peter



Re: [Linux-HA] BadThingsHappen with v2.0.5.

2007-04-19 Thread Andrew Beekhof

On 4/19/07, Peter Kruse <[EMAIL PROTECTED]> wrote:
> Hi Andrew!
>
> Andrew Beekhof wrote:
> > beosrv-c-2 is the failed node right?
>
> it was beosrv-c-1 that failed, beosrv-c-2 took over.

then i'm afraid your use of the "dont fence nodes on startup" option
has come back to haunt you

beosrv-c-1 came up but was not able to find beosrv-c-2 (even though it
_was_ running) and because of that option beosrv-c-1 just pretended
beosrv-c-2 wasn't running and happily started activating resources.

remember how we said that option wasn't a good idea :-)

> > do you have logs from there too?
>
> attached (messages about Gmain_timeout removed, there were too many
> of them)
>
> The problem now is that cibadmin -m reports:
>
> CIB on localhost _is_ the master instance
>
> on both nodes.
>
> Thanks for your time,
>
> Peter




Re: [Linux-HA] BadThingsHappen with v2.0.5.

2007-04-19 Thread Peter Kruse

Hi Andrew!

Andrew Beekhof wrote:
> beosrv-c-2 is the failed node right?

it was beosrv-c-1 that failed, beosrv-c-2 took over.

> do you have logs from there too?


attached (messages about Gmain_timeout removed, there were too many
of them)

The problem now is that cibadmin -m reports:

CIB on localhost _is_ the master instance

on both nodes.
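
Two nodes each reporting themselves as master is the signature of a split cluster. As a small illustration, the check can be automated once the `cibadmin -m` output from every node has been collected. This is a sketch under assumptions: how the output is gathered (for example over ssh) is left out, only the string quoted above is matched, and the names `MASTER_MARKER`, `nodes_claiming_master`, and `is_split` are hypothetical.

```python
# Only the phrase quoted in the report above is matched; the wording of any
# other cibadmin -m output is not assumed here.
MASTER_MARKER = "_is_ the master instance"

def nodes_claiming_master(outputs):
    """Return the sorted names of nodes whose output claims master."""
    return sorted(node for node, text in outputs.items() if MASTER_MARKER in text)

def is_split(outputs):
    """True when more than one node claims to hold the master CIB."""
    return len(nodes_claiming_master(outputs)) > 1

reported = {
    "beosrv-c-1": "CIB on localhost _is_ the master instance",
    "beosrv-c-2": "CIB on localhost _is_ the master instance",
}
print(nodes_claiming_master(reported), is_split(reported))
```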

Thanks for your time,

Peter



heartbeatlog2.gz
Description: GNU Zip compressed data

Re: [Linux-HA] BadThingsHappen with v2.0.5.

2007-04-19 Thread Andrew Beekhof

On 4/19/07, Peter Kruse <[EMAIL PROTECTED]> wrote:
> Hello,
>
> Thanks for reading this.  As this is with the ancient v2.0.5, please tell me
> that this problem cannot happen with a recent version of heartbeat.
>
> Problem description:
> Yesterday in one of our two-node HA clusters a successful takeover
> happened, where the failed node was reset, so far so good.
> After I started heartbeat again on the failed node, it tried
> to take over the resources, although they were running
> on the other node (BAD!).

beosrv-c-2 is the failed node right?

do you have logs from there too?

> Ok, I detected an error in the setup, /var/lib/heartbeat/pengine
> was not writable by hacluster, causing this error message:
>
> pengine: [5580]: ERROR: Cannot write to
> /var/lib/heartbeat/pengine/pe-input-0.bz2: Permission denied
>
> Now my question:
>
> Is this error responsible for the faulty behavior of heartbeat?

no.  those files are purely for debugging problems such as the one
you're reporting

> Will this error also trigger the faulty behavior in a recent version of
> heartbeat?  (Please tell me that it won't).
> You may argue that a wrong configuration can cause all sorts
> of error behavior but I don't think that heartbeat should have
> ignored this error and continued to start the resource.
>
> Thanks for reading this far,
>
> Peter
>
> syslog attached.



[Linux-HA] BadThingsHappen with v2.0.5.

2007-04-19 Thread Peter Kruse

Hello,

Thanks for reading this.  As this is with the ancient v2.0.5, please tell me
that this problem cannot happen with a recent version of heartbeat.

Problem description:
Yesterday in one of our two-node HA clusters a successful takeover
happened, where the failed node was reset, so far so good.
After I started heartbeat again on the failed node, it tried
to take over the resources, although they were running
on the other node (BAD!).
Ok, I detected an error in the setup, /var/lib/heartbeat/pengine
was not writable by hacluster, causing this error message:

pengine: [5580]: ERROR: Cannot write to 
/var/lib/heartbeat/pengine/pe-input-0.bz2: Permission denied
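
A setup error of this kind can be caught before starting heartbeat by checking the directory's ownership and mode bits. The following is a rough Python sketch, assuming plain Unix permissions only (no ACLs, no special handling of root); `writable_by` is a hypothetical helper, and looking up the `hacluster` user is shown only in a comment.

```python
import os
import stat

def writable_by(path, uid, gids=()):
    """Judge from mode bits alone whether uid (in groups gids) may create
    files in the directory at path. Needs both write and search permission."""
    st = os.stat(path)
    if st.st_uid == uid:
        need = stat.S_IWUSR | stat.S_IXUSR
    elif st.st_gid in gids:
        need = stat.S_IWGRP | stat.S_IXGRP
    else:
        need = stat.S_IWOTH | stat.S_IXOTH
    return (st.st_mode & need) == need

# Possible use on a cluster node (not executed here):
#   import pwd
#   user = pwd.getpwnam("hacluster")
#   groups = os.getgrouplist("hacluster", user.pw_gid)
#   print(writable_by("/var/lib/heartbeat/pengine", user.pw_uid, groups))
```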


Now my question:

Is this error responsible for the faulty behavior of heartbeat?
Will this error also trigger the faulty behavior in a recent version of
heartbeat?  (Please tell me that it won't).
You may argue that a wrong configuration can cause all sorts
of error behavior but I don't think that heartbeat should have
ignored this error and continued to start the resource.

Thanks for reading this far,

Peter


syslog attached.


heartbeatlog.gz
Description: GNU Zip compressed data