Re: [Linux-HA] BadThingsHappen with v2.0.5.
Andrew Beekhof wrote:
> On 4/19/07, Peter Kruse <[EMAIL PROTECTED]> wrote:
>> Andrew Beekhof wrote:
>> > then i'm afraid your use of the "dont fence nodes on startup" option
>> > has come back to haunt you
>> >
>> > beosrv-c-1 came up but was not able to find beosrv-c-2 (even though it
>> > _was_ running) and because of that option beosrv-c-1 just pretended
>> > beosrv-c-2 wasn't running and happily started activating resources.
>> >
>> > remember how we said that option wasn't a good idea :-)
>>
>> Hm, I don't understand, beosrv-c-2 fenced beosrv-c-1 in order
>> to take over. Now you say, that as soon as beosrv-c-1 came back
>> up again, it should fence beosrv-c-2, because it "thought" it
>> was not there, but it was there? How can this happen?
>
> usually an enduring communications failure (be it physical or in our
> software) but i'm no expert regarding the membership and
> communications layers
>
> But I see a lot of messages like:
> Apr 19 09:49:47 beosrv-c-1 heartbeat: [4495]: WARN: Rexmit of seq
> 3553687 requested. 141 is max.
>
> so _something_ isn't right.
>
> probably worthy of a bug report.

There have been some bugs in this code in the last year or so. I've
forgotten what they were, unfortunately.

A hint is the string "ERROR:". We don't use that word lightly. If you
get an ERROR: from one of our pieces of code, the chances are 99% that
it shouldn't _ever_ happen. Getting it hundreds of times like you did
is a really bad sign.

Apr 19 09:48:27 beosrv-c-2 heartbeat: [10763]: ERROR: Message hist queue is filling up (200 messages in queue)
Apr 19 09:48:27 beosrv-c-2 heartbeat: [10763]: ERROR: Message hist queue is filling up (200 messages in queue)

What this message normally means is that you have a half-duplex
communication failure. That is, one node can transmit but not receive,
or vice versa...

Are both systems version 2.0.5? [I'm guessing not]

Is there a chance that you installed a 2.0.5 pre-release? There was a
bug fix which went in just as 2.0.5 was coming out. There is also this
fix, which could have affected you:

http://hg.linux-ha.org/dev/rev/6b8bdf5332c3

How long was this node down? It looks to me like either it had been
down a very long time, or a very short time. Which is it? If it was a
very short time, then we have fixed the problem I believe...

This particular sequence of messages is interesting...

Apr 19 09:48:31 beosrv-c-2 heartbeat: [10763]: WARN: 1 lost packet(s) for [beosrv-c-1] [17:19]
Apr 19 09:48:31 beosrv-c-2 cib: [10790]: info: mask(callbacks.c:cib_client_status_callback): Status update: Client beosrv-c-1/cib now has status [join]
Apr 19 09:48:32 beosrv-c-2 heartbeat: [10763]: info: No pkts missing from beosrv-c-1!
Apr 19 09:48:32 beosrv-c-2 heartbeat: [10763]: ERROR: Message hist queue is filling up (200 messages in queue)

Here is what these messages mean:

We received messages 17 and 19 from beosrv-c-1.
We didn't receive message 18 from beosrv-c-1.
The code would then ask for the packet to be retransmitted from
beosrv-c-1.

The CIB received a message from the CIB on beosrv-c-1, indicating that
the CIB process on beosrv-c-1 is now running.

Beosrv-c-1 retransmitted packet 18.
We received packet 18, and now no packets are missing.

The "Message hist queue is filling up" message means we have sent 200
packets without receiving a flow-control ack from someone. If there
are only two nodes, that would mean beosrv-c-2.

HOWEVER, we can definitely send and receive packets to and from both
machines as witnessed by the "lost packet" followed by the "No pkts
missing" sequence. This cannot have happened if we had a half-duplex
comm failure.

I know we fixed a couple of bugs in this area, but I'm not sure when
the last one was fixed. I looked at bugzilla, and if a bugzilla had
been made for every fix, then I don't see an obvious fix which was
made after 2.0.5.

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
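The packet sequence described above (17 and 19 arrive, 18 is requested again, 18 arrives, nothing is missing) and the history queue that fills up when acks stop coming can be sketched roughly as follows. This is only an illustration of the idea, not heartbeat's actual code: the class names `Receiver` and `Sender`, the cumulative-ack model, and the use of 200 as the queue limit (taken from the log text) are all assumptions.

```python
from collections import deque

# Assumption: 200 mirrors the "200 messages in queue" figure in the log.
MAX_HIST = 200


class Receiver:
    """Hypothetical receiver that tracks sequence numbers from one peer."""

    def __init__(self):
        self.highest = None   # highest sequence number seen so far
        self.missing = set()  # sequence numbers we are still waiting for

    def receive(self, seq):
        """Record a packet; return the sequence numbers to ask for again."""
        if self.highest is None:
            self.highest = seq
            return []
        if seq in self.missing:
            # A retransmitted packet filled a gap.
            self.missing.discard(seq)
            return []
        if seq <= self.highest:
            return []  # duplicate, nothing to do
        gaps = list(range(self.highest + 1, seq))
        self.missing.update(gaps)
        self.highest = seq
        return gaps


class Sender:
    """Hypothetical sender that keeps unacknowledged packets for rexmit."""

    def __init__(self, max_hist=MAX_HIST):
        self.hist = deque()   # sequence numbers sent but not yet acked
        self.max_hist = max_hist
        self.next_seq = 1

    def send(self):
        seq = self.next_seq
        self.next_seq += 1
        self.hist.append(seq)
        return seq

    def ack(self, upto):
        """Peer confirms everything up to and including `upto`."""
        while self.hist and self.hist[0] <= upto:
            self.hist.popleft()

    def queue_is_full(self):
        return len(self.hist) >= self.max_hist


if __name__ == "__main__":
    rx = Receiver()
    rx.receive(17)
    print("ask for:", rx.receive(19))  # 18 is missing
    rx.receive(18)
    print("still missing:", rx.missing)
```

In this model a sender whose peer never acks reaches `queue_is_full()` after 200 sends even though data packets flow in both directions, which is the combination the logs appear to show.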
Re: [Linux-HA] BadThingsHappen with v2.0.5.
Peter Kruse wrote:
> Andrew Beekhof wrote:
>> then i'm afraid your use of the "dont fence nodes on startup" option
>> has come back to haunt you
>>
>> beosrv-c-1 came up but was not able to find beosrv-c-2 (even though it
>> _was_ running) and because of that option beosrv-c-1 just pretended
>> beosrv-c-2 wasn't running and happily started activating resources.
>>
>> remember how we said that option wasn't a good idea :-)
>
> Hm, I don't understand, beosrv-c-2 fenced beosrv-c-1 in order
> to take over. Now you say, that as soon as beosrv-c-1 came back
> up again, it should fence beosrv-c-2, because it "thought" it
> was not there, but it was there? How can this happen?

It's a bug :-D.

-- 
Alan Robertson <[EMAIL PROTECTED]>
Re: [Linux-HA] BadThingsHappen with v2.0.5.
On 4/19/07, Peter Kruse <[EMAIL PROTECTED]> wrote:
> Andrew Beekhof wrote:
> > then i'm afraid your use of the "dont fence nodes on startup" option
> > has come back to haunt you
> >
> > beosrv-c-1 came up but was not able to find beosrv-c-2 (even though it
> > _was_ running) and because of that option beosrv-c-1 just pretended
> > beosrv-c-2 wasn't running and happily started activating resources.
> >
> > remember how we said that option wasn't a good idea :-)
>
> Hm, I don't understand, beosrv-c-2 fenced beosrv-c-1 in order
> to take over. Now you say, that as soon as beosrv-c-1 came back
> up again, it should fence beosrv-c-2, because it "thought" it
> was not there, but it was there? How can this happen?

usually an enduring communications failure (be it physical or in our
software) but i'm no expert regarding the membership and
communications layers

But I see a lot of messages like:
Apr 19 09:49:47 beosrv-c-1 heartbeat: [4495]: WARN: Rexmit of seq 3553687 requested. 141 is max.

so _something_ isn't right.

probably worthy of a bug report.

> Peter
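Log lines like the ones quoted in this thread can be scanned for the two symptoms discussed: how often "ERROR:" shows up per host, and retransmit requests whose sequence number is above the stated maximum. The sketch below assumes the syslog layout seen in the quoted lines (timestamp, host, process, pid, level, message). The function name `summarize` and both regular expressions are illustrative, not part of heartbeat.

```python
import re
from collections import Counter

# Assumed layout, based only on the log lines quoted in this thread:
#   "Apr 19 09:49:47 beosrv-c-1 heartbeat: [4495]: WARN: <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) "
    r"(?P<proc>[\w-]+): \[(?P<pid>\d+)\]: (?P<level>[A-Za-z]+): (?P<msg>.*)$"
)
REXMIT_RE = re.compile(r"Rexmit of seq (\d+) requested\. (\d+) is max\.")


def summarize(lines):
    """Return (ERROR count per host, suspicious rexmit requests)."""
    errors = Counter()
    suspicious = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not in the assumed layout, skip it
        if match["level"] == "ERROR":
            errors[match["host"]] += 1
        rexmit = REXMIT_RE.search(match["msg"])
        if rexmit:
            requested, maximum = int(rexmit[1]), int(rexmit[2])
            if requested > maximum:
                suspicious.append((match["host"], requested, maximum))
    return errors, suspicious


if __name__ == "__main__":
    sample = [
        "Apr 19 09:49:47 beosrv-c-1 heartbeat: [4495]: WARN: "
        "Rexmit of seq 3553687 requested. 141 is max.",
        "Apr 19 09:48:27 beosrv-c-2 heartbeat: [10763]: ERROR: "
        "Message hist queue is filling up (200 messages in queue)",
    ]
    print(summarize(sample))
```

Feeding it the decompressed syslog from both nodes would give a quick count to attach to a bug report.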
Re: [Linux-HA] BadThingsHappen with v2.0.5.
Andrew Beekhof wrote:
> then i'm afraid your use of the "dont fence nodes on startup" option
> has come back to haunt you
>
> beosrv-c-1 came up but was not able to find beosrv-c-2 (even though it
> _was_ running) and because of that option beosrv-c-1 just pretended
> beosrv-c-2 wasn't running and happily started activating resources.
>
> remember how we said that option wasn't a good idea :-)

Hm, I don't understand, beosrv-c-2 fenced beosrv-c-1 in order
to take over. Now you say, that as soon as beosrv-c-1 came back
up again, it should fence beosrv-c-2, because it "thought" it
was not there, but it was there? How can this happen?

Peter
Re: [Linux-HA] BadThingsHappen with v2.0.5.
On 4/19/07, Peter Kruse <[EMAIL PROTECTED]> wrote:
> Hi Andrew!
>
> Andrew Beekhof wrote:
> > beosrv-c-2 is the failed node right?
>
> it was beosrv-c-1 that failed, beosrv-c-2 took over.

then i'm afraid your use of the "dont fence nodes on startup" option
has come back to haunt you

beosrv-c-1 came up but was not able to find beosrv-c-2 (even though it
_was_ running) and because of that option beosrv-c-1 just pretended
beosrv-c-2 wasn't running and happily started activating resources.

remember how we said that option wasn't a good idea :-)

> > do you have logs from there too?
>
> attached (messages about Gmain_timeout removed, there were
> too many of them)
>
> The problem now is that cibadmin -m reports:
>
> CIB on localhost _is_ the master instance
>
> on both nodes.
>
> Thanks for your time,
>
> Peter
Re: [Linux-HA] BadThingsHappen with v2.0.5.
Hi Andrew!

Andrew Beekhof wrote:
> beosrv-c-2 is the failed node right?

it was beosrv-c-1 that failed, beosrv-c-2 took over.

> do you have logs from there too?

attached (messages about Gmain_timeout removed, there were
too many of them)

The problem now is that cibadmin -m reports:

CIB on localhost _is_ the master instance

on both nodes.

Thanks for your time,

Peter

heartbeatlog2.gz
Description: GNU Zip compressed data
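The symptom reported above, `cibadmin -m` saying "_is_ the master instance" on both nodes, can be checked mechanically once the output from each node has been collected by hand. The sketch below only compares strings. It does not run `cibadmin` itself, and the helper names are made up for illustration. The only wording it relies on is the positive message quoted in this thread; the "not master" text in the example is a hypothetical placeholder.

```python
# Wording taken from the report in this thread; anything that does not
# contain it is treated as "not master".
MASTER_MARKER = "_is_ the master instance"


def nodes_claiming_master(outputs):
    """outputs maps node name -> text printed by `cibadmin -m` on that node."""
    return sorted(node for node, text in outputs.items() if MASTER_MARKER in text)


def more_than_one_master(outputs):
    """True when several nodes each believe they hold the master CIB."""
    return len(nodes_claiming_master(outputs)) > 1


if __name__ == "__main__":
    collected = {
        "beosrv-c-1": "CIB on localhost _is_ the master instance",
        "beosrv-c-2": "CIB on localhost _is_ the master instance",
    }
    if more_than_one_master(collected):
        print("both claim master:", nodes_claiming_master(collected))
```

Two masters at once suggests the nodes are not seeing each other as members of one cluster, which matches the communication trouble discussed elsewhere in the thread.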
Re: [Linux-HA] BadThingsHappen with v2.0.5.
On 4/19/07, Peter Kruse <[EMAIL PROTECTED]> wrote:
> Hello,
>
> thanks for reading this, as it's with ancient v2.0.5., please
> tell me that this problem cannot happen with a recent version
> of heartbeat.
>
> Problem description: yesterday in one of our two-node HA clusters
> a successful takeover happened, where the failed node was
> reset, so far so good. After I started heartbeat again
> on the failed node, it tried to take over the resources, although
> they were running on the other node (BAD!).

beosrv-c-2 is the failed node right?

do you have logs from there too?

> Ok, I detected an error in the setup, /var/lib/heartbeat/pengine
> was not writable by hacluster, causing this error message:
>
> pengine: [5580]: ERROR: Cannot write to /var/lib/heartbeat/pengine/pe-input-0.bz2: Permission denied
>
> Now my question: Is this error responsible for the faulty
> behavior of heartbeat?

no. those files are purely for debugging problems such as the one
you're reporting

> Will this error also trigger the faulty behavior in a recent version
> of heartbeat? (Please tell me that it won't). You may
> argue that a wrong configuration can cause all sorts of error
> behavior but I don't think that heartbeat should have ignored
> this error and continued to start the resource.
>
> Thanks for reading this far,
>
> Peter
>
> syslog attached.
[Linux-HA] BadThingsHappen with v2.0.5.
Hello,

thanks for reading this, as it's with ancient v2.0.5., please
tell me that this problem cannot happen with a recent version
of heartbeat.

Problem description: yesterday in one of our two-node HA clusters
a successful takeover happened, where the failed node was
reset, so far so good. After I started heartbeat again
on the failed node, it tried to take over the resources, although
they were running on the other node (BAD!).

Ok, I detected an error in the setup, /var/lib/heartbeat/pengine
was not writable by hacluster, causing this error message:

pengine: [5580]: ERROR: Cannot write to /var/lib/heartbeat/pengine/pe-input-0.bz2: Permission denied

Now my question: Is this error responsible for the faulty
behavior of heartbeat?
Will this error also trigger the faulty behavior in a recent version
of heartbeat? (Please tell me that it won't). You may
argue that a wrong configuration can cause all sorts of error
behavior but I don't think that heartbeat should have ignored
this error and continued to start the resource.

Thanks for reading this far,

Peter

syslog attached.

heartbeatlog.gz
Description: GNU Zip compressed data
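The setup error mentioned above (the pengine directory not being writable by hacluster) can be checked ahead of time with a few lines of Python. This is a minimal sketch under stated assumptions: it looks only at the classic owner/group/other write bit, ignores ACLs and the directory search bit, and the helper names are invented for this example. The path and user name come from the error message in the report.

```python
import grp
import os
import pwd
import stat


def mode_allows_write(st_mode, st_uid, st_gid, uid, gids):
    """Classic Unix write check on the mode bits only (no ACLs)."""
    if uid == 0:
        return True  # root is not restricted by the mode bits
    if uid == st_uid:
        return bool(st_mode & stat.S_IWUSR)
    if st_gid in gids:
        return bool(st_mode & stat.S_IWGRP)
    return bool(st_mode & stat.S_IWOTH)


def user_can_write(path, username):
    """Return True if the write bit applying to `username` is set on `path`."""
    user = pwd.getpwnam(username)          # raises KeyError if unknown
    gids = {user.pw_gid}
    gids.update(g.gr_gid for g in grp.getgrall() if username in g.gr_mem)
    st = os.stat(path)                     # raises OSError if inaccessible
    return mode_allows_write(st.st_mode, st.st_uid, st.st_gid, user.pw_uid, gids)


if __name__ == "__main__":
    path = "/var/lib/heartbeat/pengine"    # directory named in the error
    try:
        ok = user_can_write(path, "hacluster")
        print(f"{path}: {'writable' if ok else 'NOT writable'} by hacluster")
    except (KeyError, OSError) as exc:
        print(f"cannot check: {exc}")
```

As stated in the reply to this message, the pe-input files are only used for debugging, so a failed check here explains the ERROR line but not the takeover behavior.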