Re: [Linux-ha-dev] ERROR: Message hist queue is filling up
Lars Marowsky-Bree wrote:
> On 2006-04-21T12:37:27, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> > > None of this was happening when I built heartbeat two days ago
> > > (2006-04-19 16:08).
> >
> > I recently got rid of a bunch of really horrible scripts that used to
> > test basic CRM sanity... now we leverage CTS in local-only mode.
> > Which is good because it's picking up new problems.
> >
> > What the error is about I don't know.
>
> Alan mentioned it to me on the phone yesterday. MSGHIST is filling up
> because obviously it never ever hears from the other node (which is
> eternally dead).
>
> New bug :-(

In particular, this only happens before we start declaring nodes dead.
So it probably doesn't have much effect on real systems running the
current version.

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me claim
from you at all times your undisguised opinions." - William Wilberforce

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] ERROR: Message hist queue is filling up
Andrew Beekhof wrote:
> On Apr 21, 2006, at 11:53 AM, Dejan Muhamedagic wrote:
> > Hello,
> >
> > In the latest HEAD code now I get a bunch of messages:
> >
> > heartbeat[9464]: 2006/04/21_11:26:38 ERROR: Message hist queue is
> > filling up (200 messages in queue)
> >
> > The test fails:
> >
> > Apr 21 11:28:57 sapcl03 CTS: No failure count but success != requested iterations
> > CRM tests failed (rc=1).
>
> This is in BasicSanityCheck then?
>
> > And there is an exception raised in CTStests.py.
>
> The exception is because there were too many of those "hist queue is
> filling up" errors
>
> > None of this was happening when I built heartbeat two days ago
> > (2006-04-19 16:08).
>
> I recently got rid of a bunch of really horrible scripts that used to
> test basic CRM sanity... now we leverage CTS in local-only mode.
> Which is good because it's picking up new problems.
>
> What the error is about I don't know.

This is what I was working on yesterday. It was a little more stubborn
than I wanted. Gshi reviewed my patch with me last night - but the
problem that BSC is picking up is because of some decisions I made to
make BSC as aggressive as I know how, and another decision to shrink
the size of heartbeat's memory footprint a good bit.

I'll commit the patch this morning (my time).

-- 
Alan Robertson <[EMAIL PROTECTED]>
Re: [Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: crm by andrew from
On Apr 21, 2006, at 12:51 PM, Lars Marowsky-Bree wrote:
> On 2006-04-21T12:32:26, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> > > Why is that a warning? That's perfectly normal behaviour if several
> > > nodes are equal - share some common attribute etc.
> >
> > the extreme case is if both nodes have a score of INFINITY when
> > you're trying to migrate a resource
>
> Hm, I think the extreme case is the only one where there's a problem,
> as it can't run on both nodes, while infinity implies it MUST run
> there. Anything below infinity isn't a problem.
>
> So I'd invert the logic to not complain for == 0, but only for
> == INFINITY...

it only complains for non-zero at the moment, but I can make it even
stricter

-- 
Andrew Beekhof

"I'd find myself if I knew where myself left me" - MGF
Re: [Linux-ha-dev] ERROR: Message hist queue is filling up
On 2006-04-21T12:37:27, Andrew Beekhof <[EMAIL PROTECTED]> wrote:

> > None of this was happening when I built heartbeat two days ago
> > (2006-04-19 16:08).
>
> I recently got rid of a bunch of really horrible scripts that used to
> test basic CRM sanity... now we leverage CTS in local-only mode.
> Which is good because it's picking up new problems.
>
> What the error is about I don't know.

Alan mentioned it to me on the phone yesterday. MSGHIST is filling up
because obviously it never ever hears from the other node (which is
eternally dead).

New bug :-(

Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"
Re: [Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: crm by andrew from
On 2006-04-21T12:32:26, Andrew Beekhof <[EMAIL PROTECTED]> wrote:

> > Why is that a warning? That's perfectly normal behaviour if several
> > nodes are equal - share some common attribute etc.
>
> the extreme case is if both nodes have a score of INFINITY when
> you're trying to migrate a resource

Hm, I think the extreme case is the only one where there's a problem,
as it can't run on both nodes, while infinity implies it MUST run
there. Anything below infinity isn't a problem.

So I'd invert the logic to not complain for == 0, but only for
== INFINITY...

Sincerely,
    Lars Marowsky-Brée
Re: [Linux-ha-dev] ERROR: Message hist queue is filling up
On Apr 21, 2006, at 11:53 AM, Dejan Muhamedagic wrote:
> Hello,
>
> In the latest HEAD code now I get a bunch of messages:
>
> heartbeat[9464]: 2006/04/21_11:26:38 ERROR: Message hist queue is
> filling up (200 messages in queue)
>
> The test fails:
>
> Apr 21 11:28:57 sapcl03 CTS: No failure count but success != requested iterations
> CRM tests failed (rc=1).

This is in BasicSanityCheck then?

> And there is an exception raised in CTStests.py.

The exception is because there were too many of those "hist queue is
filling up" errors

> None of this was happening when I built heartbeat two days ago
> (2006-04-19 16:08).

I recently got rid of a bunch of really horrible scripts that used to
test basic CRM sanity... now we leverage CTS in local-only mode.
Which is good because it's picking up new problems.

What the error is about I don't know.

-- 
Andrew Beekhof

"No means no, and no means yes, and everything in between and all the
rest" - TISM
Re: [Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: crm by andrew from
On Apr 21, 2006, at 12:15 PM, Lars Marowsky-Bree wrote:
> On 2006-04-21T03:18:27, linux-ha-cvs@lists.linux-ha.org wrote:
> > linux-ha CVS committal
> >
> > Author : andrew
> > Host:
> > Project : linux-ha
> > Module : crm
> > Dir : linux-ha/crm/pengine
> >
> > Modified Files:
> > 	stages.c utils.c pe_utils.h
> >
> > Log Message:
> > Log a warning if more than one node has the highest score for
> > running a given resource.
>
> Why is that a warning? That's perfectly normal behaviour if several
> nodes are equal - share some common attribute etc.

the extreme case is if both nodes have a score of INFINITY when
you're trying to migrate a resource

but I'm happy to lower it to LOG_INFO

> And, isn't that the default for _all_ nodes in a symmetric cluster?

it also checks for non-zero, so this case won't be complained about.

-- 
Andrew Beekhof

"Ooo Ahhh, Glenn McRath" - TISM
[Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: crm by andrew from
On 2006-04-21T03:18:27, linux-ha-cvs@lists.linux-ha.org wrote:

> linux-ha CVS committal
>
> Author : andrew
> Host:
> Project : linux-ha
> Module : crm
>
> Dir : linux-ha/crm/pengine
>
> Modified Files:
> 	stages.c utils.c pe_utils.h
>
> Log Message:
>
> Log a warning if more than one node has the highest score for running a
> given resource.

Why is that a warning? That's perfectly normal behaviour if several
nodes are equal - share some common attribute etc.

And, isn't that the default for _all_ nodes in a symmetric cluster?

Sincerely,
    Lars Marowsky-Brée
Re: [Linux-ha-dev] core dump (abort) in crmd: "untracked process" (HEAD)
On Apr 20, 2006, at 3:58 PM, Dejan Muhamedagic wrote:
> Hi,
>
> #4  0x0805add0 in build_operation_update (xml_rsc=0x8109430,
>     op=0x8282b98, src=0x80692d9 "do_update_resource", lpc=0)
>     at lrm.c:347
> (gdb) print *0x8282b98
> $1 = 136448640
>
> If you want I can send you the core off list. I keep all the cores :)

That's OK - I think I understand the problem enough now. And since
you're running against HEAD, if you update you'll get the fixes :)

> Cheers, Dejan
>
> On Thu, Apr 20, 2006 at 09:15:28AM +0200, Andrew Beekhof wrote:
> > On 4/20/06, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > Hello,
> > >
> > > Running CTS with HEAD hung the cluster after crmd dumped core
> > > (abort). It happened after 53 tests with this curious message:
> > >
> > > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR:
> > > mask(lrm.c:build_operation_update): Triggered non-fatal assert at
> > > lrm.c:349: fsa_our_dc_version != NULL
> >
> > We have two kinds of asserts... neither are supposed to happen and
> > both create a core file so that we can diagnose how we got there.
> > However non-fatal ones call fork first (so the main process doesn't
> > die) and then take some recovery action.
> >
> > Sometimes the non-fatal varieties are used in new pieces of code to
> > make sure they work as we expect and that is what has happened here.
> >
> > Do you still have the core file? I'd be interested to know the
> > result of:
> >
> >   print *op
> >
> > from frame #4
> >
> > In the meantime, I'll look at the logs and see what I can figure out.
> >
> > > Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]: ERROR:
> > > Exiting untracked process process 19654 dumped core
> > > Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]: ERROR:
> > > mask(utils.c:crm_timer_popped): Finalization Timer (I_ELECTION) just popped!
> > >
> > > The cluster looks like this, unchanged for several hours:
> > >
> > > Last updated: Thu Apr 20 04:43:47 2006
> > > Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
> > > 3 Nodes configured.
> > > 3 Resources configured.
> > >
> > > Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
> > > Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
> > > Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online
> > >
> > > Resource Group: group_1
> > >     IPaddr_1     (heartbeat::ocf:IPaddr):     Started sapcl03
> > >     LVM_2        (heartbeat::ocf:LVM):        Stopped
> > >     Filesystem_3 (heartbeat::ocf:Filesystem): Stopped
> > > Resource Group: group_2
> > >     IPaddr_2     (heartbeat::ocf:IPaddr):     Started sapcl02
> > >     LVM_3        (heartbeat::ocf:LVM):        Started sapcl02
> > >     Filesystem_4 (heartbeat::ocf:Filesystem): Started sapcl02
> > > Resource Group: group_3
> > >     IPaddr_3     (heartbeat::ocf:IPaddr):     Started sapcl03
> > >     LVM_4        (heartbeat::ocf:LVM):        Started sapcl03
> > >     Filesystem_5 (heartbeat::ocf:Filesystem): Started sapcl03
> > >
> > > And:
> > >
> > > sapcl01# crmadmin -S sapcl01
> > > Status of [EMAIL PROTECTED]: S_TERMINATE (ok)
> > >
> > > All processes are still running on this node, but heartbeat seems
> > > to be in some kind of limbo.
> > >
> > > Cheers, Dejan

-- 
Andrew Beekhof

"Ooo Ahhh, Glenn McRath" - TISM