Re: [Linux-ha-dev] ERROR: Message hist queue is filling up

2006-04-21 Thread Alan Robertson

Lars Marowsky-Bree wrote:

On 2006-04-21T12:37:27, Andrew Beekhof <[EMAIL PROTECTED]> wrote:


None of this was happening when I built heartbeat two days ago
(2006-04-19 16:08).
I recently got rid of a bunch of really horrible scripts that used to  
test basic CRM sanity... now we leverage CTS in local-only mode.

Which is good because it's picking up new problems.

What the error is about, I don't know.


Alan mentioned it to me on the phone yesterday.

MSGHIST is filling up because obviously it never ever hears from the
other node (which is eternally dead). New bug :-(


In particular, this only happens before we start declaring nodes dead, so 
it probably doesn't have much effect on real systems running the current version.
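
For context: heartbeat keeps a copy of every message it sends to a peer until that 
peer acknowledges it, and it is this retransmit history that the error is counting. 
The fragment below is only a rough sketch of that mechanism, not the actual heartbeat 
code - MAXMSGHIST, hist_queue_t and the function names are invented for illustration. 
With a peer that never answers, nothing ever drains the queue, so each send pushes it 
closer to the threshold behind the ERROR above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXMSGHIST 200                /* threshold behind the ERROR above */

typedef struct hist_queue_s {
    char *msgs[MAXMSGHIST];           /* copies kept for retransmission   */
    int   count;                      /* messages still awaiting an ack   */
} hist_queue_t;

/* Keep a copy of every outgoing message until the peer acks it. */
static void hist_queue_add(hist_queue_t *q, const char *msg)
{
    if (q->count >= MAXMSGHIST) {
        /* With a peer that is "eternally dead" no ack ever arrives,
         * so every send ends up here once the queue is full. */
        fprintf(stderr, "ERROR: Message hist queue is filling up"
                        " (%d messages in queue)\n", q->count);
        return;
    }
    q->msgs[q->count++] = strdup(msg);
}

/* Acks from the peer are what normally drain the queue again. */
static void hist_queue_ack(hist_queue_t *q, int acked)
{
    int i;

    if (acked > q->count) {
        acked = q->count;
    }
    for (i = 0; i < acked; i++) {
        free(q->msgs[i]);
    }
    memmove(q->msgs, q->msgs + acked, (q->count - acked) * sizeof(char *));
    q->count -= acked;
}

int main(void)
{
    hist_queue_t q = { { NULL }, 0 };
    int i;

    /* Simulate a node whose peer never acknowledges anything. */
    for (i = 0; i < MAXMSGHIST + 3; i++) {
        hist_queue_add(&q, "status");
    }
    hist_queue_ack(&q, q.count);      /* acks would drain it again */
    return 0;
}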



--
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] ERROR: Message hist queue is filling up

2006-04-21 Thread Alan Robertson

Andrew Beekhof wrote:


On Apr 21, 2006, at 11:53 AM, Dejan Muhamedagic wrote:


Hello,

In the latest HEAD code now I get a bunch of messages:

heartbeat[9464]: 2006/04/21_11:26:38 ERROR: Message hist queue is  
filling up (200 messages in queue)


The test fails:

Apr 21 11:28:57 sapcl03 CTS: No failure count but success !=  
requested iterations

CRM tests failed (rc=1).


This is in BasicSanityCheck then?



And there is an exception raised in CTStests.py.



The exception is because there were too many of those "hist queue is  
filling up" errors.



None of this was happening when I built heartbeat two days ago
(2006-04-19 16:08).


I recently got rid of a bunch of really horrible scripts that used to  
test basic CRM sanity... now we leverage CTS in local-only mode.

Which is good because it's picking up new problems.

What the error is about, I don't know.


This is what I was working on yesterday.  It was a little more 
stubborn than I wanted.



Gshi reviewed my patch with me last night.  The problem that BSC is 
picking up stems from some decisions I made to make BSC as aggressive 
as I know how, and from another decision to shrink the size of heartbeat's 
memory footprint a good bit.


I'll commit the patch this morning (my time).


--
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: crm by andrew from

2006-04-21 Thread Andrew Beekhof


On Apr 21, 2006, at 12:51 PM, Lars Marowsky-Bree wrote:


On 2006-04-21T12:32:26, Andrew Beekhof <[EMAIL PROTECTED]> wrote:


Why is that a warning? That's perfectly normal behaviour if several
nodes are equal - share some common attribute etc.

the extreme case is if both nodes have a score of INFINITY when
you're trying to migrate a resource


Hm, I think the extreme case is the only one where there's a problem, as
it can't run on both nodes, while infinity implies it MUST run there.
Anything below infinity isn't a problem. So I'd invert the logic to not
complain for == 0, but only for == INFINITY...


It only complains for non-zero at the moment, but I can make it even  
stricter.
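
To make the trade-off concrete, here is a rough sketch of the check being 
discussed - not the real pengine code; node_t, INFINITY's value and the log 
calls are placeholders. The first test is the current behaviour (warn on any 
tie at a non-zero top score); the second is the stricter variant along the 
lines Lars suggests, warning only when the tie is at INFINITY, since the 
resource cannot run on both nodes while INFINITY says it MUST run there.

#include <stdio.h>

#define INFINITY 1000000        /* stand-in for the CRM's "must" score */

typedef struct node_s {
    const char *name;
    int         score;          /* placement score for one resource    */
} node_t;

static void check_top_score(const char *rsc, node_t *nodes, int n)
{
    int i, best = 0, ties = 0;

    /* Find the highest score and how many nodes share it. */
    for (i = 0; i < n; i++) {
        if (nodes[i].score > best) {
            best = nodes[i].score;
            ties = 1;
        } else if (nodes[i].score == best) {
            ties++;
        }
    }

    /* Current behaviour (as described above): warn on any non-zero tie. */
    if (ties > 1 && best != 0) {
        fprintf(stderr, "WARN: %d nodes share the highest score (%d) for %s\n",
                ties, best, rsc);
    }

    /* Stricter variant: only a tie at INFINITY is a real conflict. */
    if (ties > 1 && best == INFINITY) {
        fprintf(stderr, "WARN: INFINITY tie for %s - placement is ambiguous\n",
                rsc);
    }
}

int main(void)
{
    node_t nodes[] = { { "node1", INFINITY }, { "node2", INFINITY } };

    check_top_score("resource_1", nodes, 2);
    return 0;
}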





Sincerely,
Lars Marowsky-Brée

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge" -- Charles Darwin

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


--
Andrew Beekhof

"I'd find myself if I knew where myself left me" - MGF


___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] ERROR: Message hist queue is filling up

2006-04-21 Thread Lars Marowsky-Bree
On 2006-04-21T12:37:27, Andrew Beekhof <[EMAIL PROTECTED]> wrote:

> >None of this was happening when I built heartbeat two days ago
> >(2006-04-19 16:08).
> 
> I recently got rid of a bunch of really horrible scripts that used to  
> test basic CRM sanity... now we leverage CTS in local-only mode.
> Which is good because it's picking up new problems.
> 
> What the error is about, I don't know.

Alan mentioned it to me on the phone yesterday.

MSGHIST is filling up because obviously it never ever hears from the
other node (which is eternally dead). New bug :-(


Sincerely,
Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge" -- Charles Darwin

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: crm by andrew from

2006-04-21 Thread Lars Marowsky-Bree
On 2006-04-21T12:32:26, Andrew Beekhof <[EMAIL PROTECTED]> wrote:

> >Why is that a warning? That's perfectly normal behaviour if several
> >nodes are equal - share some common attribute etc.
> the extreme case is if both nodes have a score of INFINITY when  
> you're trying to migrate a resource

Hm, I think the extreme case is the only one where there's a problem, as
it can't run on both nodes, while infinity implies it MUST run there.
Anything below infinity isn't a problem. So I'd invert the logic to not
complain for == 0, but only for == INFINITY...


Sincerely,
Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge" -- Charles Darwin

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] ERROR: Message hist queue is filling up

2006-04-21 Thread Andrew Beekhof


On Apr 21, 2006, at 11:53 AM, Dejan Muhamedagic wrote:


Hello,

In the latest HEAD code now I get a bunch of messages:

heartbeat[9464]: 2006/04/21_11:26:38 ERROR: Message hist queue is  
filling up (200 messages in queue)


The test fails:

Apr 21 11:28:57 sapcl03 CTS: No failure count but success !=  
requested iterations

CRM tests failed (rc=1).


This is in BasicSanityCheck then?



And there is an exception raised in CTStests.py.



The exception is because there were too many of those "hist queue is  
filling up" errors.



None of this was happening when I built heartbeat two days ago
(2006-04-19 16:08).


I recently got rid of a bunch of really horrible scripts that used to  
test basic CRM sanity... now we leverage CTS in local-only mode.

Which is good because it's picking up new problems.

What the error is about, I don't know.

--
Andrew Beekhof

"No means no, and no means yes, and everything in between and all the  
rest" - TISM


___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: crm by andrew from

2006-04-21 Thread Andrew Beekhof


On Apr 21, 2006, at 12:15 PM, Lars Marowsky-Bree wrote:


On 2006-04-21T03:18:27, linux-ha-cvs@lists.linux-ha.org wrote:


linux-ha CVS committal

Author  : andrew
Host:
Project : linux-ha
Module  : crm

Dir : linux-ha/crm/pengine


Modified Files:
stages.c utils.c pe_utils.h


Log Message:


Log a warning if more than one node has the highest score for running a
  given resource.


Why is that a warning? That's perfectly normal behaviour if several
nodes are equal - share some common attribute etc.


the extreme case is if both nodes have a score of INFINITY when  
you're trying to migrate a resource


But I'm happy to lower it to LOG_INFO.



And, isn't that the default for _all_ nodes in a symmetric cluster?


It also checks for non-zero, so this case won't be complained about.



Sincerely,
Lars Marowsky-Brée

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge" -- Charles Darwin

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


--
Andrew Beekhof

"Ooo Ahhh, Glenn McRath" - TISM


___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


[Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: crm by andrew from

2006-04-21 Thread Lars Marowsky-Bree
On 2006-04-21T03:18:27, linux-ha-cvs@lists.linux-ha.org wrote:

> linux-ha CVS committal
> 
> Author  : andrew
> Host: 
> Project : linux-ha
> Module  : crm
> 
> Dir : linux-ha/crm/pengine
> 
> 
> Modified Files:
>   stages.c utils.c pe_utils.h 
> 
> 
> Log Message:
> 
> 
> Log a warning if more than one node has the highest score for running a
>   given resource.

Why is that a warning? That's perfectly normal behaviour if several
nodes are equal - share some common attribute etc.

And, isn't that the default for _all_ nodes in a symmetric cluster?

Sincerely,
Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge" -- Charles Darwin

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] core dump (abort) in crmd: "untracked process" (HEAD)

2006-04-21 Thread Andrew Beekhof


On Apr 20, 2006, at 3:58 PM, Dejan Muhamedagic wrote:


Hi,

#4  0x0805add0 in build_operation_update (xml_rsc=0x8109430, op=0x8282b98,
    src=0x80692d9 "do_update_resource", lpc=0) at lrm.c:347

(gdb) print *0x8282b98
$1 = 136448640

If you want I can send you the core off list. I keep all the cores :)


That's ok - I think I understand the problem enough now.
And since you're running against HEAD, if you update you'll get the  
fixes :)




Cheers,

Dejan

On Thu, Apr 20, 2006 at 09:15:28AM +0200, Andrew Beekhof wrote:

On 4/20/06, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:

Hello,

Running CTS with HEAD hung the cluster after crmd dumped core
(abort).  It happened after 53 tests with this curious message:

Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]:  
ERROR: mask(lrm.c:build_operation_update): Triggered non-fatal  
assert at lrm.c:349: fsa_our_dc_version != NULL


We have two kinds of asserts... neither is supposed to happen, and
both create a core file so that we can diagnose how we got there.
However, the non-fatal ones call fork() first (so the main process
doesn't die) and then take some recovery action.

Sometimes the non-fatal varieties are used in new pieces of code to
make sure they work as we expect, and that is what has happened here.
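
In rough terms, the non-fatal variant works something like the sketch below - 
the macro name and messages are invented for illustration and this is not the 
actual heartbeat code.  On a failed assertion the process forks: the child 
calls abort() so a core file gets written for later analysis, while the parent 
logs the event and keeps running.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NONFATAL_ASSERT(expr)                                           \
    do {                                                                \
        if (!(expr)) {                                                  \
            pid_t pid = fork();                                         \
            if (pid == 0) {                                             \
                abort();            /* child dumps core and exits */    \
            }                                                           \
            fprintf(stderr,                                             \
                    "ERROR: Triggered non-fatal assert at %s:%d: %s\n", \
                    __FILE__, __LINE__, #expr);                         \
            if (pid > 0) {                                              \
                waitpid(pid, NULL, 0);  /* reap the dumping child */    \
            }                                                           \
            /* ...recovery action would go here... */                   \
        }                                                               \
    } while (0)

int main(void)
{
    const char *fsa_our_dc_version = NULL;   /* mimic the report above */

    NONFATAL_ASSERT(fsa_our_dc_version != NULL);
    printf("main process is still alive after the assert\n");
    return 0;
}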

Do you still have the core file?
I'd be interested to know the result of:
   print *op
from frame #4

In the meantime, I'll look at the logs and see what I can figure out.

Apr 19 17:48:01 BadNews: Apr 19 17:42:48 sapcl01 crmd: [17937]:  
ERROR: Exiting untracked process process 19654 dumped core
Apr 19 17:48:01 BadNews: Apr 19 17:45:49 sapcl01 crmd: [17937]:  
ERROR: mask(utils.c:crm_timer_popped): Finalization Timer  
(I_ELECTION) just popped!


The cluster looks like this, unchanged for several hours:


Last updated: Thu Apr 20 04:43:47 2006
Current DC: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d)
3 Nodes configured.
3 Resources configured.


Node: sapcl03 (0bfb78a2-fcd2-4f52-8a06-2d17437a6750): online
Node: sapcl02 (09fa194c-d7e1-41fa-a0d0-afd79a139181): online
Node: sapcl01 (85180fd0-70c9-4136-a5e0-90d89ea6079d): online

Resource Group: group_1
    IPaddr_1        (heartbeat::ocf:IPaddr):        Started sapcl03
    LVM_2           (heartbeat::ocf:LVM):           Stopped
    Filesystem_3    (heartbeat::ocf:Filesystem):    Stopped
Resource Group: group_2
    IPaddr_2        (heartbeat::ocf:IPaddr):        Started sapcl02
    LVM_3           (heartbeat::ocf:LVM):           Started sapcl02
    Filesystem_4    (heartbeat::ocf:Filesystem):    Started sapcl02
Resource Group: group_3
    IPaddr_3        (heartbeat::ocf:IPaddr):        Started sapcl03
    LVM_4           (heartbeat::ocf:LVM):           Started sapcl03
    Filesystem_5    (heartbeat::ocf:Filesystem):    Started sapcl03


And:

sapcl01# crmadmin -S sapcl01
Status of [EMAIL PROTECTED]: S_TERMINATE (ok)

All processes are still running on this node, but heartbeat seems
to be in some kind of limbo.

Cheers,

Dejan


___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/





___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


--
Andrew Beekhof

"Ooo Ahhh, Glenn McRath" - TISM


___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/