Hello,

Andrew Beekhof wrote:

>
> On Jan 19, 2006, at 12:54 PM, Lars Marowsky-Bree wrote:
>
>> - #1037: lrmd reports TIMEOUT although RA was never called
>>
>> This looks fairly obscure. I've asked for a clarification with CVS HEAD,
>> because we've seen so many changes it's hard to say whether it's still
>> an issue.
>>
>>
>> I'm ignoring everything below critical for now; those are, by
>> definition, not release critical, even though they may be major
>> annoyances. But I think we need to roll this out _now_. I think we
>> should, if we decide to give it a thumbs up, be able to roll this out
>> after a weekend of test cycles.
>
>
> I'd second that.  I've been running a few thousand tests in the last
> week and it's been pretty damn stable.


The problem I observe only manifested after the resources had been
online for about 24h, with one or two resource groups, each with some
resources defined in it.  So I'm not sure that the tests you run really
are "real life", so to speak.  The resource agents put real stress on
the cib, as they run crm_attribute on every monitor action: that's
about 10 RAs calling crm_attribute every 30 seconds.  This results in
the message "Processing cib_query operation from ..." occurring in
syslog about every second.
I have two installations running the CVS revision from 18.1.2005 that
have been running until now without problems - knock on wood.
Please let me stress this further:  Your tests are important to see if
your code is reliable.  Unfortunately they don't seem to be enough.  I
don't want to get into the discussion of "you cannot test everything".
That is granted.  But it seems it would be good to run tests with more
"real-life examples" - and to run those for a longer period of time.
If you have the resources to set up a cluster with two physical
machines and define resource groups with - well, why not - all possible
resources (nfs, samba, drbd, ...), please do so.

>
> The only possible reason (from the CRM side) to delay a release is
> if we can find the root of Peter's CIB problems.

Yes, please.  From my own experience, I would rather not consider
problems I cannot reproduce either, sure.  And I don't expect you to
take responsibility for Resource Agents not written by you.  But believe
me, on _every_ installation we have made so far we had the same
problem - that is, lrmd reported a timeout on one resource agent and
heartbeat was not able to recover, which is ... well - bad.
So far I have tracked the problem down to one of the crm_attribute calls
taking too much time at one point.
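
To catch such a slow call in the act, something like the following
could be sourced by a resource agent.  This is only a minimal sketch:
the function name timed and the variable SLOW_THRESHOLD are made up for
illustration and are not part of heartbeat, and the threshold of 5
seconds is an arbitrary assumption.

```shell
#!/bin/sh
# Hypothetical helper, not part of heartbeat: run a command and print a
# warning on stderr if it took at least SLOW_THRESHOLD seconds.
# Relies on "date +%s", which is widely available but not strictly POSIX.
SLOW_THRESHOLD=${SLOW_THRESHOLD:-5}   # assumed default, in seconds

timed() {
    start=$(date +%s)
    "$@"                      # run the wrapped command unchanged
    rc=$?
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$elapsed" -ge "$SLOW_THRESHOLD" ]; then
        echo "SLOW: '$*' took ${elapsed}s (rc=$rc)" >&2
    fi
    return "$rc"              # preserve the wrapped command's exit code
}
```

In a monitor action one would then put timed in front of the existing
crm_attribute call, keeping the arguments the agent already passes.
Since the exit code is passed through, the agent's own logic should not
be affected.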

As I'm not a coder, it's not easy for me to understand the details
of heartbeat, but I'm willing to, and I am going to, help make heartbeat
_the_ open source HA software available.

Thanks,

    Peter
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
