On 2006-01-19T20:17:58, Peter Kruse <[EMAIL PROTECTED]> wrote:

> The problem I observe only manifested after the resources have been
> online for about 24h with one or two resource groups with some
> resources defined in it.  So I'm not sure that the tests you run
> really are "real life", so to speak.  The resource agents really put some
> stress on the cib as they run crm_attribute on every monitor action,
> that's about 10 RAs calling crm_attribute every 30 seconds.

Woah, what are you calling crm_attribute for all the time?

> This results in the message
> "Processing cib_query operation from ..." occurring in syslog about every
> second.

Yeah, we know, logging needs tuning. This one probably needs to be
tuned down.

> I have two installations running CVS revision from 18.1.2005 running
> until now without problems - knock on wood.

2005-01-18? Woah. We should have just not done the last year then ;-)

> Please let me stress this further:  Your tests are important to see if
> your code is reliable.  Unfortunately they don't seem to be enough.  I
> don't want to get into the discussion of "you cannot test everything".
> That is granted.

No, you're of course right, and we need to improve our testing effort
all the time. We'll do a lot more testing as we gear up for shipping
SLES10, for example, and that includes more real-life testing by humans.

We try to add testcases for bugs which were fixed, so that we can catch
regressions, or similar bugs in other pieces of code.

Automated tests can never be a replacement for pilot deployments though;
they feed on them, but can't replace them ;-)

> But it seems it would be good to run tests with more "real-life examples" -
> and those for a longer period of time.  If you have the resources to set
> up a cluster with two physical machines and define resource groups
> with - well why not - all possible resources (nfs, samba, drbd, ...)
> please do so.

Yes, long-term tests (i.e., how do we perform in the _stable_ case?) are
annoyingly difficult to run with a test harness. Or rather,
it's perfectly possible, just very boring.

In one way I want to get on the "we can't test everything" box though:
That's why your testing is so valuable to us. We'll never be able to
test what _you_ want, but we will be happy to fix the bugs you find...

> > The only possible reason (from the CRM side) to delay a
> > release is if we can find a root cause for Peter's CIB problems.
> yes, please.  From my own experience I rather prefer not to consider
> problems I cannot reproduce, sure.  And I don't expect you to take
> responsibility for Resource Agents not written by you.  But believe
> me, on _every_ installation we made so far we had the same problem -
> that is, lrmd reporting a timeout on one resource agent and heartbeat
> was not able to recover which is ...  well - bad.
> So far I have tracked down the problem to one of the crm_attribute
> calls taking too much time at one point.

Hm. How annoying, but I'll leave that to Andrew to debug I'm afraid,
he's the guru of that code.

A regression test which just pounds the CIB with queries from several
clients in parallel however seems a good idea. Andrew, if you're bored,
how about such a testcase? (We could add it to BSC, or at least run it
on demand there.)
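
A minimal sketch of what such a test could look like, assuming nothing
about the real CTS/BSC harness: several client threads each run a query
command repeatedly, and we record failures, timeouts and the slowest
call. The names timed_call and pound are made up for illustration, and
the default command is a harmless placeholder; the actual CIB query
command line would have to be passed in by whoever runs it.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(cmd, timeout):
    """Run cmd once; return (elapsed_seconds, ok).

    ok is False if the command times out, cannot be started,
    or exits with a non-zero status.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        ok = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        ok = False
    return time.monotonic() - start, ok


def pound(cmd, clients=10, calls_per_client=20, timeout=5.0):
    """Run cmd from several client threads in parallel and summarize."""

    def client(_index):
        # Each client issues its calls back to back.
        return [timed_call(cmd, timeout) for _ in range(calls_per_client)]

    with ThreadPoolExecutor(max_workers=clients) as pool:
        per_client = list(pool.map(client, range(clients)))

    samples = [sample for batch in per_client for sample in batch]
    return {
        "calls": len(samples),
        "failures": sum(1 for _, ok in samples if not ok),
        "slowest": max(elapsed for elapsed, _ in samples),
    }


if __name__ == "__main__":
    # Placeholder: with no arguments this only starts a no-op Python
    # process. Pass the real query command on the command line instead.
    command = sys.argv[1:] or [sys.executable, "-c", "pass"]
    print(pound(command))
```

The default of 10 clients only mirrors the "about 10 RAs" figure quoted
above; the call count and the timeout are arbitrary. A single slow call
shows up as a large "slowest" value or as a failure, which is the
symptom Peter describes.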

> As I'm not a coder, it's not easy for me to understand the details
> of heartbeat, but I'm willing to, and going to help make heartbeat
> _the_ opensource HA software available.

Great!

Well, and if you're using this for commercial deployments with the
intention of making money, I of course have to pitch using SLES because
that keeps us paid. But if you invest a modest amount of money in some
student's diploma thesis or practicum on using/developing Linux HA,
you'll make her/him, yourself and the Linux HA project very happy ;-)



Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
    -- Charles Darwin
