Re: [Pacemaker] Release model

2013-06-28 Thread Lars Marowsky-Bree
On 2013-06-28T18:41:35, Andrew Beekhof and...@beekhof.net wrote:

  There's an exception: dropping commonly used external interfaces (say,
  ptest) needs to be announced a few releases in advance before enacted
  upstream. (And if Enterprise distributions want to keep something, they
  have time to prepare for that.) And of course, if major components get
  rewritten, they either need more testing or should be in place in
  parallel for 1 or 2 releases.
 Now we start to diverge...
 
 Keeping two lrmd's around? Two stonithd's?

Well, I can dream, can't I. ;-) But perhaps you're right. The LRM
rewrite taught us something about the perils of rewriting components
that are badly documented and don't have good regression tests and where
not all options they supported were written down somewhere.

But as an isolated component, would it have been so difficult to ship a
separate implementation of the LRM first, perhaps as a compile time
switch? (Assuming the interface to the component doesn't change so much.
It could hardly have been worse than supporting all those different
messaging APIs and their versions.)

The latter is, perhaps, not a bad example.

 Or two copies of the PE after I rewrite ordering constraints? Urgh :-(

The PE is different; almost all of its features are documented and
protected by strong regression tests. That support for an option would
be dropped by accident is almost unthinkable. Hence, the implementation
can be considered almost entirely internal.

But people were using options that the new LRM no longer supported,
called lrmadmin in some of their scripts, etc. So I think the
differentiation between the PE and the LRM does exist.

Perhaps the lesson is "Write regression tests before a rewrite." (And
I'm not saying it's a lesson that depended entirely on you or David. If
cluster-glue's LRM had had such a suite, it'd certainly have helped
tons.)

The Linux kernel 3.x series seems to be coping quite nicely, too. They
do have stable series to which they backport, though. That's always an
option: if $someone feels the need to do longer support for, say,
1.1.10, they can always help start 1.1.10.x.

 If that sort of thing wasn't such a PITA you'd have done it with 1.1.8.

Yeah, and there were some here who advocated this. Given the scope of
the other changes at the time, I thought it better to integrate it via a
different path into SLE HA.

 Which is the problem with the Firefox model - either there is no good time 
 to make them, or users hate us because we can make them at any time.

For Firefox, though, I've never noticed a problem (and I'm an ardent
follower of the updates). The exceptions are, of course, add-ons: so I
don't update until the add-ons I depend on are also updated.

 Even broadcasting changes can have limited value.
 To use a recent example, crmsh was left in place for well over a year (iirc) 
 before it was dropped.
 That didn't seem to help anything...

Probably a communication problem.

And the way we fixed this on SLE HA was to pull in the new package
via a dependency, so that users never noticed that we split the
projects. Clearly, that's impossible to do when one chooses to drop a
major component for good.

  (Perpetuated by customers willing to pay for it, and because admittedly
  not all components have good test suites.)
 Me too, but how do we do this where all the downside doesn't fall on me?

I'm not sure there's a huge downside in it for you? You'd get to develop
and bring forward pacemaker 2.x all you want - and if RHEL7 wanted to
freeze a specific version, they'd support 2.x.y for that. (OK, so that
would probably be you too, though.)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Release model

2013-06-28 Thread Andrew Beekhof

On 28/06/2013, at 8:59 PM, Lars Marowsky-Bree l...@suse.com wrote:

 On 2013-06-28T18:41:35, Andrew Beekhof and...@beekhof.net wrote:
 
 There's an exception: dropping commonly used external interfaces (say,
 ptest) needs to be announced a few releases in advance before enacted
 upstream. (And if Enterprise distributions want to keep something, they
 have time to prepare for that.) And of course, if major components get
 rewritten, they either need more testing or should be in place in
 parallel for 1 or 2 releases.
 Now we start to diverge...
 
 Keeping two lrmd's around? Two stonithd's?
 
 Well, I can dream, can't I. ;-)

Of course.

 But perhaps you're right. The LRM
 rewrite taught us something about the perils of rewriting components
 that are badly documented and don't have good regression tests and where
 not all options they supported were written down somewhere.
 
 But as an isolated component, would it have been so difficult to ship a
 separate implementation of the LRM first, perhaps as a compile time
 switch?

Honestly - more than likely it would have been.
Looking at the commit for just the glue between crmd and lrmd:

# git show --stat 6f8f559 crmd/lrm.c
commit 6f8f5594a940b015cf7fadb26c1da7e110d73103
Author: David Vossel dvos...@redhat.com
Date:   Wed May 30 17:42:37 2012 -0500

    High: crmd: Enable use of new lrmd daemon and client library in crmd.

 crmd/lrm.c | 687 +++---
 1 file changed, 274 insertions(+), 413 deletions(-)


there is certainly some easy stuff to mask:

* HA_OK -> lrmd_ok
* lrm_free_rsc() -> lrmd_free_rsc_info()

But there are also some fundamental changes to the crmd/lrmd interaction.

 (Assuming the interface to the component doesn't change so much.
 It could hardly have been worse than supporting all those different
 messaging APIs and their versions.)
 
 The latter is, perhaps, not a bad example.

Not a bad example, but the different messaging APIs have/had an expected future 
lifespan of more than a release or two.

 
 Or two copies of the PE after I rewrite ordering constraints? Urgh :-(
 
 The PE is different; almost all of its features are documented and
 protected by strong regression tests. That support for an option would
 be dropped by accident is almost unthinkable. Hence, the implementation
 can be considered almost entirely internal.
 
 But people were using options that the new LRM no longer supported,
 called lrmadmin in some of their scripts, etc. So I think the
 differentiation between the PE and the LRM does exist.

Agreed.

 
 Perhaps the lesson is "Write regression tests before a rewrite." (And
 I'm not saying it's a lesson that depended entirely on you or David. If
 cluster-glue's LRM had had such a suite, it'd certainly have helped
 tons.)

I think he did actually.
But given that some features were an undocumented distant memory, it was hard 
for him to know to write a test for it.

 The Linux kernel 3.x series seems to be coping quite nicely, too. They
 do have stable series to which they backport, though. That's always an
 option: if $someone feels the need to do longer support for, say,
 1.1.10, they can always help start 1.1.10.x.
 
 If that sort of thing wasn't such a PITA you'd have done it with 1.1.8.
 
 Yeah, and there were some here who advocated this. Given the scope of
 the other changes at the time, I thought it better to integrate it via a
 different path into SLE HA.
 
 Which is the problem with the Firefox model - either there is no good time 
 to make them, or users hate us because we can make them at any time.
 
 For Firefox, though, I've never noticed a problem (and I'm an ardent
 follower of the updates). The exceptions are, of course, add-ons: so I
 don't update until the add-ons I depend on are also updated.
 
 Even broadcasting changes can have limited value.
 To use a recent example, crmsh was left in place for well over a year (iirc) 
 before it was dropped.
 That didn't seem to help anything...
 
 Probably a communication problem.
 
 And the way we fixed this on SLE HA was to pull in the new package
 via a dependency, so that users never noticed that we split the
 projects. Clearly, that's impossible to do when one chooses to drop a
 major component for good.
 
 (Perpetuated by customers willing to pay for it, and because admittedly
 not all components have good test suites.)
 Me too, but how do we do this where all the downside doesn't fall on me?
 
 I'm not sure there's a huge downside in it for you?

OK, let's take attrd for example - which I've been wanting to rewrite to be truly
atomic for half a decade or more.

Under this model, not only do I have to find the time to write and test the new 
addition, but I also have to:
* keep maintaining the old code until... when?
* probably write 

Re: [Pacemaker] Release model

2013-06-28 Thread Dejan Muhamedagic
Hi Lars,

On Fri, Jun 28, 2013 at 12:59:22PM +0200, Lars Marowsky-Bree wrote:
[...]
 If
 cluster-glue's LRM had had such a suite, it'd certainly have helped
 tons.)

It did have a regression suite.

Thanks,

Dejan



Re: [Pacemaker] Release model

2013-06-28 Thread Lars Marowsky-Bree
On 2013-06-28T14:49:06, Dejan Muhamedagic deja...@fastmail.fm wrote:

  If cluster-glue's LRM had had such a suite, it'd certainly have
  helped tons.)
 It did have a regression suite.

Yes, well, but it didn't test for LRM_MAX_CHILDREN or the secret
support, for example. So it didn't really document the interface
completely either.
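
A minimal sketch of how such a gap could be caught before a rewrite, assuming the supported options are first written down in one place. Only LRM_MAX_CHILDREN and the secret support come from this thread; the other option names, the test names, and the helper untested_options() are hypothetical placeholders, not anything from cluster-glue or Pacemaker.

```python
# Hypothetical inventory of every option the old implementation honours.
# "LRM_MAX_CHILDREN" and "secrets" are named in this thread; the other
# entries are invented placeholders for illustration.
SUPPORTED_OPTIONS = {"LRM_MAX_CHILDREN", "secrets", "op_timeout", "op_interval"}

# Hypothetical map: regression test name -> options that test exercises.
TEST_COVERAGE = {
    "test_start_stop": {"op_timeout"},
    "test_recurring_monitor": {"op_interval", "op_timeout"},
}


def untested_options(supported, coverage):
    """Return, sorted, the supported options that no regression test exercises."""
    exercised = set().union(*coverage.values())
    return sorted(supported - exercised)


if __name__ == "__main__":
    for name in untested_options(SUPPORTED_OPTIONS, TEST_COVERAGE):
        print(f"no regression test covers: {name}")
```

The point is only that the inventory has to exist before the check can: a suite can pass completely and still say nothing about an option nobody listed.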


Regards,
Lars



Re: [Pacemaker] Release model

2013-06-28 Thread Lars Marowsky-Bree
On 2013-06-28T22:04:48, Andrew Beekhof and...@beekhof.net wrote:

 I think he did actually.

Well, yes, but the hg history or reading the existing code would
probably have been quite helpful. I'll take "not well documented", but
it's hard to say the rewrite was handled very well. I don't want to
get drawn into this too much, though; it's sidetracking this
discussion. And done by now. And hopefully we'll not do something like
that again ;-)

  I'm not sure there's a huge downside in it for you?
 OK, let's take attrd for example - which I've been wanting to rewrite to be
 truly atomic for half a decade or more.

If it's rewritten in a way that doesn't affect external users but that
can be covered well by tests, I'd not think that having two versions of
the code in parallel would make sense, yes.

 In my perfect world, under this model, RH would dip into the releases and 
 take every 2nd or 3rd, whatever was ready at the time.

Yes, so effectively what SLE HA is already doing. Why not live in that
perfect world now! ;-)

 Btw. _IF_ we do this, I'd be wanting to go with Pacemaker-$x (no .y or .z).
 We shouldn't create the impression of doing release series when we're not.

I was mostly stealing the numbering scheme from the Linux kernel. But if
you're mostly thinking about this in terms of Firefox, sure. I don't
really mind.


Regards,
Lars



Re: [Pacemaker] Release model

2013-06-28 Thread Digimer
On 06/28/2013 08:04 AM, Andrew Beekhof wrote:
 Under this model, not only do I have to find the time to write and test the 
 new addition, but I also have to:
 * keep maintaining the old code until... when?
 * probably write and maintain a compatibility layer
 * make it possible to choose which gets used (a small but annoying task)
 * make it possible to figure out which was in use for support
 * educate people that there are two, and when to choose one over the other
 * answer copious emails from confused users

Won't all this need to be settled before RHEL 7 ships anyway? There will
be a 10 year support life of the version shipped on day 1, and I quite
strongly expect that pcmk will have long-ago moved on while this version
still needs supporting.

Or will that part of things be handled by a team in RHEL, as you
mentioned earlier about it being up to the EL people to sort out?

I've got no opinion either way, I am just curious as a user how the
mechanics of this work/will work.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [Pacemaker] Release model

2013-06-28 Thread Andrew Beekhof

On 28/06/2013, at 11:37 PM, Lars Marowsky-Bree l...@suse.com wrote:

 
 I'm not sure there's a huge downside in it for you?
 OK, let's take attrd for example - which I've been wanting to rewrite to be
 truly atomic for half a decade or more.
 
 If it's rewritten in a way that doesn't affect external users but that
 can be covered well by tests, I'd not think that having two versions of
 the code in parallel would make sense, yes.

attrd is quite tough to write unit tests for - almost all of its functionality
requires multiple nodes, which is why I picked it for illustration.