Re: [Pacemaker] Release model
On 2013-06-28T18:41:35, Andrew Beekhof and...@beekhof.net wrote:

There's an exception: dropping commonly used external interfaces (say, ptest) needs to be announced a few releases in advance before enacted upstream. (And if Enterprise distributions want to keep something, they have time to prepare for that.) And of course, if major components get rewritten, they either need more testing or should be in place in parallel for 1 or 2 releases.

Now we start to diverge... Keeping two lrmd's around? Two stonithd's?

Well, I can dream, can't I. ;-)

But perhaps you're right. The LRM rewrite taught us something about the perils of rewriting components that are badly documented and don't have good regression tests and where not all options they supported were written down somewhere. But as an isolated component, would it have been so difficult to ship a separate implementation of the LRM first, perhaps as a compile time switch?

(Assuming the interface to the component doesn't change so much. It could hardly have been worse than supporting all those different messaging APIs and their versions.) The latter is, perhaps, not a bad example.

Or two copies of the PE after I rewrite ordering constraints?

Urgh :-( The PE is different; almost all of its features are documented and protected by strong regression tests. That support for an option would be dropped by accident is almost unthinkable. Hence, the implementation can be considered almost entirely internal. But people were using options that the new LRM no longer supported, called lrmadmin in some of their scripts, etc. So I think the differentiation between the PE and the LRM does exist.

Perhaps the lesson is "Write regression tests before a rewrite." (And I'm not saying it's a lesson that depended entirely on you or David. If cluster-glue's LRM had had such a suite, it'd certainly have helped tons.)

The Linux kernel 3.x series seems to be coping quite nicely, too.

They do have stable series to which they backport, though.
That's always an option: if $someone feels the need to do longer support for, say, 1.1.10, they can always help start 1.1.10.x.

If that sort of thing wasn't such a PITA you'd have done it with 1.1.8.

Yeah, and there were some here who advocated this. Given the scope of the other changes at the time, I thought it better to integrate it via a different path into SLE HA.

Which is the problem with the Firefox model - either there is no good time to make them, or users hate us because we can make them at any time.

For Firefox, though, I've never noticed a problem (and I'm an ardent follower of the updates). The exceptions are, of course, add-ons: so I don't update until the add-ons I depend on are also updated.

Even broadcasting changes can have limited value. To use a recent example, crmsh was left in place for well over a year (iirc) before it was dropped. That didn't seem to help anything...

Probably a communication problem. And the way we fixed this on SLE HA was to pull in the new package via a dependency, so that users never noticed that we split the projects. Clearly, that's impossible to do when one chooses to drop a major component for good.

(Perpetuated by customers willing to pay for it, and because admittedly not all components have good test suites.)

Me too, but how do we do this where all the downside doesn't fall on me?

I'm not sure there's a huge downside in it for you? You'd get to develop and bring forward pacemaker 2.x all you want - and if RHEL7 wanted to freeze a specific version, they'd support 2.x.y for that. (OK, so that would probably be you too, though.)

Regards, Lars

-- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes.
-- Oscar Wilde ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Release model
On 28/06/2013, at 8:59 PM, Lars Marowsky-Bree l...@suse.com wrote:

On 2013-06-28T18:41:35, Andrew Beekhof and...@beekhof.net wrote:

There's an exception: dropping commonly used external interfaces (say, ptest) needs to be announced a few releases in advance before enacted upstream. (And if Enterprise distributions want to keep something, they have time to prepare for that.) And of course, if major components get rewritten, they either need more testing or should be in place in parallel for 1 or 2 releases.

Now we start to diverge... Keeping two lrmd's around? Two stonithd's?

Well, I can dream, can't I. ;-)

Of course.

But perhaps you're right. The LRM rewrite taught us something about the perils of rewriting components that are badly documented and don't have good regression tests and where not all options they supported were written down somewhere. But as an isolated component, would it have been so difficult to ship a separate implementation of the LRM first, perhaps as a compile time switch?

Honestly - more than likely it would have been. Looking at the commit for just the glue between crmd and lrmd:

    # git show --stat 6f8f559 crmd/lrm.c
    commit 6f8f5594a940b015cf7fadb26c1da7e110d73103
    Author: David Vossel dvos...@redhat.com
    Date:   Wed May 30 17:42:37 2012 -0500

        High: crmd: Enable use of new lrmd daemon and client library in crmd.

     crmd/lrm.c | 687 +++---
     1 file changed, 274 insertions(+), 413 deletions(-)

there is certainly some easy stuff to mask:

* HA_OK -> lrmd_ok
* lrm_free_rsc() -> lrmd_free_rsc_info()

But there's also some fundamental changes to the crmd/lrmd interaction.

(Assuming the interface to the component doesn't change so much. It could hardly have been worse than supporting all those different messaging APIs and their versions.) The latter is, perhaps, not a bad example.

Not a bad example, but the different messaging APIs have/had an expected future lifespan of more than a release or two.

Or two copies of the PE after I rewrite ordering constraints?
Urgh :-( The PE is different; almost all of its features are documented and protected by strong regression tests. That support for an option would be dropped by accident is almost unthinkable. Hence, the implementation can be considered almost entirely internal. But people were using options that the new LRM no longer supported, called lrmadmin in some of their scripts, etc. So I think the differentiation between the PE and the LRM does exist.

Agreed.

Perhaps the lesson is "Write regression tests before a rewrite." (And I'm not saying it's a lesson that depended entirely on you or David. If cluster-glue's LRM had had such a suite, it'd certainly have helped tons.)

I think he did actually. But given that some features were an undocumented distant memory, it was hard for him to know to write a test for it.

The Linux kernel 3.x series seems to be coping quite nicely, too.

They do have stable series to which they backport, though.

That's always an option: if $someone feels the need to do longer support for, say, 1.1.10, they can always help start 1.1.10.x.

If that sort of thing wasn't such a PITA you'd have done it with 1.1.8.

Yeah, and there were some here who advocated this. Given the scope of the other changes at the time, I thought it better to integrate it via a different path into SLE HA.

Which is the problem with the Firefox model - either there is no good time to make them, or users hate us because we can make them at any time.

For Firefox, though, I've never noticed a problem (and I'm an ardent follower of the updates). The exceptions are, of course, add-ons: so I don't update until the add-ons I depend on are also updated.

Even broadcasting changes can have limited value. To use a recent example, crmsh was left in place for well over a year (iirc) before it was dropped. That didn't seem to help anything...

Probably a communication problem.
And the way we fixed this on SLE HA was to pull in the new package via a dependency, so that users never noticed that we split the projects. Clearly, that's impossible to do when one chooses to drop a major component for good.

(Perpetuated by customers willing to pay for it, and because admittedly not all components have good test suites.)

Me too, but how do we do this where all the downside doesn't fall on me?

I'm not sure there's a huge downside in it for you?

Ok, let's take attrd for example - which I've been wanting to rewrite to be truly atomic for half a decade or more. Under this model, not only do I have to find the time to write and test the new addition, but I also have to:

* keep maintaining the old code until... when?
* probably write
Re: [Pacemaker] Release model
Hi Lars, On Fri, Jun 28, 2013 at 12:59:22PM +0200, Lars Marowsky-Bree wrote: [...] If cluster-glue's LRM had had such a suite, it'd certainly have helped tons.) It did have a regression suite. Thanks, Dejan
Re: [Pacemaker] Release model
On 2013-06-28T14:49:06, Dejan Muhamedagic deja...@fastmail.fm wrote: If cluster-glue's LRM had had such a suite, it'd certainly have helped tons.) It did have a regression suite. Yes, well, but it didn't test for LRM_MAX_CHILDREN or the secret support, for example. So it didn't really document the interface completely either. Regards, Lars
Re: [Pacemaker] Release model
On 2013-06-28T22:04:48, Andrew Beekhof and...@beekhof.net wrote:

I think he did actually.

Well, yes, but the hg history or reading the existing code would probably have been quite helpful. I'll take "not well documented", but it's hard to say the rewrite was handled very well. But I don't want to get drawn into this too much though, it's side tracking this discussion. And done by now. And hopefully we'll not do something like that again ;-)

I'm not sure there's a huge downside in it for you?

Ok, let's take attrd for example - which I've been wanting to rewrite to be truly atomic for half a decade or more.

If it's rewritten in a way that doesn't affect external users but that can be covered well by tests, I'd not think that having two versions of the code in parallel would make sense, yes.

In my perfect world, under this model, RH would dip into the releases and take every 2nd or 3rd, whatever was ready at the time.

Yes, so effectively what SLE HA is already doing. Why not live in that perfect world now! ;-)

Btw. _IF_ we do this, I'd be wanting to go with Pacemaker-$x (no .y or .z). We shouldn't create the impression of doing release series when we're not.

I was mostly stealing the numbering scheme from the Linux kernel. But if you're mostly thinking about this in terms of Firefox, sure. I don't really mind.

Regards, Lars
Re: [Pacemaker] Release model
On 06/28/2013 08:04 AM, Andrew Beekhof wrote:

Under this model, not only do I have to find the time to write and test the new addition, but I also have to:

* keep maintaining the old code until... when?
* probably write and maintain a compatibility layer
* make it possible to choose which gets used (a small but annoying task)
* make it possible to figure out which was in use for support
* educate people that there are two, and when to choose one over the other
* answer copious emails from confused users

Won't all this need to be settled before RHEL 7 ships anyway? There will be a 10 year support life of the version shipped on day 1, and I quite strongly expect that pcmk will have long-ago moved on while this version still needs supporting. Or will that part of things be handled by a team in RHEL, as you mentioned earlier about it being up to the EL people to sort out?

I've got no opinion either way, I am just curious as a user how the mechanics of this work/will work.

-- Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [Pacemaker] Release model
On 28/06/2013, at 11:37 PM, Lars Marowsky-Bree l...@suse.com wrote:

I'm not sure there's a huge downside in it for you?

Ok, let's take attrd for example - which I've been wanting to rewrite to be truly atomic for half a decade or more.

If it's rewritten in a way that doesn't affect external users but that can be covered well by tests, I'd not think that having two versions of the code in parallel would make sense, yes.

attrd is quite tough to write unit tests for - almost all of its functionality requires multiple nodes. Hence why I picked it for illustration.