It doesn't feel like the difference between

  juju ensure-ha --prefer-machines 11,37

and

  juju add-state-server --to 11,37

is worth the amount of reasoning there. I'm clearly in favor of the
latter, but I wouldn't argue so much for it.

On Fri, Nov 8, 2013 at 2:00 PM, William Reade <william.re...@canonical.com> wrote:
> I'm concerned that we're (1) rehashing decisions made during the sprint and
> (2) deviating from requirements in doing so.
>
> In particular, abstracting HA away into "management" manipulations -- as
> roger notes, pretty much isomorphic to the "jobs" proposal -- doesn't give
> users HA so much as it gives them a limited toolkit with which they can
> more-or-less construct their own HA; in particular, allowing people to use
> an even number of state servers is strictly a bad thing [0], and I'm
> extremely suspicious of any proposal that opens that door.
>
> Of course, some will argue that mongo should be able to scale separately
> from the api servers and other management tasks, and this is a worthy goal;
> but in this context it sucks us down into the morass of exposing different
> types of management on different machines, and ends up approaching the jobs
> proposal still closer, in that it requires users to assimilate a whole load
> of extra terminology in order to perform a conceptually simple function.
>
> Conversely, "ensure-ha" (with possible optional --redundancy=N flag,
> defaulting to 1) is a simple model that can be simply explained: the
> command's sole purpose is to ensure that juju management cannot fail as a
> result of the simultaneous failure of <=N machines. It's a *user-level*
> construct that will always be applicable even in the context of a more
> sophisticated future language (no matter what's going on with this
> complicated management/jobs business, you can run that and be assured you'll
> end up with at least enough manager machines to fulfil the requirement you
> clearly stated in the command line).
>
> I haven't seen anything that makes me think that redesigning from scratch is
> in any way superior to refining what we already agreed upon; and it's
> distracting us from the questions of reporting and correcting manager
> failure when it occurs. I assert the following series of arguments:
>
> * users may discover at any time that they need to make an existing
>   environment HA, so ensure-ha is *always* a reasonable user action
> * users who *don't* need an HA environment can, by definition, afford to
>   take the environment down and reconstruct it without HA if it becomes
>   unimportant
> * therefore, scaling management *down* is not the highest priority for us
>   (but is nonetheless easily amenable to future control via the "ensure-ha"
>   command -- just explicitly set a lower redundancy number)
> * similarly, allowing users to *directly* destroy management machines
>   enables exciting new failure modes that don't really need to exist
>
> * the notion of HA is somewhat limited in worth when there's no way to make
>   a vulnerable environment robust again
> * the more complexity we shovel onto the user's plate, the less likely she
>   is to resolve the situation correctly under stress
> * the most obvious, and foolproof, command for repairing HA would be
>   "ensure-ha" itself, which could very reasonably take it upon itself to
>   replace manager nodes detected as "down" -- assuming a robust presence
>   implementation, which we need anyway, this (1) works trivially for machines
>   that die unexpectedly and (2) allows a backdoor for resolution of "weird"
>   situations: the user can manually shutdown a misbehaving manager
>   out-of-band, and run ensure-ha to cause a new one to be spun up in its
>   place; once HA is restored, the old machine will no longer be a manager, no
>   longer be indestructible, and can be cleaned up at leisure
>
> * the notion is even more limited when you can't even tell when something
>   goes wrong
> * therefore, HA state should *at least* be clearly and loudly communicated
>   in status
> * but that's not very proactive, and I'd like to see a plan for how we're
>   going to respond to these situations when we detect them
>
> * the data accessible to a manager node is sensitive, and we shouldn't
>   generally be putting manager nodes on dirty machines; but density is an
>   important consideration, and I don't think it's confusing to allow
>   "preferred" machines to be specified in "ensure-ha", such that *if*
>   management capacity needs to be added it will be put onto those machines
>   before finding clean ones or provisioning new ones
> * strawman syntax: "juju ensure-ha --prefer-machines 11,37" to place any
>   additional manager tasks that may be required on the supplied machines in
>   order of preference -- but even this falls far behind the essential goal,
>   which is "make HA *easy* for our users".
> * (ofc, we should continue not to put units onto manager machines by
>   default, but allow them when forced with --to as before)
>
> I don't believe that any of this precludes more sophisticated management of
> juju's internal functions *when* the need becomes pressing -- whether via
> jobs, or namespaced pseudo-services, or whatever -- but at this stage I
> think it is far better to expose the policies we're capable of supporting,
> and thus allow ourselves wiggle room to allow the mechanism to evolve, than
> to define a user-facing model that is, at best, a woolly reflection of an
> internal model that's likely to change as we explore the solution space in
> practice.
>
> Long-term, FWIW, I would be happiest to expose fine control over HA,
> scaling, etc by presenting juju's internal functionality as a namespaced
> group of services that *can* be configured and manipulated (as much as
> possible) like normal services, because... y'know... services/units is
> actually a pretty good user model; but I think we're all in agreement that
> we shouldn't go down that rabbit hole today.
>
> Cheers
> William
>
>
> [0] consider the case of 4 managers; as with 3, if any single machine goes
> down the system will continue to function, but will fail once the second
> dies; but the situation is strictly worse because the number of machines
> that *could* fail, and thus trigger a vulnerable situation, is larger.
>
>
> On Fri, Nov 8, 2013 at 11:31 AM, John Arbash Meinel <j...@arbash-meinel.com>
> wrote:
>>
>> On 2013-11-08 14:15, roger peppe wrote:
>> > On 8 November 2013 08:47, Mark Canonical Ramm-Christensen
>> > <mark.ramm-christen...@canonical.com> wrote:
>> >> I have a few high level thoughts on all of this, but the key
>> >> thing I want to say is that we need to get a meeting set up next
>> >> week for the solution to get hammered out.
>> >>
>> >> First, conceptually, I don't believe the user model needs to
>> >> match the implementation model. That way lies madness -- users
>> >> care about the things they care about and should not have to
>> >> understand how the system works to get something basic done.
>> >> See:
>> >> http://www.amazon.com/The-Inmates-Are-Running-Asylum/dp/0672326140
>> >> for reasons why I call this madness.
>> >>
>> >> For that reason I think the path of adding a --jobs flag to
>> >> add-machine is not a move forward. It is exposing implementation
>> >> detail to users and forcing them into a more complex conceptual
>> >> model.
>> >>
>> >> Second, we don't have to boil the ocean all at once. An
>> >> "ensure-ha" command that sets up additional server nodes is
>> >> better than what we have now -- nothing. Nate is right, the box
>> >> need not be black, we could have a juju ha-status command that
>> >> just shows the state of HA. This is fundamentally different
>> >> than changing the behavior and meaning of add-machines to know
>> >> about juju jobs and agents and forcing folks to think about
>> >> that.
>> >>
>> >> Third, I think it is possible to chart a course from ensure-ha
>> >> as a shortcut (implemented first) to the type of syntax and
>> >> feature set that Kapil is talking about. And let's not kid
>> >> ourselves, there are a bunch of new features in that proposal:
>> >>
>> >> * Namespaces for services
>> >> * support for subordinates to state services
>> >> * logging changes
>> >> * lifecycle events on juju "jobs"
>> >> * special casing the removal of services that would kill the
>> >>   environment
>> >> * special casing the stats to know about HA and warn for even
>> >>   state server nodes
>> >>
>> >> I think we will be adding a new concept and some new syntax when
>> >> we add HA to juju -- so the idea is just to make it easier for
>> >> users to understand, and to allow a path forward to something
>> >> like what Kapil suggests in the future. And I'm pretty solidly
>> >> convinced that there is an incremental path forward.
>> >>
>> >> Fourth, the spelling "ensure-ha" is probably not a very good
>> >> idea, the cracks in that system (like taking a -n flag, and
>> >> dealing with failed machines) are already apparent.
>> >>
>> >> I think something like Nick's proposal for "add-manager" would be
>> >> better. Though I don't think that's quite right either.
>> >>
>> >> So, I propose we add one new idea for users -- a state-server.
>> >>
>> >> then you'd have:
>> >>
>> >> juju management --info
>> >> juju management --add
>> >> juju management --add --to 3
>> >> juju management --remove-from
>> >
>> > This seems like a reasonable approach in principle (it's
>> > essentially isomorphic to the --jobs approach AFAICS which makes me
>> > happy).
>> >
>> > I have to say that I'm not keen on using flags to switch the basic
>> > behaviour of a command. The interaction between the flags can then
>> > become non-obvious (for example a --constraints flag might be
>> > appropriate with --add but not --remove-from).
>> >
>> > Ah, but your next message seems to go along with that.
>> >
>> > So, to couch your proposal in terms that are consistent with the
>> > rest of the juju commands, here's how I see it could look, in terms
>> > of possible help output from the commands:
>> >
>> > usage: juju add-management [options]
>> > purpose: Add Juju management functionality to a machine, or start
>> > a new machine with management functionality. Any Juju machine can
>> > potentially participate as a Juju manager - this command adds a
>> > new such manager. Note that there should always be an odd number
>> > of active management machines, otherwise the Juju environment is
>> > potentially vulnerable to network partitioning. If a management
>> > machine fails, a new one should be started to replace it.
>>
>> I would probably avoid putting such an emphasis on "any machine can be
>> a manager machine". But that is my personal opinion. (If you want HA
>> you probably want it on dedicated nodes.)
>>
>> >
>> > options:
>> > --constraints (= )
>> >     additional machine constraints. Ignored if --to is specified.
>> > -e, --environment (= "local")
>> >     juju environment to operate in
>> > --series (= "")
>> >     the Ubuntu series of the new machine. Ignored if --to is
>> >     specified.
>> > --to (="")
>> >     the id of the machine to add management to. If this is not
>> >     specified, a new machine is provisioned.
>> >
>> > usage: juju remove-management [options] <machine-id>
>> > purpose: Remove Juju management functionality from the machine
>> > with the given id. The machine itself is not destroyed. Note that
>> > if there are less than three management machines remaining, the
>> > operation of the Juju environment will be vulnerable to the
>> > failure of a single machine. It is not possible to remove the last
>> > management machine.
>> >
>>
>> I would probably also remove the machine if the only thing on it was
>> the management. Certainly that is how people want us to do "juju
>> remove-unit".
>>
>>
>> > options:
>> > -e, --environment (= "local")
>> >     juju environment to operate in
>> >
>> > As a start, we could implement only the add-management command, and
>> > not implement the --to flag. That would be sufficient for our HA
>> > deliverable, I believe. The other features could be added in time
>> > or according to customer demand.
>>
>> The main problem with this is that it feels slightly too easy to add
>> just 1 machine and then not actually have HA (mongo stops allowing
>> writes if you have a 2-node cluster and lose one, right?)
>>
>> John
>> =:->
>>
>> >
>> >> I know this is not following the add-machine format, but I think
>> >> it would be better to migrate that to something more like this:
>> >>
>> >> juju machine --add
>> >
>> > If we are going to do that, I think we should probably change all
>> > the commands at once - consistency is good.
>> >
>> > If we do the above, could we drop "juju ensure-ha" entirely, given
>> > the fact that the above commands are both easier to implement (I
>> > think!) and more powerful?
>> >
>>
>> --
>> Juju-dev mailing list
>> Juju-dev@lists.ubuntu.com
>> Modify settings or unsubscribe at:
>> https://lists.ubuntu.com/mailman/listinfo/juju-dev

--
gustavo @ http://niemeyer.net
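
For reference, the arithmetic behind footnote [0] and John's question about a 2-node cluster can be sketched as below. This is only an illustration and assumes the state servers need a strict majority of members to keep accepting writes, which is how I understand MongoDB replica sets to behave. The function names are made up for the example, and the mapping from a --redundancy=N value to 2N+1 machines is my own assumption, not something the thread specifies.

```python
def majority(n: int) -> int:
    """Smallest number of members that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many of n members can fail while a majority is still up."""
    return n - majority(n)


def servers_for_redundancy(redundancy: int) -> int:
    """Smallest cluster that survives `redundancy` simultaneous failures.

    Assumption for illustration: redundancy N maps to 2N + 1 state servers,
    which is always an odd number.
    """
    return 2 * redundancy + 1


if __name__ == "__main__":
    print("servers  majority  tolerated failures")
    for n in range(1, 6):
        print(f"{n:7d}  {majority(n):8d}  {tolerated_failures(n):18d}")
```

Under that assumption, 2 servers tolerate no failures at all, and 4 servers tolerate exactly one, the same as 3, while having one more machine that can be the first to fail. That is the sense in which an even count is "strictly worse".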