Re: High Availability command line interface - future plans.

2013-11-06 Thread Nick Veitch
just my tuppence...

Would it not be clearer to add an additional command to implement your
proposal? E.g. "add-manager" and possibly "destroy/remove-manager".
These could also support switches for finer control later, and would
possibly be less open to misinterpretation than overloading the
add-machine command.

Nick

On Wed, Nov 6, 2013 at 6:49 PM, roger peppe  wrote:
> The current plan is to have a single "juju ensure-ha-state" command.
> This would create new state server machines if there are fewer than
> the required number (currently 3).
>
> Taking that as given, I'm wondering what we should do
> in the future, when users require more than a single
> big On switch for HA.
>
> How does the user:
>
> a) know about the HA machines, so that the costs of HA are not hidden
> and the implications of particular machine failures are clear?
>
> b) fix the system when a machine dies?
>
> c) scale up the system to x thousand nodes?
>
> d) scale down the system?
>
> For a), we could tag a machine in the status as a "state server", and
> hope that the user knows what that means.
>
> For b) the suggestion is that the user notice that a state server machine
> is non-responsive (as marked in status) and runs destroy-machine on it,
> which will notice that it's a state server machine and automatically
> start another one to replace it. Destroy-machine would refuse to work
> on a state server machine that seems to be alive.
>
> For c) we could add a flag to ensure-ha-state suggesting a desired number
> of state-server nodes.
>
> I'm not sure what the suggestion is for d) given that we refuse to
> destroy live state-server machines.
>
> Although ensure-ha-state might be a fine way to turn
> on HA initially, I'm not entirely happy with expanding it to cover
> all the above cases. It seems to me like we're going
> to create a leaky abstraction that purports to be magic ("just wave the
> HA wand!") and ends up being limiting, and in some cases confusing
> ("Huh? I asked to destroy that machine and there's another one
> just been created").
>
> I believe that any user that's using HA will need to understand that
> some machines are running state servers, and when things fail, they
> will need to manage those machines individually (for example by calling
> destroy-machine).
>
> I also think that the solution to c) is limiting, because there is
> actually no such thing as a "state server" - we have at least three
> independently scalable juju components (the database servers (mongodb),
> the API servers and the environment managers) with different scaling
> characteristics. I believe that in any sufficiently large environment,
> the user will not want to scale all of those at the same rate. For example,
> MongoDB will allow at most 12 members of a replica set, but a caching API
> server could potentially usefully scale up much higher than that. We could
> add more flags to ensure-ha-state (e.g. --state-server-count), but then
> we'd lack the capability to suggest which might be grouped with which.
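
For illustration, suppose ensure-ha-state grew per-component count flags
(both flags below are hypothetical, not existing juju options):

$ juju ensure-ha-state --state-server-count 5 --api-server-count 20
# hypothetical: each component gets its own count, but there is still
# no way to express "co-locate an API server with each database node".
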
>
> PROPOSAL
>
> My suggestion is that we go for a "slightly less magic" approach
> that provides the user with the tools to manage their own
> high-availability setup, adding appropriate automation in time.
>
> I suggest that we let the user know that machines can run as juju server
> nodes, and provide them with the capability to *choose* which machines
> will run as server nodes and which can host units - that is, what *jobs*
> a machine will run.
>
> Here's a possible proposal:
>
> We already have an "add-machine" command. We'd add a "--jobs" flag
> to allow the user to specify the jobs that the new machine(s) will
> run. Initially we might have just two jobs, "manager" and "unit"
> - the machine can either host service units, or it can manage the
> juju environment (including running the state server database),
> or both. In time we could add finer levels of granularity to allow
> separate scalability of juju server components, without losing backwards
> compatibility.
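
For illustration, the proposed flag might be used like this (the syntax,
including the comma-separated job list, is a sketch of the proposal, not
an existing juju command):

$ juju add-machine --jobs manager        # manage the environment only
$ juju add-machine --jobs unit           # host service units only
$ juju add-machine --jobs manager,unit   # do both on one machine
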
>
> If the new machine is marked as a "manager", it would run a mongo
> replica set peer. This *would* mean that it would be possible to have
> an even number of mongo peers, with the potential for a split vote
> if the nodes were partitioned evenly, and resulting database stasis.
> I don't *think* that would actually be a severe problem in practice.
> We would make juju status point out the potential problem very clearly,
> just as it should point out the potential problem if one of an existing
> odd-sized replica set dies. The potential problems are the same in both
> cases, and are straightforward for even a relatively naive user to avoid.
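
For illustration, the replica-set arithmetic behind the split-vote concern:
electing a primary needs a strict majority, floor(n/2)+1 votes, so an even
member count raises the majority threshold without adding fault tolerance.

$ for n in 3 4 5; do echo "members=$n majority=$(( n/2 + 1 ))"; done
members=3 majority=2
members=4 majority=3
members=5 majority=3
# with 4 members, a 2-2 partition leaves neither side with the 3 votes
# needed to elect a primary - hence the potential stasis.
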
>
> Thus, juju ensure-ha-state is almost equivalent to:
>
> juju add-machine --jobs manager -n 2
>
> In my view, this command feels less "magic" than ensure-ha-state - the
> runtime implications (e.g. cost) of what's going on are easier for the
> user to understand, and it requires no new entities in a user's model of
> the system.
>
> In addition to the new add-machine flag, we'd add a single new command,
> "juju machine-jobs", which would allow the user to change the jobs
> associated with an existing machine.  That could be a later addition -
> it's not necessary in the first cut.
>
> With these primitives, I *think* the responsib

Re: High Availability command line interface - future plans.

2013-11-06 Thread Nate Finch
The answer to "how does the user know how to X?" is the same as it always
has been: documentation.  Now, that's not to say we don't still need
to do some work to make it intuitive... but I think that for something as
complicated as HA, leaning on documentation a little more is ok.

More inline:

On Wed, Nov 6, 2013 at 1:49 PM, roger peppe  wrote:

> The current plan is to have a single "juju ensure-ha-state" command.
> This would create new state server machines if there are fewer than
> the required number (currently 3).
>
> Taking that as given, I'm wondering what we should do
> in the future, when users require more than a single
> big On switch for HA.
>
> How does the user:
>
> a) know about the HA machines so the costs of HA are not hidden, and that
> the implications of particular machine failures are clear?
>

- As above, documentation about what it means when you see servers in juju
status labelled as "Juju State Server" (or whatever).

- Have actual feedback from commands:

$ juju bootstrap --high-availability
Machines 0, 1, and 2 provisioned as juju server nodes.
Juju successfully bootstrapped environment Foo in high availability mode.

or

$ juju bootstrap
Machine 0 provisioned as juju server node.
Juju successfully bootstrapped environment Foo.

$ juju ensure-ha -n 7
Enabling high availability mode with 7 juju servers.
Machines 1, 2, 3, 4, 5, and 6 provisioned as additional Juju server nodes.

$ juju ensure-ha -n 5
Reducing number of Juju server nodes to 5.
Machines 2 and 6 destroyed.

> b) fix the system when a machine dies?

$ juju destroy-machine 5
Destroyed machine/5.
Automatically replacing destroyed Juju server node.
Machine/8 created as new Juju server node.


> c) scale up the system to x thousand nodes?


Hopefully 12 machines is plenty of Juju servers for 5000 nodes.  We will
need to revisit this if it's not, but it seems like it should be plenty.
 As above, I think a simple -n is fine for both raising and lowering the
number of state servers.  If we get to the point of needing more than


> d) scale down the system?
>

 $ juju disable-ha -y
Destroyed machine/1 and machine/2.
The Juju server node for environment Foo is machine/0.
High availability mode disabled for Juju environment Foo.


Re: High Availability command line interface - future plans.

2013-11-06 Thread Kapil Thangavelu
On Thu, Nov 7, 2013 at 2:49 AM, roger peppe  wrote:

> [...]

instead of adding more complexity and concepts, it would be ideal if we
could reuse the primitives we already have. ie juju environments have three
user exposed services, that users can add-unit / remove-unit etc.  they have
a juju prefix and therefore are omitted by default from status listing.
That's a much simpler story to document. how do i scale my state server..
juju add-unit juju-db... my provisioner juju add-unit juju-provisioner.

Re: High Availability command line interface - future plans.

2013-11-06 Thread Nate Finch
Oops, missed the end of a thought there.  If we get to the point of needing
more than 12 server nodes (not unfathomable), then we'll have to start doing
some more work for our "hyperscale" customers, which will probably involve
much more customization and require much more knowledge of the system.

I think one of the points of making HA simple is that we don't want people
to have to learn how Juju works before they can deploy their own stuff in a
robust manner.  Keep the barrier to entry as low as possible.  We can give
general guidelines about how many Juju servers you need for N unit agents,
and then people will know what to set N to when they run juju ensure-ha -n.

I think most people will be happy knowing there are N servers out there,
and if one goes down, another will take its place. They don't want to know
about this job and that job.  Just make it work and let me get on with my
life. That's kind of the whole point of Juju, right?


On Wed, Nov 6, 2013 at 2:56 PM, Nate Finch  wrote:

> The answer to "how does the user know how to X?" is the same as it always
> has been.  Documentation.  Now, that's not to say that we still don't need
> to do some work to make it intuitive... but I think that for something that
> is complicated like HA, leaning on documentation a little more is ok.
>
> More inline:
>
> On Wed, Nov 6, 2013 at 1:49 PM, roger peppe  wrote:
>
>> The current plan is to have a single "juju ensure-ha-state" juju
>> command. This would create new state server machines if there are less
>> than the required number (currently 3).
>>
>> Taking that as given, I'm wondering what we should do
>> in the future, when users require more than a single
>> big On switch for HA.
>>
>> How does the user:
>>
>> a) know about the HA machines so the costs of HA are not hidden, and that
>> the implications of particular machine failures are clear?
>>
>
> - As above, documentation about what it means when you see servers in juju
> status labelled as "Juju State Server" (or whatever).
>
> - Have actual feedback from commands:
>
> $ juju bootstrap --high-availability
> Machines 0, 1, and 2 provisioned as juju server nodes.
> Juju successfully bootstrapped environment Foo in high availability mode.
>
> or
>
> $ juju bootstrap
> Machine 0 provisioned as juju server node.
> Juju successfully bootstrapped environment Foo.
>
> $ juju ensure-ha -n 7
> Enabling high availability mode with 7 juju servers.
> Machines 1, 2, 3, 4, 5, and 6 provisioned as additional Juju server nodes.
>
> $ juju ensure-ha -n 5
> Reducing number of Juju server nodes to 5.
> Machines 2 and 6 destroyed.
>
> b) fix the system when a machine dies?
>>
>
> $ juju destroy-machine 5
> Destroyed machine/5.
> Automatically replacing destroyed Juju server node.
> Machine/8 created as new Juju server node.
>
>
>> c) scale up the system to x thousand nodes
>
>
> Hopefully 12 machines is plenty of Juju servers for 5000 nodes.  We will
> need to revisit this if it's not, but it seems like it should be plenty.
>  As above, I think a simple -n is fine for both raising and lowering the
> number of state servers.  If we get to the point of needing more than
>
>
>> d) scale down the system?
>>
>
>  $ juju disable-ha -y
> Destroyed machine/1 and machine/2.
> The Juju server node for environment Foo is machine/0.
> High availability mode disabled for Juju environment Foo.
>
>


Re: High Availability command line interface - future plans.

2013-11-06 Thread David Cheney
+1 (million), this solution keeps coming up, and I still feel it is
the right one.

On Thu, Nov 7, 2013 at 7:07 AM, Kapil Thangavelu
 wrote:
> [...]

Re: High Availability command line interface - future plans.

2013-11-07 Thread roger peppe
On 6 November 2013 20:07, Kapil Thangavelu
 wrote:
> instead of adding more complexity and concepts, it would be ideal if we
> could reuse the primitives we already have. ie juju environments have three
> user exposed services, that users can add-unit / remove-unit etc.  they have
> a juju prefix and therefore are omitted by default from status listing.
> That's a much simpler story to document. how do i scale my state server..
> juju add-unit juju-db... my provisioner juju add-unit juju-provisioner.

I have a lot of sympathy with this point of view. I've thought about
it quite a bit.

I see two possibilities for implementing it:

1) Keep something like the existing architecture, where machine agents can
take on managerial roles, but provide a veneer over the top which
specially interprets service operations on the juju built-in services
and translates them into operations on machine jobs (see the sketch
after this list).

2) Actually implement the various juju services as proper services.
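
For illustration of 1), the veneer might rewrite built-in-service commands
into machine-job operations (the juju-db name comes from Kapil's suggestion;
the translation shown is a sketch, not implemented behaviour):

$ juju add-unit juju-db -n 2
# under the veneer, roughly equivalent to:
$ juju add-machine --jobs manager -n 2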

The difficulty I have with 1) is that there's a significant mismatch between
the user's view of things and what's going on underneath.
For instance, with a built-in service, can I:

- add a subordinate service to it?
- see the relevant log file in the usual place for a unit?
- see its charm metadata?
- join to its juju-info relation?

If it's a single service, how can its units span different series?
(presumably it has got a charm URL, which includes the series)

I fear that if we try this approach, the cracks show through
and the result is a system that's hard to understand because
too many things are not what they appear.
And that's not even going into the plethora of special
casing that this approach would require throughout the code.

2) is more attractive, as it's actually doing what's written on the
label. But this has its own problems.

- it's a highly significant architectural change.

- juju managerial services are tightly tied into the operation
of juju itself (not surprisingly). There are many chicken and egg
problems here - we would be trying to use the system to support itself,
and that could easily lead to deadlock as one part of the system
tries to talk to another part of the system that relies on the first.
I think it *might* be possible, but it's not gonna be easy
and I suspect nasty gotchas at the end of a long development process.

- again there are inevitably going to be many special cases
throughout the code - for instance, how does a unit
acquire the credentials it needs to talk to the API
server?

It may be that a hybrid approach is possible - for example
implementing the workers as a service and still having mongo
and the API server as machine workers. I think that's
a reasonable evolutionary step from the approach I'm proposing.


The reasoning behind my proposed approach perhaps
comes from the fact that (I'm almost ashamed to admit it)
I'm a lazy programmer. I don't like creating mountains of code
where a small amount will do almost as well.

Adding the concept of jobs on machines maps very closely
to the architecture that we have today. It is a single
extra concept for the user to understand - all the other
features (e.g. add-machine and destroy-machine) are already
exposed.

I agree that in an ideal world we would scale juju meta-services
just as we would scale normal services, but I think it's actually
reasonable to have a special case here.

Allowing the user to know that machines can take on juju managerial
roles doesn't seem to be a huge ask. And we get just as much
functionality with considerably less code, which seems like a significant
win to me in terms of ongoing maintainability and agility for the future.

  cheers,
rog.

PS now not cross-posting, sorry Tim - followups to juju@lists.ubuntu.com only.



Re: High Availability command line interface - future plans.

2013-11-07 Thread roger peppe
I've just realised that all the traffic for this thread was actually
in juju-dev,
so I'll revert to there. Another cross post then, my apologies.

On 7 November 2013 09:21, roger peppe  wrote:
> [...]
