Re: Resource Management/Locking [was: Re: What would be your ideal solution?]

2013-11-26 Thread Pierre-Yves Ritschard
Hi Everyone,

First off, I'm really excited that there is an ongoing discussion on
these issues.
I agree with John that CAP provides a good "framework" for looking at the
individual properties of the distributed system that CloudStack is, as a
whole. The separation between an orchestration layer and an automation layer
is also a valid abstraction of the main roles of the management server.

As far as CAP properties are concerned, I don't think there is much
question that the aim is for:

* a CP orchestration layer (it will continue to rely on a CP system: an
RDBMS)
* an AP automation layer (it is tied to an AP system, a cluster of
hypervisors)

As far as operations are concerned, I think the plugin approach in CS is
great: it allows distributing a very simple system to start with, in which a
single management server will most likely run. In large distributed systems
it is certainly not an unreasonable requirement to rely on ZooKeeper; in
many shops using CS, ZK is already used anyhow, and operation-wise it is no
more complex than, say, maintaining a highly available MySQL cluster.

Before I go on, I'll just acknowledge here that I'm not addressing the
issue of compatibility. All approaches discussed so far, except Darren's, do
not concern themselves with compatibility and upgrades, which will be a
major pain if the persistence layer / data store evolves in any significant
way. I know this is a big concern for CS users and Citrix and will need to
be taken into account; I don't have a clear picture of how this could be
done.

As far as persistence is concerned, there are different things that CS
stores which have different requirements:

* Organizational data needs strong consistency: users, accounts, domains,
projects, configuration (for networks, templates, ...)
* Transient resource data (VM running status) can only have eventual
consistency
* Usage data only requires eventual consistency (and does not need to
clutter the main data store)

I think one of the reasons for the head-scratching around resources right
now is that the persistence layer is currently used both for storing the
expected state of resources and their actual state; maybe there should be a
transient persistence layer used for storing known states.

So to sum up, as far as storage is concerned, it might be easier to reason
about CS in terms of three different persistence layers (a rough sketch
follows the list):

* A main layer for organizational data, expected state and last known state
* A layer for storing state as reported by resource owners (hypervisors)
* A mechanism for distributing usage data
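
A rough sketch of that separation, in plain Java with hypothetical interface
names (none of these types exist in CloudStack today):

    // 1. Strongly consistent store: organizational data, expected state and
    //    last known state. Backed by a CP system such as an RDBMS.
    interface AuthoritativeStore {
        void saveExpectedState(String resourceId, String expectedState);
        String loadExpectedState(String resourceId);
    }

    // 2. Transient store: state as reported by resource owners (hypervisors).
    //    Eventually consistent, and rebuildable from fresh reports at any time.
    interface ReportedStateStore {
        void recordReport(String resourceId, String observedState, long reportedAtMillis);
        String latestReport(String resourceId);
    }

    // 3. Usage pipeline: fire-and-forget distribution of usage records,
    //    kept out of the main data store.
    interface UsageSink {
        void emit(String accountId, String metric, double value, long timestampMillis);
    }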

With such a system, the mailbox approach is possible. I do think that the
amount of work in CS would be huge and that we would risk ending up with a
franken-Erlang type of system, which Java doesn't lend itself well to (Scala
surely could, but that would imply a total rewrite).

An intermediate step could be to look at resources the same way Apache
Kafka does (or, in a way, Apache Cassandra). Managers could be seen as a
homogeneous cluster, each member responsible for an nth of the resources
(for a cluster of n managers). A good mechanism is needed for agreeing on
cluster membership, but there are several proven and valid approaches for
this (and it's a problem that lends itself well to the plugin approach in CS).
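
A minimal sketch of the ownership rule (plain Java, invented names; membership
would come from ZK or similar, and a real implementation would likely use
consistent hashing rather than a plain modulo):

    import java.util.List;

    // Hypothetical: each manager owns roughly 1/n of the resources.
    final class OwnershipTable {
        private final List<String> sortedMemberIds; // e.g. ["mgmt-a", "mgmt-b", "mgmt-c"]

        OwnershipTable(List<String> sortedMemberIds) {
            this.sortedMemberIds = List.copyOf(sortedMemberIds);
        }

        // Deterministic mapping from a resource to the manager responsible for it.
        String ownerOf(String resourceId) {
            int bucket = Math.floorMod(resourceId.hashCode(), sortedMemberIds.size());
            return sortedMemberIds.get(bucket);
        }

        boolean isLocalOwner(String resourceId, String localMemberId) {
            return ownerOf(resourceId).equals(localMemberId);
        }
    }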

A typical incoming API request would thus hit any management node, which
could either issue a redirect to the correct node, proxy the request to the
correct node, or create a job id and let the client poll that job id for its
status.
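
To make the third option concrete, a hedged sketch (hypothetical types, not
actual CloudStack API-layer code):

    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.function.Function;

    // Hypothetical: any node accepts the request; only the owner executes it.
    final class RequestDispatcher {
        private final String localNodeId;
        private final Function<String, String> ownerOf;   // resourceId -> owning node id
        private final ConcurrentMap<String, String> jobStatus = new ConcurrentHashMap<>();

        RequestDispatcher(String localNodeId, Function<String, String> ownerOf) {
            this.localNodeId = localNodeId;
            this.ownerOf = ownerOf;
        }

        // Hand back a job id immediately and let the client poll it; a real
        // system would queue the work rather than run it inline.
        String submitAsync(String resourceId, Runnable work) {
            String jobId = UUID.randomUUID().toString();
            String owner = ownerOf.apply(resourceId);
            if (owner.equals(localNodeId)) {
                jobStatus.put(jobId, "running");
                work.run();
                jobStatus.put(jobId, "done");
            } else {
                // The redirect/proxy options would forward to 'owner' instead.
                jobStatus.put(jobId, "forwarded to " + owner);
            }
            return jobId;
        }

        String queryJob(String jobId) {
            return jobStatus.getOrDefault(jobId, "unknown");
        }
    }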

The upside of this approach is that it still makes it possible for CS to
become the Jenkins of cloud controllers (it would need an HSQLDB option for
persistence, though!) while relying on proven and well-understood projects
for larger deployments (like ZK or, once it stabilizes, an implementation of
Raft).

A first step towards this would be to have some sort of agreement on the
different layers of persistence needed throughout CS and try to move
forward. I can get my hands dirty and try to evolve the DAO stuff that is
everywhere in CS, but I'd like to know I'm not heading towards a dead end.

Re: Resource Management/Locking [was: Re: What would be your ideal solution?]

2013-11-25 Thread Darren Shepherd
Okay, I'll have to stew over this for a bit.  My one general comment is
that it seems complicated.  Such a system seems like it would take a good
amount of effort to construct properly and as such it's a risky endeavour.

Darren


Re: Resource Management/Locking [was: Re: What would be your ideal solution?]

2013-11-25 Thread John Burwell
Edison,

The CAP theorem applies to all distributed systems.  One “master” controlling a
bunch of hypervisors directed by an orchestration engine + ZooKeeper is a
distributed system.  In this case, it is a consistent system.  In my very brief
reading of it, CloudStack would need multiple Mesos masters to provide
availability in the event of zone or pod failures.  It would run into the same
explicit locking issues I previously described — ensuring the underlying
ZooKeeper infrastructure can maintain quorum in the face of zone and/or pod
failures.  While this is possible to achieve, it would greatly increase the
complexity of CloudStack deployments.

Thanks,
-John 

Re: Resource Management/Locking [was: Re: What would be your ideal solution?]

2013-11-25 Thread John Burwell
Darren,

In a peer-to-peer model such as I describe, active-active is and is not a
concept.  The supervision tree is responsible for identifying failure, and
initiating process re-allocation for failed resources.  For example, if a pod’s
management process crashed, it would also crash all of the processes managing
the hosts in that pod.  The zone would then attempt to restart the pod’s
management process (either local to the zone supervisor or on a remote instance,
which could be configurable) until it was able to start a “ready” process for
the child resource.
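
A hedged sketch of that restart behavior (plain Java, hypothetical types, no
particular supervision library assumed):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical pod supervisor: if the pod-level process dies, its child
    // (host) processes are torn down too, then everything is restarted until
    // it reports "ready".
    final class PodSupervisor {
        interface ManagedProcess {
            boolean start();      // true once the process reaches "ready"
            void stop();
        }

        private final ManagedProcess podProcess;
        private final List<ManagedProcess> hostProcesses;

        PodSupervisor(ManagedProcess podProcess, List<ManagedProcess> hostProcesses) {
            this.podProcess = podProcess;
            this.hostProcesses = new ArrayList<>(hostProcesses);
        }

        // Called by the zone supervisor when it detects the pod process failed.
        void onPodFailure() throws InterruptedException {
            hostProcesses.forEach(ManagedProcess::stop);   // crash the children too
            while (!podProcess.start()) {                  // retry until "ready"
                Thread.sleep(1_000);                       // naive backoff
            }
            for (ManagedProcess host : hostProcesses) {
                while (!host.start()) {
                    Thread.sleep(1_000);
                }
            }
        }
    }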

This model requires a “special” root supervisor that is controlled by the
orchestration tier and can identify when a zone supervisor becomes
unavailable, and attempts to restart it.  The ownership of this “special”
supervisor will require a consensus mechanism amongst the orchestration tier
processes to elect an owner of the process and determine when a new owner needs
to be elected (e.g. a Raft implementation such as barge [1]).  Given that the
orchestration tier is designed as an AP system, any orchestration tier process
should be able to be the owner (i.e. the operator is not required to identify a
“master” node).  There are likely other potential topologies (e.g. a root
supervisor per zone rather than one for all zones), but in all cases ownership
election would be the same.  Most importantly, there are no data durability
requirements in this claim model.  When an orchestration process becomes unable
to continue owning a root supervisor, the other orchestration processes
recognize the missing owner and initiate the ownership claim process for the
partition.
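
A minimal sketch of the ownership hand-off, assuming an invented
LeaderElection interface standing in for a Raft implementation (barge) or
similar; none of this is barge's actual API:

    // Hypothetical: each orchestration-tier process runs this; whichever one
    // the election picks actually runs the root supervisor.
    final class RootSupervisorOwner {
        interface LeaderElection {
            boolean isLeader();             // does this process currently own it?
        }
        interface RootSupervisor {
            void run();                     // supervise the zone supervisors
            void shutdown();
        }

        private final LeaderElection election;
        private final RootSupervisor rootSupervisor;
        private volatile boolean owning;

        RootSupervisorOwner(LeaderElection election, RootSupervisor rootSupervisor) {
            this.election = election;
            this.rootSupervisor = rootSupervisor;
        }

        // Invoked periodically by every orchestration-tier process.
        void tick() {
            boolean leader = election.isLeader();
            if (leader && !owning) {
                owning = true;
                rootSupervisor.run();       // claim ownership: start supervising
            } else if (!leader && owning) {
                owning = false;
                rootSupervisor.shutdown();  // lost ownership: a peer claims it
            }
        }
    }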

In all failure scenarios, the supervision tree must be rebuilt from the point
of failure downward using the process allocation approach I previously
described.  For an initial implementation, I would recommend simply throwing
away any parts of the supervision tree that are already running in the event of
a widespread failure (e.g. a zone with many pods).  Depending on the recovery
time and SLAs, a future optimization may be to re-attach “orphaned” branches of
the previous tree to the tree being built as part of the recovery process (e.g.
loss of a zone supervisor due to a switch failure).  Additionally, the system
would also need a mechanism to hand off ownership of the root supervisor for
planned outages (hardware upgrades/decommissioning, maintenance windows, etc).

Again, caveated with a few hand waves, the idea is to build up a peer-to-peer
management model that provides strict serialization guarantees.  Fundamentally,
it utilizes a tree of processes to provide exclusive access, distribute work,
and ensure availability requirements when partitions occur.  Details would need
to be worked out for the best application to CloudStack (e.g. root node
ownership and orchestration tier gossip), but we would be implementing
well-trod distributed systems concepts in the context of cloud orchestration
(sounds like a fun thing to do …).

Thanks,
-John

[1]: https://github.com/mgodave/barge

P.S. I see the libraries/frameworks referenced as the building blocks to a 
solution, but none of them (in whole or combination) solves the problem 
completely.

RE: Resource Management/Locking [was: Re: What would be your ideal solution?]

2013-11-25 Thread Edison Su
Won't the architecture used by Mesos/Omega solve the resource
management/locking issue?
http://mesos.apache.org/documentation/latest/mesos-architecture/
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
Basically, one server holds all the resource information about the whole data
center in memory (CPU/memory/disk/IP address, etc.); all the hypervisor hosts
and any other resource entities connect to this server to report/update their
own resources. As there is only one master server, the CAP theorem does not
apply.
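
A minimal in-memory sketch of that single-master model (plain Java, invented
names, not Mesos code):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical: hosts push reports, the master keeps the whole data
    // center's resource state in memory and answers placement queries from it.
    final class InMemoryResourceMaster {
        record HostReport(long freeCpuMhz, long freeMemoryMb, long freeDiskGb) {}

        private final Map<String, HostReport> byHostId = new ConcurrentHashMap<>();

        // Called whenever a hypervisor host (or other resource entity) reports in.
        void onReport(String hostId, HostReport report) {
            byHostId.put(hostId, report);
        }

        // Placement decisions read only this in-memory view.
        String pickHostWithAtLeast(long cpuMhz, long memoryMb) {
            return byHostId.entrySet().stream()
                    .filter(e -> e.getValue().freeCpuMhz() >= cpuMhz
                              && e.getValue().freeMemoryMb() >= memoryMb)
                    .map(Map.Entry::getKey)
                    .findFirst()
                    .orElse(null);
        }
    }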


Re: Resource Management/Locking [was: Re: What would be your ideal solution?]

2013-11-25 Thread Darren Shepherd
I will ask one basic question.  How do you foresee managing one mailbox per
resource?  If I have multiple servers running in an active-active mode, how
do you determine which server has the mailbox?  Do you create actors on
demand?  How do you synchronize that operation?

Darren


Re: Resource Management/Locking [was: Re: What would be your ideal solution?]

2013-11-25 Thread Darren Shepherd
You bring up some interesting points.  I really need to digest this
further.  From a high level I think I agree, but there are a lot of implied
details in what you've said.

Darren


Resource Management/Locking [was: Re: What would be your ideal solution?]

2013-11-25 Thread John Burwell
Darren,

I originally presented my thoughts on this subject at CCC13 [1].
Fundamentally, I see CloudStack as having two distinct tiers — orchestration
management and automation control.  The orchestration tier coordinates the
automation control layer to fulfill user goals (e.g. create a VM instance,
alter a network route, snapshot a volume, etc) constrained by policies defined
by the operator (e.g. multi-tenancy boundaries, ACLs, quotas, etc).  This layer
must always be available to take new requests, and to report the best available
infrastructure state information.  Since execution of work is guaranteed on
completion of a request, this layer may pend work to be completed when the
appropriate devices become available.
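
A hedged sketch of that "always accept, pend until the device is back"
behavior (plain Java, hypothetical types, not CloudStack code):

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Hypothetical: the orchestration tier never rejects a request; if the
    // target device is unavailable, the unit of work is pended instead.
    final class PendingWorkQueue {
        interface UnitOfWork {
            String targetDeviceId();
            void execute();
        }
        interface DeviceAvailability {
            boolean isAvailable(String deviceId);
        }

        private final Queue<UnitOfWork> pending = new ArrayDeque<>();
        private final DeviceAvailability availability;

        PendingWorkQueue(DeviceAvailability availability) {
            this.availability = availability;
        }

        // Accept unconditionally; execution is guaranteed eventually, not immediately.
        synchronized void accept(UnitOfWork work) {
            pending.add(work);
        }

        // Driven by a timer or by "device became available" events.
        synchronized void drain() {
            for (int i = pending.size(); i > 0; i--) {
                UnitOfWork work = pending.poll();
                if (availability.isAvailable(work.targetDeviceId())) {
                    work.execute();
                } else {
                    pending.add(work);      // keep pending until the device is back
                }
            }
        }
    }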

The automation control tier translates logical units of work to underlying
infrastructure component APIs.  Upon completion of a unit of work’s execution,
the state of a device (e.g. hypervisor, storage device, network switch, router,
etc) matches the state managed by the orchestration tier at the time the unit
of work was created.  In order to ensure that the state of the underlying
devices remains consistent, these units of work must be executed serially.
Permitting concurrent changes to resources creates race conditions that lead to
resource overcommitment and state divergence.  A symptom of this phenomenon is
the myriad of scripts operators write to “synchronize” state between the
CloudStack database and their hypervisors.  Another is the rapid create-destroy
example provided below, which can (and often does) leave dangling resources due
to race conditions between the two operations.
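
A minimal sketch of per-resource serial execution (plain Java, hypothetical
names): one single-threaded lane per resource id, so two units of work against
the same hypervisor (or volume, or network) can never interleave, while
different resources still proceed in parallel.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    final class PerResourceSerializer {
        private final ConcurrentMap<String, ExecutorService> lanes = new ConcurrentHashMap<>();

        // Units of work for the same resource queue up behind each other;
        // units for different resources run on independent lanes.
        void submit(String resourceId, Runnable unitOfWork) {
            lanes.computeIfAbsent(resourceId, id -> Executors.newSingleThreadExecutor())
                 .submit(unitOfWork);
        }
    }

Under such a scheme, the rapid create-destroy case is safe only because the
destroy is queued behind the create on the same lane rather than racing it.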

In order to provide reliability, CloudStack vertically partitions the
infrastructure into zones (independent power source/network uplink combination)
sub-divided into pods (racks).  At this time, regions are largely notional and,
as such, are not partitions.  Between the user’s zone selection and our
allocators’ distribution of resources across pods, the system attempts to
distribute resources as widely as possible across these partitions to provide
resilience against a variety of infrastructure failures (e.g. power loss,
network uplink disruption, switch failures, etc).  In order to maximize this
resilience, the control plane (orchestration + automation tiers) must be able
to operate on all available partitions.  For example, if we have two (2) zones
(A & B) and twenty (20) pods per zone, we should be able to take and execute
work in Zone A when one or more of its pods is lost, as well as when Zone B has
failed entirely.
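
As a data-model sketch only (plain Java records, invented names, no behavior):

    import java.util.List;

    // Hypothetical partition model: a zone is the coarse failure domain,
    // subdivided into pods (racks) of hosts.
    record Host(String id) {}
    record Pod(String id, List<Host> hosts) {}
    record Zone(String id, List<Pod> pods) {}

With two zones of twenty pods each, as in the example, losing a pod removes one
entry from a zone's pod list; losing Zone B entirely must leave Zone A's tree
fully operable.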

CloudStack is an eventually consistent system in that the state reflected in 
the orchestration tier will (optimistically) differ from the state of the 
underlying infrastructure (managed by the automation tier).  Furthermore, the 
system has a partitioning model to provide resilience in the face of a variety 
of logical and physical failures.  However, the automation control tier 
requires strictly consistent operations.  Based on these definitions, the 
system appears to violate the CAP theorem [2] (Brewer!).  The separation of the 
system into two distinct tiers isolates these characteristics, but the boundary 
between them must be carefully implemented to ensure that the consistency 
requirements of the automation tier are not leaked to the orchestration tier.

To properly implement this boundary, I think we should split the orchestration
and automation control tiers into separate physical processes communicating via
an RPC mechanism — allowing the automation control tier to completely
encapsulate its work distribution model.  In my mind, the tricky wicket is
providing serialization and partition tolerance in the automation control tier.
Realistically, there are two options — explicit and implicit locking models.
Explicit locking models employ an external coordination mechanism to provide
exclusive access to resources (e.g. RDBMS lock pattern, ZooKeeper, Hazelcast,
etc).  The challenge with this model is ensuring the availability of the
locking mechanism in the face of partition — forcing CloudStack operators to
ensure that they have deployed the underlying mechanism in a partition-tolerant
manner (e.g. don’t locate all of the replicas in the same pod, deploy a cluster
per zone, etc).  Additionally, the durability introduced by these mechanisms
inhibits self-healing due to lock staleness.
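
To contrast the two shapes, a hedged sketch in plain Java with invented
interfaces — deliberately not ZooKeeper, Hazelcast, or any real lock API:

    import java.time.Duration;
    import java.util.Optional;

    // Explicit model: callers coordinate through an external lock service
    // (RDBMS row lock, ZooKeeper, Hazelcast, ...). The service itself must be
    // deployed partition-tolerantly, and stale locks need expiry or repair.
    interface ExplicitResourceLock {
        // Returns a handle to close (release) if the lock was acquired in time.
        Optional<AutoCloseable> tryAcquire(String resourceId, Duration timeout);
    }

    // Implicit model: no lock object at all; exclusivity falls out of the
    // runtime structure, e.g. exactly one mailbox/actor per resource (compare
    // the PerResourceSerializer sketch earlier in the thread).
    interface ResourceMailbox {
        void enqueue(String resourceId, Runnable unitOfWork);
    }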

In contrast, an implicit lock model structures the runtime execution model to
provide exclusive access to a resource and model the partitioning scheme.  One
such model is to provide a single work queue (mailbox) and consuming process
(actor) per resource.  The orchestration tier provides a description of the
partition and resource definitions to the automation control tier.  The
automation control tier creates a supervisor per partition, which in turn
manages process creation per resource.  Therefore, process creation