Okay, I'll have to stew over this for a bit. My one general comment is that it seems complicated. Such a system seems like it would take a good amount of effort to construct properly and as such it's a risky endeavour.
Darren

On Mon, Nov 25, 2013 at 12:10 PM, John Burwell <jburw...@basho.com> wrote:

> Darren,
>
> In a peer-to-peer model such as I describe, active-active is and is not a concept. The supervision tree is responsible for identifying failure and initiating process re-allocation for failed resources. For example, if a pod’s management process crashed, it would also crash all of the processes managing the hosts in that pod. The zone would then attempt to restart the pod’s management process (either local to the zone supervisor or on a remote instance, which could be configurable) until it was able to start a “ready” process for the child resource.
>
> This model requires a “special” root supervisor, controlled by the orchestration tier, that identifies when a zone supervisor becomes unavailable and attempts to restart it. Ownership of this “special” supervisor will require a consensus mechanism amongst the orchestration tier processes to elect an owner of the process and to determine when a new owner needs to be elected (e.g. a Raft implementation such as barge [1]). Given that the orchestration tier is designed as an AP system, any orchestration tier process should be able to become the owner (i.e. the operator is not required to identify a “master” node). There are likely other potential topologies (e.g. a root supervisor per zone rather than one for all zones), but in all cases ownership election would work the same way. Most importantly, there are no data durability requirements in this claim model. When an orchestration process becomes unable to continue owning a root supervisor, the other orchestration processes recognize the missing owner and initiate the ownership claim process for the partition.
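>
> To make the ownership mechanics a bit more concrete, here is a rough sketch of how an orchestration process might claim and release the root supervisor. The LeaderElection interface is only a stand-in for whatever consensus implementation we pick (e.g. a Raft library such as barge [1]); it is not barge’s actual API, and RootSupervisor/RootSupervisorOwner are illustrative stubs, not existing classes:
>
>     // Stand-in for a Raft-backed election amongst the orchestration tier
>     // peers; callbacks fire when this process gains or loses leadership.
>     interface LeaderElection {
>         void onElected(Runnable callback);
>         void onRevoked(Runnable callback);
>     }
>
>     // Illustrative stub: owns the zone supervisors and rebuilds them on demand.
>     final class RootSupervisor {
>         void rebuildFromZoneDefinitions() { /* allocate zone supervisors */ }
>         void shutdown() { /* crash children, releasing their implicit locks */ }
>     }
>
>     final class RootSupervisorOwner {
>         private final LeaderElection election;
>         private volatile RootSupervisor root;
>
>         RootSupervisorOwner(LeaderElection election) {
>             this.election = election;
>         }
>
>         void start() {
>             // No durable state is required: losing ownership simply means the
>             // next elected peer rebuilds the supervision tree from the top down.
>             election.onElected(() -> {
>                 root = new RootSupervisor();
>                 root.rebuildFromZoneDefinitions();
>             });
>             election.onRevoked(() -> {
>                 if (root != null) { root.shutdown(); root = null; }
>             });
>         }
>     }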
>
> In all failure scenarios, the supervision tree must be rebuilt from the point of failure downward using the process allocation approach I previously described. For an initial implementation, I would recommend simply throwing away any parts of the supervision tree that are already running in the event of a widespread failure (e.g. a zone with many pods). Depending on recovery time and SLAs, a future optimization may be to re-attach “orphaned” branches of the previous tree to the tree being built as part of the recovery process (e.g. loss of a zone supervisor due to a switch failure). Additionally, the system would also need a mechanism to hand off ownership of the root supervisor for planned outages (hardware upgrades/decommissioning, maintenance windows, etc.).
>
> Again, caveated with a few hand waves, the idea is to build up a peer-to-peer management model that provides strict serialization guarantees. Fundamentally, it utilizes a tree of processes to provide exclusive access, distribute work, and meet availability requirements when partitions occur. Details would need to be worked out for the best application to CloudStack (e.g. root node ownership and orchestration tier gossip), but we would be implementing well-trod distributed systems concepts in the context of cloud orchestration (sounds like a fun thing to do …).
>
> Thanks,
> -John
>
> [1]: https://github.com/mgodave/barge
>
> P.S. I see the libraries/frameworks referenced as the building blocks of a solution, but none of them (in whole or in combination) solves the problem completely.
>
> On Nov 25, 2013, at 12:29 PM, Darren Shepherd <darren.s.sheph...@gmail.com> wrote:
>
> I will ask one basic question. How do you foresee managing one mailbox per resource? If I have multiple servers running in an active-active mode, how do you determine which server has the mailbox? Do you create actors on demand? How do you synchronize that operation?
>
> Darren
>
> On Mon, Nov 25, 2013 at 10:16 AM, Darren Shepherd <darren.s.sheph...@gmail.com> wrote:
>
>> You bring up some interesting points. I really need to digest this further. From a high level I think I agree, but there are a lot of implied details in what you've said.
>>
>> Darren
>>
>> On Mon, Nov 25, 2013 at 8:39 AM, John Burwell <jburw...@basho.com> wrote:
>>
>>> Darren,
>>>
>>> I originally presented my thoughts on this subject at CCC13 [1]. Fundamentally, I see CloudStack as having two distinct tiers — orchestration management and automation control. The orchestration tier coordinates the automation control layer to fulfill user goals (e.g. create a VM instance, alter a network route, snapshot a volume, etc.) constrained by policies defined by the operator (e.g. multi-tenancy boundaries, ACLs, quotas, etc.). This layer must always be available to take new requests and to report the best available infrastructure state information. Since execution of work is guaranteed once a request has been accepted, this layer may pend work to be completed when the appropriate devices become available.
>>>
>>> The automation control tier translates logical units of work into underlying infrastructure component API calls. Upon completion of a unit of work’s execution, the state of a device (e.g. hypervisor, storage device, network switch, router, etc.) matches the state managed by the orchestration tier at the time the unit of work was created. In order to ensure that the state of the underlying devices remains consistent, these units of work must be executed serially. Permitting concurrent changes to resources creates race conditions that lead to resource overcommitment and state divergence. One symptom of this phenomenon is the myriad of scripts operators write to “synchronize” state between the CloudStack database and their hypervisors. Another is the rapid create-destroy example provided below, which can (and often does) leave dangling resources due to race conditions between the two operations.
>>>
>>> In order to provide reliability, CloudStack vertically partitions the infrastructure into zones (independent power source/network uplink combinations) sub-divided into pods (racks). Regions are largely notional at this time and, as such, are not partitions. Between the user’s zone selection and our allocators’ distribution of resources across pods, the system attempts to distribute resources as widely as possible across these partitions to provide resilience against a variety of infrastructure failures (e.g. power loss, network uplink disruption, switch failures, etc.). In order to maximize this resilience, the control plane (orchestration + automation tiers) must be able to operate on all available partitions. For example, if we have two (2) zones (A & B) and twenty (20) pods per zone, we should be able to take and execute work in Zone A when one or more of its pods is lost, and continue taking and executing work in Zone A when Zone B has failed entirely.
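>>>
>>> As a rough illustration of the hand-off between the two tiers, the orchestration tier would pend logical units of work keyed by resource, and the automation control tier would drain them serially per resource. The type below is purely illustrative (it is not an existing CloudStack class):
>>>
>>>     // A logical unit of work pended by the orchestration tier and executed
>>>     // serially, per resource, by the automation control tier. The rapid
>>>     // create-destroy example below would be two instances of this targeting
>>>     // the same resource id, and would therefore run in order of receipt.
>>>     interface UnitOfWork {
>>>         String resourceId();             // e.g. a VM, volume, or network id
>>>         void execute() throws Exception; // translate to device/API calls
>>>     }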
>>>
>>> CloudStack is an eventually consistent system in that the state reflected in the orchestration tier will (optimistically) differ from the state of the underlying infrastructure (managed by the automation tier). Furthermore, the system has a partitioning model to provide resilience in the face of a variety of logical and physical failures. However, the automation control tier requires strictly consistent operations. Based on these definitions, the system appears to violate the CAP theorem [2] (Brewer!). The separation of the system into two distinct tiers isolates these characteristics, but the boundary between them must be carefully implemented to ensure that the consistency requirements of the automation tier are not leaked to the orchestration tier.
>>>
>>> To properly implement this boundary, I think we should split the orchestration and automation control tiers into separate physical processes communicating via an RPC mechanism — allowing the automation control tier to completely encapsulate its work distribution model. In my mind, the tricky wicket is providing serialization and partition tolerance in the automation control tier. Realistically, there are two options — explicit and implicit locking models. Explicit locking models employ an external coordination mechanism to enforce exclusive access to resources (e.g. the RDBMS lock pattern, ZooKeeper, Hazelcast, etc.). The challenge with this model is ensuring the availability of the locking mechanism in the face of partition — forcing CloudStack operators to ensure that they have deployed the underlying mechanism in a partition-tolerant manner (e.g. don’t locate all of the replicas in the same pod, deploy a cluster per zone, etc.). Additionally, the durability introduced by these mechanisms inhibits self-healing due to lock staleness.
>>>
>>> In contrast, an implicit locking model structures the runtime execution model to provide exclusive access to a resource and to model the partitioning scheme. One such model is to provide a single work queue (mailbox) and consuming process (actor) per resource. The orchestration tier provides a description of the partition and resource definitions to the automation control tier. The automation control tier creates a supervisor per partition, which in turn manages process creation per resource. Therefore, process creation and destruction act as the implicit lock. Since the automation control tier does not persist data in this model, the crash of a supervisor and/or process (supervisors are simply specialized processes) releases the implicit lock and signals a re-execution of the supervisor/process allocation process. The following high-level steps describe allocation (hand-waving certain details such as back pressure and throttling); a rough sketch in code follows the list:
>>>
>>> 1. The automation control layer receives a resource definition (e.g. zone description, VM definition, volume information, etc.). These requests are processed by the owning partition supervisor exclusively, in order of receipt. Therefore, the automation control tier views the world as a tree of partitions and resources.
>>>
>>> 2. The partition supervisor creates the process (and the associated mailbox) — providing it with the initial state. The process state is Initialized.
>>>
>>> 3. The process synchronizes the state of the underlying resource with the state provided. Upon successful completion of state synchronization, the state of the process becomes Ready. Only Ready processes can consume units of work from their mailboxes. If synchronization fails, the process crashes. All state transitions and crashes are reported to interested parties through an asynchronous event reporting mechanism, including the id of the unit of work the device represents.
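>>>
>>> The sketch below shows roughly what this could look like in plain Java, reusing the UnitOfWork interface from above. It is deliberately naive (a real implementation would sit on a lightweight threading/actor library rather than raw threads), and every name here is made up:
>>>
>>>     import java.util.Map;
>>>     import java.util.concurrent.BlockingQueue;
>>>     import java.util.concurrent.ConcurrentHashMap;
>>>     import java.util.concurrent.LinkedBlockingQueue;
>>>
>>>     // One mailbox and one consuming process per resource: the single consumer
>>>     // is the implicit lock, and the FIFO mailbox serializes pending work.
>>>     final class ResourceProcess implements Runnable {
>>>         enum State { INITIALIZED, READY, CRASHED }
>>>
>>>         private final String resourceId;
>>>         private final BlockingQueue<UnitOfWork> mailbox = new LinkedBlockingQueue<>();
>>>         private volatile State state = State.INITIALIZED;
>>>
>>>         ResourceProcess(String resourceId) { this.resourceId = resourceId; }
>>>
>>>         void submit(UnitOfWork work) { mailbox.add(work); }
>>>
>>>         @Override
>>>         public void run() {
>>>             try {
>>>                 synchronizeDeviceState();       // step 3: converge device state
>>>                 state = State.READY;
>>>                 while (state == State.READY) {  // only Ready processes consume work
>>>                     mailbox.take().execute();
>>>                 }
>>>             } catch (Exception e) {
>>>                 // A crash releases the implicit lock, flushes pending work, and
>>>                 // signals the supervisor to re-run the allocation process.
>>>                 state = State.CRASHED;
>>>                 mailbox.clear();
>>>             }
>>>         }
>>>
>>>         private void synchronizeDeviceState() { /* device/API specific */ }
>>>     }
>>>
>>>     // A supervisor per partition (zone, pod, ...) creates and restarts processes.
>>>     final class PartitionSupervisor {
>>>         private final Map<String, ResourceProcess> processes = new ConcurrentHashMap<>();
>>>
>>>         ResourceProcess allocate(String resourceId) {
>>>             return processes.computeIfAbsent(resourceId, id -> {
>>>                 ResourceProcess process = new ResourceProcess(id);
>>>                 new Thread(process, "resource-" + id).start(); // real impl: lightweight threads
>>>                 return process;
>>>             });
>>>         }
>>>     }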
>>>
>>> The Ready state means that the underlying device is in a usable state consistent with the last unit of work executed. A process crashes when it is unable to bring the device into a state consistent with the unit of work being executed (a process crash also destroys the associated mailbox — flushing pending work). This event initiates execution of the allocation process (above) until the process can be re-allocated in a Ready state (again, throttling is hand-waved for brevity). The state synchronization step converges the actual state of the device with changes that occurred during unavailability. When a unit of work fails to be executed, the orchestration tier determines the appropriate recovery strategy (e.g. re-allocate the work to another resource, wait for the availability of the resource, fail the operation, etc.).
>>>
>>> The association of one process per resource provides exclusive access to the resource without the requirement of an external locking mechanism. A mailbox per process orders pending units of work. Together, they provide serialization of operation execution. In the example provided, a unit of work would be submitted to create a VM and a second unit of work would be submitted to destroy it. The creation would be completely executed, followed by the destruction (assuming no failures). Therefore, the VM will briefly exist before being destroyed. In conjunction with a process location mechanism, the system can place the processes associated with resources in the appropriate partition, allowing the system to properly self-heal, manage its own scalability (thinking lightweight system VMs), and systematically enforce partition tolerance (the operator was nice enough to describe their infrastructure — we should use it to ensure the resilience of both CloudStack and their infrastructure).
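>>>
>>> In terms of the earlier sketch, Darren’s use case would look roughly like the following. Whichever management server accepts each request simply routes it to the single process owning that VM; the names again reuse the illustrative types from above:
>>>
>>>     final class Example {
>>>         // Both units of work target the same resource id ("vm-1"), so they land
>>>         // in the same mailbox and execute strictly in order of receipt: the
>>>         // create completes before the destroy runs, regardless of which
>>>         // management server accepted each request.
>>>         static void handle(UnitOfWork createVm1, UnitOfWork destroyVm1,
>>>                            PartitionSupervisor supervisor) {
>>>             ResourceProcess vm1 = supervisor.allocate("vm-1"); // implicit lock on VM 1
>>>             vm1.submit(createVm1);   // request accepted by server 1
>>>             vm1.submit(destroyVm1);  // request accepted by server 2
>>>         }
>>>     }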
>>>
>>> Until relatively recently, the implicit locking model described was infeasible on the JVM. Using native Java threads, a server would be limited to controlling (at best) a few hundred resources. However, lightweight threading models implemented by libraries/frameworks such as Akka [3], Quasar [4], and Erjang [5] can scale to millions of “threads” on reasonably sized servers and provide the supervisor/actor/mailbox abstractions described above. Most importantly, this approach does not require operators to become operationally knowledgeable about yet another platform/component. In short, I believe we can encapsulate these requirements in the management server (orchestration + automation control tiers) — keeping the operational footprint of the system proportional to the deployment without sacrificing resilience. Finally, it provides the foundation for proper collection of instrumentation information and process control/monitoring across data centers.
>>>
>>> Admittedly, I have hand-waved some significant issues that would need to be resolved. I believe they are all resolvable, but it would take discussion to determine the best approach to them. Transforming CloudStack to such a model would not be trivial, but I believe it would be worth the (significant) effort, as it would make CloudStack one of the most scalable and resilient cloud orchestration/management platforms available.
>>>
>>> Thanks,
>>> -John
>>>
>>> [1]: http://www.slideshare.net/JohnBurwell1/how-to-run-from-a-zombie-cloud-stack-distributed-process-management
>>> [2]: http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
>>> [3]: http://akka.io
>>> [4]: https://github.com/puniverse/quasar
>>> [5]: https://github.com/trifork/erjang/wiki
>>>
>>> P.S. I have CC’ed the developer mailing list. All conversations at this level of detail should be initiated and occur on the mailing list to ensure transparency with the community.
>>>
>>> On Nov 22, 2013, at 3:49 PM, Darren Shepherd <darren.s.sheph...@gmail.com> wrote:
>>>
>>> 140 characters are not productive.
>>>
>>> What would be your ideal way to do distributed concurrency control? Simple use case: Server 1 receives a request to start VM 1. Server 2 receives a request to delete VM 1. What do you do?
>>>
>>> Darren