Darren,

My response does hand wave two important issues -- hot code reloading and PermGen leakage. These are tricky but well-trodden issues that can be solved in a variety of ways (e.g. instrumentation, class loaders, OSGi). It would require some research/experimentation to determine the best approach, particularly when using a lightweight threading model.
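The per-resource actor model discussed in the quoted message below (one lightweight thread and mailbox per managed resource, with a supervisor swapping in a new driver version at a message boundary) could be sketched roughly as follows. This is a minimal illustration under assumptions, not CloudStack code: all names are hypothetical, and the kill-and-respawn step is reduced to an atomic driver swap.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: one "actor" (thread + mailbox) per managed resource.
// An upgrade takes effect at the next message boundary, so no in-flight
// operation is ever interrupted.
public class ResourceSupervisor {
    public interface Driver { String handle(String message); }

    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private final AtomicReference<Driver> driver;
    private final List<String> processed = new CopyOnWriteArrayList<>();
    private volatile boolean running = true;
    private Thread actor;

    public ResourceSupervisor(Driver initial) {
        this.driver = new AtomicReference<>(initial);
    }

    public void start() {
        actor = new Thread(() -> {
            while (running) {
                try {
                    // Poll with a timeout so stop() can terminate the loop.
                    String msg = mailbox.poll(50, TimeUnit.MILLISECONDS);
                    if (msg != null) {
                        // The driver reference is re-read for every message,
                        // so a swap applies between messages, never during one.
                        processed.add(driver.get().handle(msg));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        actor.start();
    }

    public void send(String message) { mailbox.add(message); }

    // The kill-and-respawn step reduced to its essence: once the current
    // message completes, subsequent messages use the new driver version.
    public void upgrade(Driver newDriver) { driver.set(newDriver); }

    public List<String> processed() { return processed; }

    public void stop() throws InterruptedException {
        running = false;
        actor.join();
    }
}
```

A real implementation would replace the atomic swap with supervisor-managed actor termination and respawn, but the operator-visible behavior -- zero lost messages across an upgrade -- is the same.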
Thanks,
-John

On Aug 20, 2013, at 8:31 PM, John Burwell <[email protected]> wrote:

> Darren,
>
> Actually, loading and unloading aren't difficult if resource management and drivers work within the following constraints/assumptions:
>
> Drivers are transient and stateless
> A driver instance is assigned per resource managed (i.e. no singletons)
> A lightweight thread and mailbox (i.e. actor model) are assigned per resource managed (outlined in the presentation referenced below)
>
> Based on these constraints and assumptions, the following upgrade process could be implemented:
>
> Load and verify the new driver version to make it available
> Notify the supervisor processes of each affected resource that a new driver is available
> Upon completion of the current message being processed by its associated actor, the supervisor kills and respawns the actor managing its associated resource
> As part of startup, the supervisor injects an instance of the new driver version, and the actor resumes processing messages in its mailbox
>
> This process mirrors the one that would occur on management server startup for each resource, minus killing an existing actor instance. Eventually, the system will upgrade the driver without loss of operation. More sophisticated policies could be added, but I think this approach would be a solid default upgrade behavior. As a bonus, this same approach could also be applied to global configuration settings -- allowing the system to apply changes to these values without restarting the system.
>
> In summary, CloudStack and Eclipse are very different types of systems. Eclipse is a desktop application implementing complex workflows, user interactions, and management of shared state (e.g. project structure, AST, compiler status, etc.). In contrast, CloudStack is an eventually consistent distributed system performing automation control.
> As such, its plugin requirements are not only very different but, IMHO, much simpler.
>
> Thanks,
> -John
>
> On Aug 20, 2013, at 7:44 PM, Darren Shepherd <[email protected]> wrote:
>
>> I know this isn't terribly useful, but I've been drawing a lot of squares and circles and lines that connect those squares and circles lately, and I have a lot of architectural ideas for CloudStack. At the rate I'm going, it will take me about two weeks to put together a discussion/proposal for the community. What I'm thinking is a superset of what you've listed out and should align with your idea of a CAR. The focus has a lot to do with modularity and extensibility.
>>
>> So more to come soon.... I will say one thing, though: with Java you end up having a hard time doing dynamic loading and unloading of modules. There are plenty of frameworks that try really hard to do this right, like OSGi, but it's darn near impossible to get right because of class loading and GC issues (and that's why Eclipse has you restart after installing plugins even though it is OSGi).
>>
>> I do believe that CloudStack should be capable of zero-downtime maintenance, and I have ideas around that, but at the end of the day, for plenty of practical reasons, you still need a JVM restart if modules change.
>>
>> Darren
>>
>> On Aug 20, 2013, at 3:39 PM, Mike Tutkowski <[email protected]> wrote:
>>
>>> I agree, John - let's get consensus first, then talk timetables.
>>>
>>>
>>> On Tue, Aug 20, 2013 at 4:31 PM, John Burwell <[email protected]> wrote:
>>>
>>>> Mike,
>>>>
>>>> Before we can dig into timelines or implementations, I think we need to get consensus on the problem to be solved and the goals. Once we have a proper understanding of the scope, I believe we can chunk the work across a set of development lifecycles. The subject is vast, but it also has a far-reaching impact on both the storage and network layer evolution efforts.
>>>> As such, I believe we need to start addressing it as part of the next release.
>>>>
>>>> As a separate thread, we need to discuss the timeline for the next release. I think we need to avoid the time compression caused by the overlap of the 4.1 stabilization effort and 4.2 development. Therefore, I don't think we should consider development of the next release started until the first 4.2 RC is released. I will try to open a separate discussion thread for this topic, as well as tying in the discussion of release code names.
>>>>
>>>> Thanks,
>>>> -John
>>>>
>>>> On Aug 20, 2013, at 6:22 PM, Mike Tutkowski <[email protected]> wrote:
>>>>
>>>>> Hey John,
>>>>>
>>>>> I think this is some great stuff. Thanks for the write-up.
>>>>>
>>>>> It looks like you have ideas around what might go into a first release of this plug-in framework. Were you thinking we'd have enough time to squeeze that first rev into 4.3? I'm just wondering (it's not a huge deal to hit that release for this) because we would only have about five weeks.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> On Tue, Aug 20, 2013 at 3:43 PM, John Burwell <[email protected]> wrote:
>>>>>
>>>>>> All,
>>>>>>
>>>>>> In capturing my thoughts on storage, my thinking backed into the driver model. While we have the beginnings of such a model today, I see the following deficiencies:
>>>>>>
>>>>>> 1. *Multiple Models*: The Storage, Hypervisor, and Security layers each have a slightly different model for allowing system functionality to be extended/substituted. These differences increase the barrier to entry for vendors seeking to extend CloudStack and accrete code paths to be maintained and verified.
>>>>>> 2. *Leaky Abstraction*: Plugins are registered through a Spring configuration file.
In addition to being operator-unfriendly (most sysadmins are not Spring experts, nor do they want to be), we expose the core bootstrapping mechanism to operators. Therefore, a misconfiguration could negatively impact the injection/configuration of internal management server components -- essentially handing them a loaded shotgun pointed at our right foot.
>>>>>> 3. *Nondeterministic Load/Unload Model*: Because the core loading mechanism is Spring, the management server has little control over the timing and order of component loading/unloading. Changes to the Management Server's component dependency graph could break a driver by causing it to be started at an unexpected time.
>>>>>> 4. *Lack of Execution Isolation*: As Spring components, plugins are loaded into the same execution context as core management server components. Therefore, an errant plugin can corrupt the entire management server.
>>>>>>
>>>>>> For the next revision of the plugin/driver mechanism, I would like to see us migrate towards a standard pluggable driver model that supports all of the management server's extension points (e.g. network devices, storage devices, hypervisors, etc.) with the following capabilities:
>>>>>>
>>>>>> - *Consolidated Lifecycle and Startup Procedure*: Drivers share a common state machine and categorization (e.g. network, storage, hypervisor, etc.) that permits the deterministic calculation of initialization and destruction order (i.e. network layer drivers -> storage layer drivers -> hypervisor drivers). Inter-dependencies would be supported between plugins sharing the same category.
>>>>>> - *In-process Installation and Upgrade*: Adding or upgrading a driver does not require the management server to be restarted.
This capability implies a system that supports the simultaneous execution of multiple driver versions and the ability to suspend work on a resource while the underlying driver instance is replaced.
>>>>>> - *Execution Isolation*: The deployment packaging and execution environment allows different (and potentially conflicting) versions of dependencies to be used simultaneously. Additionally, plugins would be sufficiently sandboxed to protect the management server against driver instability.
>>>>>> - *Extension Data Model*: Drivers provide a property bag with a metadata descriptor to validate and render vendor-specific data. The contents of this property bag will be provided to every driver operation invocation at runtime. The metadata descriptor would be a lightweight description that provides a label resource key, a description resource key, a data type (string, date, number, boolean), a required flag, and an optional length limit.
>>>>>> - *Introspection*: Administrative APIs/UIs allow operators to understand which drivers are present in the system, their configuration, and their current state.
>>>>>> - *Discoverability*: Optionally, drivers can be discovered via a project repository definition (similar to Yum), allowing drivers to be remotely acquired and operators to be notified regarding update availability. The project would also provide, free of charge, certificates to sign plugins. This mechanism would support local mirroring to support air-gapped management networks.
>>>>>>
>>>>>> Fundamentally, I do not want to turn CloudStack into an erector set with more screws than nuts, which is a risk with highly pluggable architectures.
>>>>>> As such, I think we would need to tightly bound the scope of drivers and their behaviors to prevent the loss of system usability and stability. My thinking is that drivers would be packaged into a custom JAR, a CAR (CloudStack ARchive), structured as follows:
>>>>>>
>>>>>> - META-INF
>>>>>>   - MANIFEST.MF
>>>>>>   - driver.yaml (driver metadata (e.g. version, name, description, etc.) serialized in YAML format)
>>>>>>   - LICENSE (a text file containing the driver's license)
>>>>>> - lib (driver dependencies)
>>>>>> - classes (driver implementation)
>>>>>> - resources (driver message files and potentially JS resources)
>>>>>>
>>>>>> The management server would acquire drivers through a simple scan of a URL (e.g. file directory, S3 bucket, etc.). For every CAR object found, the management server would create an execution environment (likely a dedicated ExecutorService and ClassLoader) and transition the state of the driver to Running (the exact state model would need to be worked out). To be really nice, we could develop a custom Ant task/Maven plugin/Gradle plugin to create CARs. I can also imagine opportunities to add hooks to this model to register instrumentation information with JMX and authorization.
>>>>>>
>>>>>> To keep the scope of this email confined, we would introduce the general notion of a Resource and (hand wave, hand wave) eventually compartmentalize the execution of work around a resource [1]. This (hand-waved) compartmentalization would give us the controls necessary to safely and reliably perform in-place driver upgrades. For an initial release, I would recommend implementing the abstractions, the loading mechanism, the extension data model, and the discovery features. With these capabilities in place, we could attack the in-place upgrade model.
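The CAR acquisition step described above (scan a location, give each discovered archive its own execution environment) could be sketched as follows, assuming a local file directory. The class and method names are hypothetical; the driver.yaml read, signature verification, and state machine are elided.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the proposed CAR scan: each *.car archive found in
// a directory gets a dedicated ClassLoader, so conflicting dependencies in
// one driver's lib/ cannot collide with another driver's.
public class CarScanner {
    public static Map<String, ClassLoader> scan(Path directory) throws IOException {
        Map<String, ClassLoader> drivers = new LinkedHashMap<>();
        try (DirectoryStream<Path> cars = Files.newDirectoryStream(directory, "*.car")) {
            for (Path car : cars) {
                URL[] classpath = { car.toUri().toURL() };
                // Parent is the management server's loader; the child loader
                // isolates the driver's own classes and dependencies.
                ClassLoader isolated = new URLClassLoader(
                        classpath, CarScanner.class.getClassLoader());
                drivers.put(car.getFileName().toString(), isolated);
                // Next steps (not shown): read META-INF/driver.yaml, verify
                // the signature, and transition the driver state to Running.
            }
        }
        return drivers;
    }
}
```

Pairing each loader with a dedicated ExecutorService, as the message suggests, would complete the per-driver execution environment.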
>>>>>> If we were to adopt such a pluggable capability, we would have the opportunity to decouple the vendor and CloudStack release schedules. For example, if a vendor were introducing a new product that required a new or updated driver, they would no longer need to wait for a CloudStack release to support it. They would also gain the ability to fix high-priority defects in the same manner.
>>>>>>
>>>>>> I have hand waved a number of issues that would need to be resolved before such an approach could be implemented. However, I think we need to decide, as a community, that it is worth devoting energy and effort to enhancing the plugin/driver model, and agree on the goals of that effort, before driving head first into the deep rabbit hole of design/implementation.
>>>>>>
>>>>>> Thoughts? (/me ducks)
>>>>>> -John
>>>>>>
>>>>>> [1]: My opinions on the matter from CloudStack Collab 2013 -> http://www.slideshare.net/JohnBurwell1/how-to-run-from-a-zombie-cloud-stack-distributed-process-management
>>>>>
>>>>>
>>>>> --
>>>>> *Mike Tutkowski*
>>>>> *Senior CloudStack Developer, SolidFire Inc.*
>>>>> e: [email protected]
>>>>> o: 303.746.7302
>>>>> Advancing the way the world uses the cloud<http://solidfire.com/solution/overview/?video=play>
>>>>> *™*
>>>
>>>
>>> --
>>> *Mike Tutkowski*
>>> *Senior CloudStack Developer, SolidFire Inc.*
>>> e: [email protected]
>>> o: 303.746.7302
>>> Advancing the way the world uses the cloud<http://solidfire.com/solution/overview/?video=play>
>>> *™*
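As an illustration of the extension data model proposed in the thread -- a property bag validated against a driver-supplied metadata descriptor (label/description resource keys, data type, required flag, optional length limit) -- a minimal sketch might look like this. All names are hypothetical, and DATE validation is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the management server validates the operator-supplied
// property bag against the driver's descriptor before invoking any driver
// operation, keeping vendor-specific data out of core code.
public class DriverDescriptor {
    public enum Type { STRING, DATE, NUMBER, BOOLEAN }

    // One descriptor field: label/description resource keys, a data type,
    // a required flag, and an optional length limit (null = unlimited).
    public record Field(String labelKey, String descriptionKey, Type type,
                        boolean required, Integer maxLength) {}

    private final Map<String, Field> fields = new LinkedHashMap<>();

    public DriverDescriptor add(String name, Field f) {
        fields.put(name, f);
        return this;
    }

    // Returns validation errors; an empty list means the bag is acceptable.
    public List<String> validate(Map<String, String> propertyBag) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, Field> e : fields.entrySet()) {
            String value = propertyBag.get(e.getKey());
            Field f = e.getValue();
            if (value == null) {
                if (f.required()) errors.add(e.getKey() + ": required");
                continue;
            }
            if (f.maxLength() != null && value.length() > f.maxLength())
                errors.add(e.getKey() + ": exceeds length " + f.maxLength());
            if (f.type() == Type.NUMBER && !value.matches("-?\\d+(\\.\\d+)?"))
                errors.add(e.getKey() + ": not a number");
            if (f.type() == Type.BOOLEAN && !value.matches("true|false"))
                errors.add(e.getKey() + ": not a boolean");
            // DATE parsing omitted for brevity.
        }
        return errors;
    }
}
```

The same descriptor could drive UI rendering, since it carries the label and description resource keys the message describes.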
