I could give inline answers, but let's not waste too much more time. One point I would like to make is that the life-cycle functions that driver writers implement take care of how (in what state) instances are stopped.
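(Purely as an illustrative sketch of the kind of life-cycle contract being described here; the interface and state names are hypothetical, not an existing CloudStack or OSGi API:)

    // Hypothetical contract a driver author would implement; the framework drives
    // the transitions, the driver decides how to reach a safe stopped state.
    public interface DriverLifecycle {
        enum State { LOADED, STARTED, DRAINING, STOPPED }

        /** Bring the driver instance up for one managed resource. */
        void start();

        /** Finish or hand off in-flight work so the instance can stop in a consistent state. */
        void drain();

        /** Release resources; called only after drain() completes. */
        void stop();

        /** Current state, so the framework can decide when a stop or upgrade is safe. */
        State state();
    }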
Your point on restricting dependencies is valid and a real concern. And so as not to end this discussion here, I would like to refer back to my previous post; I would love to help on this, notwithstanding any objections I have about the way to go. It seems like fun to implement :)

regards,
Daan

On Mon, Aug 26, 2013 at 5:13 AM, John Burwell <jburw...@basho.com> wrote:
> Daan,
>
> Please see my responses in-line below. The TL;DR is that I am extremely skeptical of the complexity and flexibility of OSGi. My experience with it in practice has not been positive. However, I want to focus on our requirements for a driver mechanism, and then determine the best implementation.
>
> Thanks,
> -John
>
> On Aug 21, 2013, at 4:14 AM, Daan Hoogland <daan.hoogl...@gmail.com> wrote:
>
>> John,
>>
>> You do want 'In-process Installation and Upgrade', 'Introspection' and 'Discoverability', which says that you do want flexibility. You disqualify Spring and OSGi on exactly this quality, however.
>
> On the surface, it would appear that OSGi fits In-process Installation and Upgrade. However, OSGi assumes a consistency attribute that is too rigid for CloudStack. As I understand the specification, when a bundle is upgraded, all instances in the container are upgraded simultaneously. Based on my reading of it, there is no way to customize this behavior. I think the upgrade process needs to be eventually consistent, whereby the underlying driver instance for a resource is upgraded when it is in both a consistent and an upgradeable state. For example, say we have 10,000 KVM hosts and the KVM driver is upgraded. 9,000 of them are idle and can take the upgrade immediately. The other 1,000 are in some state of operation (creating and destroying VMs, taking snapshots, etc). For these 1,000, we want the upgrade to happen when they complete their current work. Most importantly, we don't want any work bound for these 10,000 resources during the upgrade to be lost, only delayed.
>
> When I say discoverability, I mean end-users finding drivers to install. The more I think about it, the more I explicitly do not want drivers to depend on each other. Drivers should be self-contained, stateless mechanisms that interact with some piece of infrastructure. I think the path to madness lies in having a messy web of cross-vendor driver dependencies.
>
>> If we can restrict the use of bundles to those that adhere to some interfaces we prescribe, I don't think either complexity or dependency is an issue.
>
> The only restriction I see is the ability of a bundle to control what is publicly exported. However, I see no way to restrict how bundles depend on each other -- opening the door to cross-vendor driver dependencies.
>
>> Most every bit of complexity of writing a bundle can be hidden from the bundle developer nowadays. If we cannot hide enough, it is indeed not an option. The main focus of OSGi is life-cycle management, which is exactly what we need. The use that Eclipse makes of it is a good example not to follow, but that doesn't disqualify the entire thing.
>
> Personally, I am dubious that a build process can mask complexity. More importantly, I don't like creating designs that require tooling and code generation with a veneer of simplicity but actually create spooky action at a distance. I prefer creating truly simple systems that can be easily comprehended.
>
>> The dependency hell is not different from what we have as a regular monolithic development group.
>> We control what we package and how. A valid point is that some libraries might have issues that prevent them from being bundled, and that needs investigation. So we would need to package those libraries as bundles ourselves so that 3rd parties don't need to. We package them now anyway.
>
> In my experience, the dependency management problem is magnified by the added hurdle that every dependency be an OSGi bundle. Many projects do not natively ship OSGi bundles, leaving third parties or the project itself to repackage them. Often, OSGi-bundled versions lag behind the most current project releases.
>
>> The erector-set fear you have is just as valid with as without OSGi or any existing framework.
>
> Agreed. I prefer inaction on this topic to creating said erector set.
>
>> I don't insist on OSGi, and I do agree with your initial set of requirements. When I read it I think, "let's use OSGi". And I don't see anything but fear of the beast in your arguments against it. Maybe your fear is just in my perception, or maybe it is very valid. I don't perceive it yet after your reply, though.
>
> To my mind, OSGi is a wonderful idea. We need it, or something like it, standard in the JVM. However, in practice, it is a difficult beast because it works around limitations in the JVM. When it works, it is awesome until it breaks or you hit the dependency hell I described. If we adopt it, we need to ensure it will fit our needs and that the functional gain merits taking on the burden of its risks.
>
>> regards,
>> Daan
>>
>> On Wed, Aug 21, 2013 at 9:00 AM, John Burwell <jburw...@basho.com> wrote:
>>> Daan,
>>>
>>> I have the following issues with OSGi:
>>>
>>> Complexity: Building OSGi components adds a tremendous amount of complexity to both building drivers and debugging runtime issues. Additionally, OSGi has a much broader feature set than I think CloudStack needs to support. Therefore, driver authors may use the feature set in unanticipated ways that create system instability.
>>> Dependency Hell: OSGi requires 3rd-party dependencies to be packaged as OSGi bundles. In practice, many third-party libraries either have issues that prevent them from being bundled, or their OSGi-bundled versions are behind the mainline release.
>>>
>>> As an additional personal experience, I do not want to re-create the mess that is Eclipse (i.e. an erector set with more screws than nuts). In addition to its lack of reliability, it is incredibly difficult to comprehend how the component configurations and relationships are composed at runtime.
>>>
>>> To be clear, I am not interested in creating a general-purpose component/plugin model. Fundamentally, we need a simple, purpose-built component model focused on providing stability and reliability through deterministic behavior rather than feature flexibility. Unfortunately, both OSGi and Spring focus on the latter, which makes them ill-suited for our purposes.
>>>
>>> Thanks,
>>> -John
>>>
>>> On Aug 21, 2013, at 2:31 AM, Daan Hoogland <daan.hoogl...@gmail.com> wrote:
>>>
>>> John,
>>>
>>> Nice work.
>>> Given the maturity of OSGi, I'd say let's see how it fits. One criterion would be whether we can limit the bundles that may be loaded to what CloudStack supports (and not allow loading pydev); if not, we need to bake our own.
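(A minimal sketch of what such a restriction could look like; the Driver interface and the category names below are hypothetical, not an existing CloudStack API:)

    import java.util.Set;

    /** Illustration only: admit a candidate class only if it implements a prescribed contract. */
    public final class DriverAdmission {

        /** The contract CloudStack would prescribe; anything else is refused. */
        public interface Driver {
            void start();
            void stop();
        }

        // Driver categories the management server is willing to host.
        private static final Set<String> SUPPORTED_CATEGORIES = Set.of("network", "storage", "hypervisor");

        /** True only for classes implementing the prescribed contract in a supported category. */
        public static boolean isLoadable(Class<?> candidate, String declaredCategory) {
            return Driver.class.isAssignableFrom(candidate)
                    && SUPPORTED_CATEGORIES.contains(declaredCategory);
        }
    }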
>>> But though I think your work is valuable, I disagree with designing our CARs from the get-go without having explored usable options in the field first. A new type of YAR is not what the world or CloudStack needs. And given what you have written, the main problem will be finding a framework we can restrict to what we want, not one that can do all of it.
>>>
>>> done shooting,
>>> Daan
>>>
>>> On Wed, Aug 21, 2013 at 2:52 AM, Darren Shepherd <darren.s.sheph...@gmail.com> wrote:
>>>
>>> Sure, I fully understand how it theoretically works, but I'm saying that from a practical perspective it always seems to fall apart. What you're describing is done excellently in OSGi 4.2 Blueprint. It's a beautiful framework that allows you to expose services that can be dynamically updated at runtime.
>>>
>>> The issues always happen with unloading. I'll give you a real-world example. As part of the servlet spec you're supposed to be able to stop and unload WARs. But in practice, if you do it enough times you typically run out of memory. One such issue was with commons-logging (since fixed). When you do getLogger(MyClass.class), it would cache a mapping from the Class object to the actual log impl. The commons-logging jar is typically loaded by the system classloader, but MyClass.class would be loaded in the webapp classloader. So when you stop the WAR there is a reference chain: system classloader -> LogFactory -> MyClass -> webapp classloader. So the webapp never gets GC'd.
>>>
>>> So just pointing out the practical issues, that's it.
>>>
>>> Darren
>>>
>>> On Aug 20, 2013, at 5:31 PM, John Burwell <jburw...@basho.com> wrote:
>>>
>>> Darren,
>>>
>>> Actually, loading and unloading aren't difficult if resource management and drivers work within the following constraints/assumptions:
>>>
>>> - Drivers are transient and stateless
>>> - A driver instance is assigned per resource managed (i.e. no singletons)
>>> - A lightweight thread and mailbox (i.e. actor model) are assigned per resource managed (outlined in the presentation referenced below)
>>>
>>> Based on these constraints and assumptions, the following upgrade process could be implemented:
>>>
>>> 1. Load and verify the new driver version to make it available
>>> 2. Notify the supervisor process of each affected resource that a new driver is available
>>> 3. Upon completion of the current message being processed by its associated actor, the supervisor kills and respawns the actor managing its associated resource
>>> 4. As part of startup, the supervisor injects an instance of the new driver version, and the actor resumes processing messages in its mailbox
>>>
>>> This process mirrors the process that would occur on management server startup for each resource, minus killing an existing actor instance. Eventually, the system will upgrade the driver without loss of operation. More sophisticated policies could be added, but I think this approach would be a solid default upgrade behavior. As a bonus, this same approach could also be applied to global configuration settings -- allowing the system to apply changes to those values without restarting.
>>>
>>> In summary, CloudStack and Eclipse are very different types of systems. Eclipse is a desktop application implementing complex workflows, user interactions, and management of shared state (e.g. project structure, AST, compiler status, etc).
>>> In contrast, CloudStack is an eventually consistent distributed system performing automation control. As such, its plugin requirements are not only very different but, IMHO, much simpler.
>>>
>>> Thanks,
>>> -John
>>>
>>> On Aug 20, 2013, at 7:44 PM, Darren Shepherd <darren.s.sheph...@gmail.com> wrote:
>>>
>>> I know this isn't terribly useful, but I've been drawing a lot of squares and circles and lines that connect those squares and circles lately, and I have a lot of architectural ideas for CloudStack. At the rate I'm going it will take me about two weeks to put together a discussion/proposal for the community. What I'm thinking is a superset of what you've listed out and should align with your idea of a CAR. The focus has a lot to do with modularity and extensibility.
>>>
>>> So more to come soon.... I will say one thing, though: with Java you end up having a hard time doing dynamic loading and unloading of modules. There are plenty of frameworks that try really hard to do this right, like OSGi, but it's darn near impossible to do it right because of class loading and GC issues (and that's why Eclipse has you restart after installing plugins even though it is OSGi).
>>>
>>> I do believe that CloudStack should be capable of zero-downtime maintenance, and I have ideas around that, but at the end of the day, for plenty of practical reasons, you still need a JVM restart if modules change.
>>>
>>> Darren
>>>
>>> On Aug 20, 2013, at 3:39 PM, Mike Tutkowski <mike.tutkow...@solidfire.com> wrote:
>>>
>>> I agree, John - let's get consensus first, then talk timetables.
>>>
>>> On Tue, Aug 20, 2013 at 4:31 PM, John Burwell <jburw...@basho.com> wrote:
>>>
>>> Mike,
>>>
>>> Before we can dig into timelines or implementations, I think we need to get consensus on the problem to be solved and the goals. Once we have a proper understanding of the scope, I believe we can chunk the work across a set of development cycles. The subject is vast, but it also has a far-reaching impact on both the storage and network layer evolution efforts. As such, I believe we need to start addressing it as part of the next release.
>>>
>>> As a separate thread, we need to discuss the timeline for the next release. I think we need to avoid the time compression caused by the overlap of the 4.1 stabilization effort and 4.2 development. Therefore, I don't think we should consider development of the next release started until the first 4.2 RC is released. I will try to open a separate discussion thread for this topic, as well as tie in the discussion of release code names.
>>>
>>> Thanks,
>>> -John
>>>
>>> On Aug 20, 2013, at 6:22 PM, Mike Tutkowski <mike.tutkow...@solidfire.com> wrote:
>>>
>>> Hey John,
>>>
>>> I think this is some great stuff. Thanks for the write-up.
>>>
>>> It looks like you have ideas around what might go into a first release of this plug-in framework. Were you thinking we'd have enough time to squeeze that first rev into 4.3? I'm just wondering (it's not a huge deal to hit that release for this) because we would only have about five weeks.
>>>
>>> Thanks
>>>
>>> On Tue, Aug 20, 2013 at 3:43 PM, John Burwell <jburw...@basho.com> wrote:
>>>
>>> All,
>>>
>>> In capturing my thoughts on storage, my thinking backed into the driver model. While we have the beginnings of such a model today, I see the following deficiencies:
>>>
>>> 1. *Multiple Models*: The Storage, Hypervisor, and Security layers each have a slightly different model for allowing system functionality to be extended/substituted. These differences increase the barrier to entry for vendors seeking to extend CloudStack and accrete code paths that must be maintained and verified.
>>> 2. *Leaky Abstraction*: Plugins are registered through a Spring configuration file. In addition to being operator unfriendly (most sysadmins are not Spring experts, nor do they want to be), we expose the core bootstrapping mechanism to operators. Therefore, a misconfiguration could negatively impact the injection/configuration of internal management server components. Essentially, we are handing them a loaded shotgun pointed at our right foot.
>>> 3. *Nondeterministic Load/Unload Model*: Because the core loading mechanism is Spring, the management server has little control over the timing and order of component loading/unloading. Changes to the Management Server's component dependency graph could break a driver by causing it to be started at an unexpected time.
>>> 4. *Lack of Execution Isolation*: As Spring components, plugins are loaded into the same execution context as core management server components. Therefore, an errant plugin can corrupt the entire management server.
>>>
>>> For the next revision of the plugin/driver mechanism, I would like to see us migrate towards a standard pluggable driver model that supports all of the management server's extension points (e.g. network devices, storage devices, hypervisors, etc) with the following capabilities:
>>>
>>> - *Consolidated Lifecycle and Startup Procedure*: Drivers share a common state machine and categorization (e.g. network, storage, hypervisor, etc) that permits the deterministic calculation of initialization and destruction order (i.e. network layer drivers -> storage layer drivers -> hypervisor drivers). Plugin inter-dependencies would be supported only between plugins sharing the same category.
>>> - *In-process Installation and Upgrade*: Adding or upgrading a driver does not require the management server to be restarted. This capability implies a system that supports the simultaneous execution of multiple driver versions and the ability to suspend continued execution of work on a resource while the underlying driver instance is replaced.
>>> - *Execution Isolation*: The deployment packaging and execution environment allows different (and potentially conflicting) versions of dependencies to be used simultaneously. Additionally, plugins would be sufficiently sandboxed to protect the management server against driver instability.
>>> - *Extension Data Model*: Drivers provide a property bag with a metadata descriptor to validate and render vendor-specific data. The contents of this property bag will be provided to every driver operation invocation at runtime. The metadata descriptor would be a lightweight description that provides a label resource key, a description resource key, a data type (string, date, number, boolean), a required flag, and an optional length limit.
>>> - *Introspection*: Administrative APIs/UIs allow operators to understand which drivers are installed in the system, their configuration, and their current state.
>>> - *Discoverability*: Optionally, drivers can be discovered via a project repository definition (similar to Yum), allowing drivers to be remotely acquired and operators to be notified regarding update availability. The project would also provide, free of charge, certificates to sign plugins. This mechanism would support local mirroring for air-gapped management networks.
>>>
>>> Fundamentally, I do not want to turn CloudStack into an erector set with more screws than nuts, which is a risk with highly pluggable architectures. As such, I think we would need to tightly bound the scope of drivers and their behaviors to prevent the loss of system usability and stability. My thinking is that drivers would be packaged into a custom JAR, a CAR (CloudStack ARchive), that would be structured as follows:
>>>
>>> - META-INF
>>> - MANIFEST.MF
>>> - driver.yaml (driver metadata (e.g. version, name, description, etc) serialized in YAML format)
>>> - LICENSE (a text file containing the driver's license)
>>> - lib (driver dependencies)
>>> - classes (driver implementation)
>>> - resources (driver message files and potentially JS resources)
>>>
>>> The management server would acquire drivers through a simple scan of a URL (e.g. a file directory, an S3 bucket, etc). For every CAR object found, the management server would create an execution environment (likely a dedicated ExecutorService and ClassLoader) and transition the state of the driver to Running (the exact state model would need to be worked out). To be really nice, we could develop a custom Ant task/Maven plugin/Gradle plugin to create CARs. I can also imagine opportunities to add hooks to this model to register instrumentation information with JMX and to add authorization.
>>>
>>> To keep the scope of this email confined, we would introduce the general notion of a Resource and (hand wave, hand wave) eventually compartmentalize the execution of work around a resource [1]. This (hand-waved) compartmentalization would give us the controls necessary to safely and reliably perform in-place driver upgrades. For an initial release, I would recommend implementing the abstractions, loading mechanism, extension data model, and discovery features. With these capabilities in place, we could attack the in-place upgrade model.
>>>
>>> If we were to adopt such a pluggable capability, we would have the opportunity to decouple the vendor and CloudStack release schedules. For example, if a vendor were introducing a new product that required a new or updated driver, they would no longer need to wait for a CloudStack release to support it. They would also gain the ability to fix high-priority defects in the same manner.
>>>
>>> I have hand-waved a number of issues that would need to be resolved before such an approach could be implemented.
>>> However, I think we need to decide, as a community, that it is worth devoting energy and effort to enhancing the plugin/driver model, and agree on the goals of that effort, before driving headfirst into the deep rabbit hole of design/implementation.
>>>
>>> Thoughts? (/me ducks)
>>> -John
>>>
>>> [1]: My opinions on the matter from CloudStack Collab 2013 -> http://www.slideshare.net/JohnBurwell1/how-to-run-from-a-zombie-cloud-stack-distributed-process-management
>>>
>>> --
>>> *Mike Tutkowski*
>>> *Senior CloudStack Developer, SolidFire Inc.*
>>> e: mike.tutkow...@solidfire.com
>>> o: 303.746.7302
>>> Advancing the way the world uses the cloud <http://solidfire.com/solution/overview/?video=play>
>>> *™*
>>>
>>> --
>>> *Mike Tutkowski*
>>> *Senior CloudStack Developer, SolidFire Inc.*
>>> e: mike.tutkow...@solidfire.com
>>> o: 303.746.7302
>>> Advancing the way the world uses the cloud <http://solidfire.com/solution/overview/?video=play>
>>> *™*
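(Purely as an illustration of the CAR scan-and-load flow John describes above; every name here, from CarRuntime to the ".car" extension and the "Running" state string, is hypothetical, and descriptor validation and the real state model are elided:)

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Illustration only: scan a directory for CARs and give each one an isolated
    // execution environment (its own classloader and executor).
    public final class CarScanner {

        /** One isolated execution environment per discovered CAR. */
        public record CarRuntime(URLClassLoader classLoader, ExecutorService executor, String state) {}

        public static CarRuntime load(File car) throws Exception {
            // Dedicated classloader: the driver's bundled dependencies stay isolated
            // from the management server and from other drivers.
            URLClassLoader loader = new URLClassLoader(
                    new URL[] { car.toURI().toURL() }, CarScanner.class.getClassLoader());
            // Dedicated executor: driver work runs off the core server threads.
            ExecutorService executor = Executors.newSingleThreadExecutor();
            // Descriptor validation and the full state machine are elided; assume success.
            return new CarRuntime(loader, executor, "Running");
        }

        public static void main(String[] args) throws Exception {
            File[] cars = new File(args[0]).listFiles((dir, name) -> name.endsWith(".car"));
            if (cars == null) return;
            for (File candidate : cars) {
                System.out.println(candidate.getName() + " -> " + load(candidate).state());
            }
        }
    }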