I could give inline answers, but let's not waste too much more time. One point I would like to make is that the life-cycle functions that driver writers implement take care of how (in what state) instances are stopped.
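(Purely as an illustrative sketch of the kind of life-cycle contract being described here; the interface and state names are hypothetical, not an existing CloudStack or OSGi API:)

    // Hypothetical contract a driver author would implement; the framework drives
    // the transitions, the driver decides how to reach a safe stopped state.
    public interface DriverLifecycle {
        enum State { LOADED, STARTED, DRAINING, STOPPED }

        /** Bring the driver instance up for one managed resource. */
        void start();

        /** Finish or hand off in-flight work so the instance can stop in a consistent state. */
        void drain();

        /** Release resources; called only after drain() completes. */
        void stop();

        /** Current state, so the framework can decide when a stop or upgrade is safe. */
        State state();
    }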
Your point on restricting dependencies is valid and a real concern. And so as not to end this discussion here, I would like to refer back to my previous post; I would love to help on this, notwithstanding any objections I have about the way to go. It seems like fun to implement :)

regards,
Daan

On Mon, Aug 26, 2013 at 5:13 AM, John Burwell <jburw...@basho.com> wrote:
> Daan,
>
> Please see my responses in-line below. The TL;DR is that I am extremely skeptical of the complexity and flexibility of OSGi. My experience with it in practice has not been positive. However, I want to focus on our requirements for a driver mechanism, and then determine the best implementation.
>
> Thanks,
> -John
>
> On Aug 21, 2013, at 4:14 AM, Daan Hoogland <daan.hoogl...@gmail.com> wrote:
>
>> John,
>>
>> You do want 'In-process Installation and Upgrade', 'Introspection' and 'Discoverability', which says that you do want flexibility. You disqualify Spring and OSGi on exactly this quality, however.
>
> On the surface, it would appear that OSGi fits In-process Installation and Upgrade. However, OSGi assumes a consistency attribute that is too rigid for CloudStack. As I understand the specification, when a bundle is upgraded, all instances in the container are upgraded simultaneously. Based on my reading of it, there is no way to customize this behavior. I think the upgrade process needs to be eventually consistent, whereby the underlying driver instance for a resource is upgraded when it is in both a consistent and an upgradeable state. For example, say we have 10,000 KVM hosts and the KVM driver is upgraded. 9,000 of them are idle and can take the upgrade immediately. The other 1,000 are in some state of operation (creating and destroying VMs, taking snapshots, etc). For these 1,000, we want the upgrade to happen when they complete their current work. Most importantly, we don't want any work bound for these 10,000 resources during the upgrade to be lost, only delayed.
>
> When I say discoverability, I mean end-users finding drivers to install. The more I think about it, the more I explicitly do not want drivers to depend on each other. Drivers should be self-contained, stateless mechanisms that interact with some piece of infrastructure. I think the path to madness lies in having a messy web of cross-vendor driver dependencies.
>
>> If we can restrict the use of bundles to those that adhere to some interfaces we prescribe, I don't think either complexity or dependency is an issue.
>
> The only restriction I see is the ability of a bundle to control what is publicly exported. However, I see no way to restrict how bundles depend on each other -- opening the door to cross-vendor driver dependencies.
>
>> Most every bit of complexity of writing a bundle can be hidden from the bundle developer nowadays. If we cannot hide enough, it is indeed not an option. The main focus of OSGi is life-cycle management, which is exactly what we need. The use that Eclipse makes of it is a good example not to follow, but that doesn't disqualify the entire thing.
>
> Personally, I am dubious that a build process can mask complexity. More importantly, I don't like creating designs that require tooling and code generation with a veneer of simplicity but actually create spooky action at a distance. I prefer creating truly simple systems that can be easily comprehended.
>
>> The dependency hell is not different from what we have as a regular monolithic development group.
>> We control what we package and how. A valid point is that some libraries might have issues that prevent them from being bundled, and that needs investigation. So we would need to package those libraries as bundles ourselves so that 3rd parties don't need to. We package them now anyway.
>
> In my experience, the dependency management problem is magnified by the added hurdle that every dependency be an OSGi bundle. Many projects do not natively ship OSGi bundles, leaving third parties or the project itself to repackage them. Often, OSGi-bundled versions lag behind the most current project releases.
>
>> The erector-set fear you have is just as valid with as without OSGi or any existing framework.
>
> Agreed. I prefer inaction on this topic to creating said erector set.
>
>> I don't insist on OSGi, and I do agree with your initial set of requirements. When I read it I think, "let's use OSGi". And I don't see anything but fear of the beast in your arguments against it. Maybe your fear is just in my perception, or maybe it is very valid. I don't perceive it yet after your reply, though.
>
> To my mind, OSGi is a wonderful idea. We need it, or something like it, standard in the JVM. However, in practice, it is a difficult beast because it works around limitations in the JVM. When it works, it is awesome until it breaks or you hit the dependency hell I described. If we adopt it, we need to ensure it will fit our needs and that the functional gain merits taking on the burden of its risks.
>
>> regards,
>> Daan
>>
>> On Wed, Aug 21, 2013 at 9:00 AM, John Burwell <jburw...@basho.com> wrote:
>>> Daan,
>>>
>>> I have the following issues with OSGi:
>>>
>>> Complexity: Building OSGi components adds a tremendous amount of complexity to both building drivers and debugging runtime issues. Additionally, OSGi has a much broader feature set than I think CloudStack needs to support. Therefore, driver authors may use the feature set in unanticipated ways that create system instability.
>>> Dependency Hell: OSGi requires 3rd-party dependencies to be packaged as OSGi bundles. In practice, many third-party libraries either have issues that prevent them from being bundled, or their OSGi-bundled versions are behind the mainline release.
>>>
>>> As an additional personal experience, I do not want to re-create the mess that is Eclipse (i.e. an erector set with more screws than nuts). In addition to its lack of reliability, it is incredibly difficult to comprehend how the component configurations and relationships are composed at runtime.
>>>
>>> To be clear, I am not interested in creating a general-purpose component/plugin model. Fundamentally, we need a simple, purpose-built component model focused on providing stability and reliability through deterministic behavior rather than feature flexibility. Unfortunately, both OSGi and Spring focus on the latter, which makes them ill-suited for our purposes.
>>>
>>> Thanks,
>>> -John
>>>
>>> On Aug 21, 2013, at 2:31 AM, Daan Hoogland <daan.hoogl...@gmail.com> wrote:
>>>
>>> John,
>>>
>>> Nice work.
>>> Given the maturity of OSGi, I'd say let's see how it fits. One criterion would be whether we can limit the bundles that may be loaded to what CloudStack supports (and not allow loading pydev); if not, we need to bake our own.
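(A minimal sketch of what such a restriction could look like; the Driver interface and the category names below are hypothetical, not an existing CloudStack API:)

    import java.util.Set;

    /** Illustration only: admit a candidate class only if it implements a prescribed contract. */
    public final class DriverAdmission {

        /** The contract CloudStack would prescribe; anything else is refused. */
        public interface Driver {
            void start();
            void stop();
        }

        // Driver categories the management server is willing to host.
        private static final Set<String> SUPPORTED_CATEGORIES = Set.of("network", "storage", "hypervisor");

        /** True only for classes implementing the prescribed contract in a supported category. */
        public static boolean isLoadable(Class<?> candidate, String declaredCategory) {
            return Driver.class.isAssignableFrom(candidate)
                    && SUPPORTED_CATEGORIES.contains(declaredCategory);
        }
    }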
>>> But though I think your work is valuable, I disagree with designing our CARs from the get-go without having explored usable options in the field first. A new type of YAR is not what the world or CloudStack needs. And given what you have written, the main problem will be finding a framework we can restrict to what we want, not one that can do all of it.
>>>
>>> done shooting,
>>> Daan
>>>
>>> On Wed, Aug 21, 2013 at 2:52 AM, Darren Shepherd <darren.s.sheph...@gmail.com> wrote:
>>>
>>> Sure, I fully understand how it theoretically works, but I'm saying that from a practical perspective it always seems to fall apart. What you're describing is done excellently in OSGi 4.2 Blueprint. It's a beautiful framework that allows you to expose services that can be dynamically updated at runtime.
>>>
>>> The issues always happen with unloading. I'll give you a real-world example. As part of the servlet spec you're supposed to be able to stop and unload WARs. But in practice, if you do it enough times you typically run out of memory. One such issue was with commons-logging (since fixed). When you do getLogger(MyClass.class), it would cache a mapping from the Class object to the actual log impl. The commons-logging jar is typically loaded by the system classloader, but MyClass.class would be loaded in the webapp classloader. So when you stop the WAR there is a reference chain: system classloader -> LogFactory -> MyClass -> webapp classloader. So the webapp never gets GC'd.
>>>
>>> So just pointing out the practical issues, that's it.
>>>
>>> Darren
>>>
>>> On Aug 20, 2013, at 5:31 PM, John Burwell <jburw...@basho.com> wrote:
>>>
>>> Darren,
>>>
>>> Actually, loading and unloading aren't difficult if resource management and drivers work within the following constraints/assumptions:
>>>
>>> - Drivers are transient and stateless
>>> - A driver instance is assigned per resource managed (i.e. no singletons)
>>> - A lightweight thread and mailbox (i.e. actor model) are assigned per resource managed (outlined in the presentation referenced below)
>>>
>>> Based on these constraints and assumptions, the following upgrade process could be implemented:
>>>
>>> 1. Load and verify the new driver version to make it available
>>> 2. Notify the supervisor process of each affected resource that a new driver is available
>>> 3. Upon completion of the current message being processed by its associated actor, the supervisor kills and respawns the actor managing its associated resource
>>> 4. As part of startup, the supervisor injects an instance of the new driver version, and the actor resumes processing messages in its mailbox
>>>
>>> This process mirrors the process that would occur on management server startup for each resource, minus killing an existing actor instance. Eventually, the system will upgrade the driver without loss of operation. More sophisticated policies could be added, but I think this approach would be a solid default upgrade behavior. As a bonus, this same approach could also be applied to global configuration settings -- allowing the system to apply changes to those values without restarting.
>>>
>>> In summary, CloudStack and Eclipse are very different types of systems. Eclipse is a desktop application implementing complex workflows, user interactions, and management of shared state (e.g. project structure, AST, compiler status, etc).
>>> In contrast, CloudStack is an eventually consistent distributed system performing automation control. As such, its plugin requirements are not only very different but, IMHO, much simpler.
>>>
>>> Thanks,
>>> -John
>>>
>>> On Aug 20, 2013, at 7:44 PM, Darren Shepherd <darren.s.sheph...@gmail.com> wrote:
>>>
>>> I know this isn't terribly useful, but I've been drawing a lot of squares and circles and lines that connect those squares and circles lately, and I have a lot of architectural ideas for CloudStack. At the rate I'm going it will take me about two weeks to put together a discussion/proposal for the community. What I'm thinking is a superset of what you've listed out and should align with your idea of a CAR. The focus has a lot to do with modularity and extensibility.
>>>
>>> So more to come soon.... I will say one thing, though: with Java you end up having a hard time doing dynamic loading and unloading of modules. There are plenty of frameworks that try really hard to do this right, like OSGi, but it's darn near impossible to do it right because of class loading and GC issues (and that's why Eclipse has you restart after installing plugins even though it is OSGi).
>>>
>>> I do believe that CloudStack should be capable of zero-downtime maintenance, and I have ideas around that, but at the end of the day, for plenty of practical reasons, you still need a JVM restart if modules change.
>>>
>>> Darren
>>>
>>> On Aug 20, 2013, at 3:39 PM, Mike Tutkowski <mike.tutkow...@solidfire.com> wrote:
>>>
>>> I agree, John - let's get consensus first, then talk timetables.
>>>
>>> On Tue, Aug 20, 2013 at 4:31 PM, John Burwell <jburw...@basho.com> wrote:
>>>
>>> Mike,
>>>
>>> Before we can dig into timelines or implementations, I think we need to get consensus on the problem to be solved and the goals. Once we have a proper understanding of the scope, I believe we can chunk the work across a set of development cycles. The subject is vast, but it also has a far-reaching impact on both the storage and network layer evolution efforts. As such, I believe we need to start addressing it as part of the next release.
>>>
>>> As a separate thread, we need to discuss the timeline for the next release. I think we need to avoid the time compression caused by the overlap of the 4.1 stabilization effort and 4.2 development. Therefore, I don't think we should consider development of the next release started until the first 4.2 RC is released. I will try to open a separate discussion thread for this topic, as well as tie in the discussion of release code names.
>>>
>>> Thanks,
>>> -John
>>>
>>> On Aug 20, 2013, at 6:22 PM, Mike Tutkowski <mike.tutkow...@solidfire.com> wrote:
>>>
>>> Hey John,
>>>
>>> I think this is some great stuff. Thanks for the write-up.
>>>
>>> It looks like you have ideas around what might go into a first release of this plug-in framework. Were you thinking we'd have enough time to squeeze that first rev into 4.3? I'm just wondering (it's not a huge deal to hit that release for this) because we would only have about five weeks.
>>>
>>> Thanks
>>>
>>> On Tue, Aug 20, 2013 at 3:43 PM, John Burwell <jburw...@basho.com> wrote:
>>>
>>> All,
>>>
>>> In capturing my thoughts on storage, my thinking backed into the driver model. While we have the beginnings of such a model today, I see the following deficiencies:
>>>
>>> 1. *Multiple Models*: The Storage, Hypervisor, and Security layers each have a slightly different model for allowing system functionality to be extended/substituted. These differences increase the barrier to entry for vendors seeking to extend CloudStack and accrete code paths that must be maintained and verified.
>>> 2. *Leaky Abstraction*: Plugins are registered through a Spring configuration file. In addition to being operator unfriendly (most sysadmins are not Spring experts, nor do they want to be), we expose the core bootstrapping mechanism to operators. Therefore, a misconfiguration could negatively impact the injection/configuration of internal management server components. Essentially, we are handing them a loaded shotgun pointed at our right foot.
>>> 3. *Nondeterministic Load/Unload Model*: Because the core loading mechanism is Spring, the management server has little control over the timing and order of component loading/unloading. Changes to the Management Server's component dependency graph could break a driver by causing it to be started at an unexpected time.
>>> 4. *Lack of Execution Isolation*: As Spring components, plugins are loaded into the same execution context as core management server components. Therefore, an errant plugin can corrupt the entire management server.
>>>
>>> For the next revision of the plugin/driver mechanism, I would like to see us migrate towards a standard pluggable driver model that supports all of the management server's extension points (e.g. network devices, storage devices, hypervisors, etc) with the following capabilities:
>>>
>>> - *Consolidated Lifecycle and Startup Procedure*: Drivers share a common state machine and categorization (e.g. network, storage, hypervisor, etc) that permits the deterministic calculation of initialization and destruction order (i.e. network layer drivers -> storage layer drivers -> hypervisor drivers). Plugin inter-dependencies would be supported only between plugins sharing the same category.
>>> - *In-process Installation and Upgrade*: Adding or upgrading a driver does not require the management server to be restarted. This capability implies a system that supports the simultaneous execution of multiple driver versions and the ability to suspend continued execution of work on a resource while the underlying driver instance is replaced.
>>> - *Execution Isolation*: The deployment packaging and execution environment allows different (and potentially conflicting) versions of dependencies to be used simultaneously. Additionally, plugins would be sufficiently sandboxed to protect the management server against driver instability.
>>> - *Extension Data Model*: Drivers provide a property bag with a metadata descriptor to validate and render vendor-specific data. The contents of this property bag will be provided to every driver operation invocation at runtime. The metadata descriptor would be a lightweight description that provides a label resource key, a description resource key, a data type (string, date, number, boolean), a required flag, and an optional length limit.
>>> - *Introspection*: Administrative APIs/UIs allow operators to understand which drivers are installed in the system, their configuration, and their current state.
>>> - *Discoverability*: Optionally, drivers can be discovered via a project repository definition (similar to Yum), allowing drivers to be remotely acquired and operators to be notified regarding update availability. The project would also provide, free of charge, certificates to sign plugins. This mechanism would support local mirroring for air-gapped management networks.
>>>
>>> Fundamentally, I do not want to turn CloudStack into an erector set with more screws than nuts, which is a risk with highly pluggable architectures. As such, I think we would need to tightly bound the scope of drivers and their behaviors to prevent the loss of system usability and stability. My thinking is that drivers would be packaged into a custom JAR, a CAR (CloudStack ARchive), that would be structured as follows:
>>>
>>> - META-INF
>>> - MANIFEST.MF
>>> - driver.yaml (driver metadata (e.g. version, name, description, etc) serialized in YAML format)
>>> - LICENSE (a text file containing the driver's license)
>>> - lib (driver dependencies)
>>> - classes (driver implementation)
>>> - resources (driver message files and potentially JS resources)
>>>
>>> The management server would acquire drivers through a simple scan of a URL (e.g. a file directory, an S3 bucket, etc). For every CAR object found, the management server would create an execution environment (likely a dedicated ExecutorService and ClassLoader) and transition the state of the driver to Running (the exact state model would need to be worked out). To be really nice, we could develop a custom Ant task/Maven plugin/Gradle plugin to create CARs. I can also imagine opportunities to add hooks to this model to register instrumentation information with JMX and to add authorization.
>>>
>>> To keep the scope of this email confined, we would introduce the general notion of a Resource and (hand wave, hand wave) eventually compartmentalize the execution of work around a resource [1]. This (hand-waved) compartmentalization would give us the controls necessary to safely and reliably perform in-place driver upgrades. For an initial release, I would recommend implementing the abstractions, loading mechanism, extension data model, and discovery features. With these capabilities in place, we could attack the in-place upgrade model.
>>>
>>> If we were to adopt such a pluggable capability, we would have the opportunity to decouple the vendor and CloudStack release schedules. For example, if a vendor were introducing a new product that required a new or updated driver, they would no longer need to wait for a CloudStack release to support it. They would also gain the ability to fix high-priority defects in the same manner.
>>>
>>> I have hand-waved a number of issues that would need to be resolved before such an approach could be implemented.
>>> However, I think we need to decide, as a community, that it is worth devoting energy and effort to enhancing the plugin/driver model, and agree on the goals of that effort, before driving headfirst into the deep rabbit hole of design/implementation.
>>>
>>> Thoughts? (/me ducks)
>>> -John
>>>
>>> [1]: My opinions on the matter from CloudStack Collab 2013 -> http://www.slideshare.net/JohnBurwell1/how-to-run-from-a-zombie-cloud-stack-distributed-process-management
>>>
>>> --
>>> *Mike Tutkowski*
>>> *Senior CloudStack Developer, SolidFire Inc.*
>>> e: mike.tutkow...@solidfire.com
>>> o: 303.746.7302
>>> Advancing the way the world uses the cloud <http://solidfire.com/solution/overview/?video=play>
>>> *™*
>>>
>>> --
>>> *Mike Tutkowski*
>>> *Senior CloudStack Developer, SolidFire Inc.*
>>> e: mike.tutkow...@solidfire.com
>>> o: 303.746.7302
>>> Advancing the way the world uses the cloud <http://solidfire.com/solution/overview/?video=play>
>>> *™*
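(Purely as an illustration of the CAR scan-and-load flow John describes above; every name here, from CarRuntime to the ".car" extension and the "Running" state string, is hypothetical, and descriptor validation and the real state model are elided:)

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Illustration only: scan a directory for CARs and give each one an isolated
    // execution environment (its own classloader and executor).
    public final class CarScanner {

        /** One isolated execution environment per discovered CAR. */
        public record CarRuntime(URLClassLoader classLoader, ExecutorService executor, String state) {}

        public static CarRuntime load(File car) throws Exception {
            // Dedicated classloader: the driver's bundled dependencies stay isolated
            // from the management server and from other drivers.
            URLClassLoader loader = new URLClassLoader(
                    new URL[] { car.toURI().toURL() }, CarScanner.class.getClassLoader());
            // Dedicated executor: driver work runs off the core server threads.
            ExecutorService executor = Executors.newSingleThreadExecutor();
            // Descriptor validation and the full state machine are elided; assume success.
            return new CarRuntime(loader, executor, "Running");
        }

        public static void main(String[] args) throws Exception {
            File[] cars = new File(args[0]).listFiles((dir, name) -> name.endsWith(".car"));
            if (cars == null) return;
            for (File candidate : cars) {
                System.out.println(candidate.getName() + " -> " + load(candidate).state());
            }
        }
    }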