Darren,

My response does hand wave two important issues -- hot code reloading and PermGen leakage. These are tricky but well-trodden issues that can be solved in a variety of ways (e.g. instrumentation, class loaders, OSGi). It would require some research/experimentation to determine the best approach, particularly when using a lightweight threading model.
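The per-resource actor model discussed in the quoted message below (one lightweight thread and mailbox per managed resource, with a supervisor swapping in a new driver version at a message boundary) could be sketched roughly as follows. This is a minimal illustration under assumptions, not CloudStack code: all names are hypothetical, and the kill-and-respawn step is reduced to an atomic driver swap.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: one "actor" (thread + mailbox) per managed resource.
// An upgrade takes effect at the next message boundary, so no in-flight
// operation is ever interrupted.
public class ResourceSupervisor {
    public interface Driver { String handle(String message); }

    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private final AtomicReference<Driver> driver;
    private final List<String> processed = new CopyOnWriteArrayList<>();
    private volatile boolean running = true;
    private Thread actor;

    public ResourceSupervisor(Driver initial) {
        this.driver = new AtomicReference<>(initial);
    }

    public void start() {
        actor = new Thread(() -> {
            while (running) {
                try {
                    // Poll with a timeout so stop() can terminate the loop.
                    String msg = mailbox.poll(50, TimeUnit.MILLISECONDS);
                    if (msg != null) {
                        // The driver reference is re-read for every message,
                        // so a swap applies between messages, never during one.
                        processed.add(driver.get().handle(msg));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        actor.start();
    }

    public void send(String message) { mailbox.add(message); }

    // The kill-and-respawn step reduced to its essence: once the current
    // message completes, subsequent messages use the new driver version.
    public void upgrade(Driver newDriver) { driver.set(newDriver); }

    public List<String> processed() { return processed; }

    public void stop() throws InterruptedException {
        running = false;
        actor.join();
    }
}
```

A real implementation would replace the atomic swap with supervisor-managed actor termination and respawn, but the operator-visible behavior -- zero lost messages across an upgrade -- is the same.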
Thanks,
-John

On Aug 20, 2013, at 8:31 PM, John Burwell <[email protected]> wrote:

> Darren,
>
> Actually, loading and unloading aren't difficult if resource management and drivers work within the following constraints/assumptions:
>
> Drivers are transient and stateless
> A driver instance is assigned per resource managed (i.e. no singletons)
> A lightweight thread and mailbox (i.e. actor model) are assigned per resource managed (outlined in the presentation referenced below)
>
> Based on these constraints and assumptions, the following upgrade process could be implemented:
>
> Load and verify the new driver version to make it available
> Notify the supervisor processes of each affected resource that a new driver is available
> Upon completion of the current message being processed by its associated actor, the supervisor kills and respawns the actor managing its associated resource
> As part of startup, the supervisor injects an instance of the new driver version, and the actor resumes processing messages in its mailbox
>
> This process mirrors the one that would occur on management server startup for each resource, minus killing an existing actor instance. Eventually, the system will upgrade the driver without loss of operation. More sophisticated policies could be added, but I think this approach would be a solid default upgrade behavior. As a bonus, this same approach could also be applied to global configuration settings -- allowing the system to apply changes to these values without restarting the system.
>
> In summary, CloudStack and Eclipse are very different types of systems. Eclipse is a desktop application implementing complex workflows, user interactions, and management of shared state (e.g. project structure, AST, compiler status, etc.). In contrast, CloudStack is an eventually consistent distributed system performing automation control.
> As such, its plugin requirements are not only very different but, IMHO, much simpler.
>
> Thanks,
> -John
>
> On Aug 20, 2013, at 7:44 PM, Darren Shepherd <[email protected]> wrote:
>
>> I know this isn't terribly useful, but I've been drawing a lot of squares and circles and lines that connect those squares and circles lately, and I have a lot of architectural ideas for CloudStack. At the rate I'm going, it will take me about two weeks to put together a discussion/proposal for the community. What I'm thinking is a superset of what you've listed out and should align with your idea of a CAR. The focus has a lot to do with modularity and extensibility.
>>
>> So more to come soon.... I will say one thing, though: with Java you end up having a hard time doing dynamic loading and unloading of modules. There are plenty of frameworks that try really hard to do this right, like OSGi, but it's darn near impossible to get right because of class loading and GC issues (and that's why Eclipse has you restart after installing plugins even though it is OSGi).
>>
>> I do believe that CloudStack should be capable of zero-downtime maintenance, and I have ideas around that, but at the end of the day, for plenty of practical reasons, you still need a JVM restart if modules change.
>>
>> Darren
>>
>> On Aug 20, 2013, at 3:39 PM, Mike Tutkowski <[email protected]> wrote:
>>
>>> I agree, John - let's get consensus first, then talk timetables.
>>>
>>>
>>> On Tue, Aug 20, 2013 at 4:31 PM, John Burwell <[email protected]> wrote:
>>>
>>>> Mike,
>>>>
>>>> Before we can dig into timelines or implementations, I think we need to get consensus on the problem to be solved and the goals. Once we have a proper understanding of the scope, I believe we can chunk the work across a set of development lifecycles. The subject is vast, but it also has a far-reaching impact on both the storage and network layer evolution efforts.
>>>> As such, I believe we need to start addressing it as part of the next release.
>>>>
>>>> As a separate thread, we need to discuss the timeline for the next release. I think we need to avoid the time compression caused by the overlap of the 4.1 stabilization effort and 4.2 development. Therefore, I don't think we should consider development of the next release started until the first 4.2 RC is released. I will try to open a separate discussion thread for this topic, as well as tying in the discussion of release code names.
>>>>
>>>> Thanks,
>>>> -John
>>>>
>>>> On Aug 20, 2013, at 6:22 PM, Mike Tutkowski <[email protected]> wrote:
>>>>
>>>>> Hey John,
>>>>>
>>>>> I think this is some great stuff. Thanks for the write-up.
>>>>>
>>>>> It looks like you have ideas around what might go into a first release of this plug-in framework. Were you thinking we'd have enough time to squeeze that first rev into 4.3? I'm just wondering (it's not a huge deal to hit that release for this) because we would only have about five weeks.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> On Tue, Aug 20, 2013 at 3:43 PM, John Burwell <[email protected]> wrote:
>>>>>
>>>>>> All,
>>>>>>
>>>>>> In capturing my thoughts on storage, my thinking backed into the driver model. While we have the beginnings of such a model today, I see the following deficiencies:
>>>>>>
>>>>>> 1. *Multiple Models*: The Storage, Hypervisor, and Security layers each have a slightly different model for allowing system functionality to be extended/substituted. These differences increase the barrier to entry for vendors seeking to extend CloudStack and accrete code paths to be maintained and verified.
>>>>>> 2. *Leaky Abstraction*: Plugins are registered through a Spring configuration file.
In addition to being operator-unfriendly (most sysadmins are not Spring experts, nor do they want to be), we expose the core bootstrapping mechanism to operators. Therefore, a misconfiguration could negatively impact the injection/configuration of internal management server components -- essentially handing them a loaded shotgun pointed at our right foot.
>>>>>> 3. *Nondeterministic Load/Unload Model*: Because the core loading mechanism is Spring, the management server has little control over the timing and order of component loading/unloading. Changes to the Management Server's component dependency graph could break a driver by causing it to be started at an unexpected time.
>>>>>> 4. *Lack of Execution Isolation*: As Spring components, plugins are loaded into the same execution context as core management server components. Therefore, an errant plugin can corrupt the entire management server.
>>>>>>
>>>>>> For the next revision of the plugin/driver mechanism, I would like to see us migrate towards a standard pluggable driver model that supports all of the management server's extension points (e.g. network devices, storage devices, hypervisors, etc.) with the following capabilities:
>>>>>>
>>>>>> - *Consolidated Lifecycle and Startup Procedure*: Drivers share a common state machine and categorization (e.g. network, storage, hypervisor, etc.) that permits the deterministic calculation of initialization and destruction order (i.e. network layer drivers -> storage layer drivers -> hypervisor drivers). Inter-dependencies would be supported between plugins sharing the same category.
>>>>>> - *In-process Installation and Upgrade*: Adding or upgrading a driver does not require the management server to be restarted.
This capability implies a system that supports the simultaneous execution of multiple driver versions and the ability to suspend work on a resource while the underlying driver instance is replaced.
>>>>>> - *Execution Isolation*: The deployment packaging and execution environment allows different (and potentially conflicting) versions of dependencies to be used simultaneously. Additionally, plugins would be sufficiently sandboxed to protect the management server against driver instability.
>>>>>> - *Extension Data Model*: Drivers provide a property bag with a metadata descriptor to validate and render vendor-specific data. The contents of this property bag will be provided to every driver operation invocation at runtime. The metadata descriptor would be a lightweight description that provides a label resource key, a description resource key, a data type (string, date, number, boolean), a required flag, and an optional length limit.
>>>>>> - *Introspection*: Administrative APIs/UIs allow operators to understand which drivers are present in the system, their configuration, and their current state.
>>>>>> - *Discoverability*: Optionally, drivers can be discovered via a project repository definition (similar to Yum), allowing drivers to be remotely acquired and operators to be notified regarding update availability. The project would also provide, free of charge, certificates to sign plugins. This mechanism would support local mirroring to support air-gapped management networks.
>>>>>>
>>>>>> Fundamentally, I do not want to turn CloudStack into an erector set with more screws than nuts, which is a risk with highly pluggable architectures.
>>>>>> As such, I think we would need to tightly bound the scope of drivers and their behaviors to prevent the loss of system usability and stability. My thinking is that drivers would be packaged into a custom JAR, a CAR (CloudStack ARchive), structured as follows:
>>>>>>
>>>>>> - META-INF
>>>>>>   - MANIFEST.MF
>>>>>>   - driver.yaml (driver metadata (e.g. version, name, description, etc.) serialized in YAML format)
>>>>>>   - LICENSE (a text file containing the driver's license)
>>>>>> - lib (driver dependencies)
>>>>>> - classes (driver implementation)
>>>>>> - resources (driver message files and potentially JS resources)
>>>>>>
>>>>>> The management server would acquire drivers through a simple scan of a URL (e.g. file directory, S3 bucket, etc.). For every CAR object found, the management server would create an execution environment (likely a dedicated ExecutorService and ClassLoader) and transition the state of the driver to Running (the exact state model would need to be worked out). To be really nice, we could develop a custom Ant task/Maven plugin/Gradle plugin to create CARs. I can also imagine opportunities to add hooks to this model to register instrumentation information with JMX and authorization.
>>>>>>
>>>>>> To keep the scope of this email confined, we would introduce the general notion of a Resource and (hand wave, hand wave) eventually compartmentalize the execution of work around a resource [1]. This (hand-waved) compartmentalization would give us the controls necessary to safely and reliably perform in-place driver upgrades. For an initial release, I would recommend implementing the abstractions, the loading mechanism, the extension data model, and the discovery features. With these capabilities in place, we could attack the in-place upgrade model.
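The CAR acquisition step described above (scan a location, give each discovered archive its own execution environment) could be sketched as follows, assuming a local file directory. The class and method names are hypothetical; the driver.yaml read, signature verification, and state machine are elided.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the proposed CAR scan: each *.car archive found in
// a directory gets a dedicated ClassLoader, so conflicting dependencies in
// one driver's lib/ cannot collide with another driver's.
public class CarScanner {
    public static Map<String, ClassLoader> scan(Path directory) throws IOException {
        Map<String, ClassLoader> drivers = new LinkedHashMap<>();
        try (DirectoryStream<Path> cars = Files.newDirectoryStream(directory, "*.car")) {
            for (Path car : cars) {
                URL[] classpath = { car.toUri().toURL() };
                // Parent is the management server's loader; the child loader
                // isolates the driver's own classes and dependencies.
                ClassLoader isolated = new URLClassLoader(
                        classpath, CarScanner.class.getClassLoader());
                drivers.put(car.getFileName().toString(), isolated);
                // Next steps (not shown): read META-INF/driver.yaml, verify
                // the signature, and transition the driver state to Running.
            }
        }
        return drivers;
    }
}
```

Pairing each loader with a dedicated ExecutorService, as the message suggests, would complete the per-driver execution environment.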
>>>>>> If we were to adopt such a pluggable capability, we would have the opportunity to decouple the vendor and CloudStack release schedules. For example, if a vendor were introducing a new product that required a new or updated driver, they would no longer need to wait for a CloudStack release to support it. They would also gain the ability to fix high-priority defects in the same manner.
>>>>>>
>>>>>> I have hand waved a number of issues that would need to be resolved before such an approach could be implemented. However, I think we need to decide, as a community, that it is worth devoting energy and effort to enhancing the plugin/driver model, and agree on the goals of that effort, before driving head first into the deep rabbit hole of design/implementation.
>>>>>>
>>>>>> Thoughts? (/me ducks)
>>>>>> -John
>>>>>>
>>>>>> [1]: My opinions on the matter from CloudStack Collab 2013 -> http://www.slideshare.net/JohnBurwell1/how-to-run-from-a-zombie-cloud-stack-distributed-process-management
>>>>>
>>>>>
>>>>> --
>>>>> *Mike Tutkowski*
>>>>> *Senior CloudStack Developer, SolidFire Inc.*
>>>>> e: [email protected]
>>>>> o: 303.746.7302
>>>>> Advancing the way the world uses the cloud<http://solidfire.com/solution/overview/?video=play>
>>>>> *™*
>>>
>>>
>>> --
>>> *Mike Tutkowski*
>>> *Senior CloudStack Developer, SolidFire Inc.*
>>> e: [email protected]
>>> o: 303.746.7302
>>> Advancing the way the world uses the cloud<http://solidfire.com/solution/overview/?video=play>
>>> *™*
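As an illustration of the extension data model proposed in the thread -- a property bag validated against a driver-supplied metadata descriptor (label/description resource keys, data type, required flag, optional length limit) -- a minimal sketch might look like this. All names are hypothetical, and DATE validation is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the management server validates the operator-supplied
// property bag against the driver's descriptor before invoking any driver
// operation, keeping vendor-specific data out of core code.
public class DriverDescriptor {
    public enum Type { STRING, DATE, NUMBER, BOOLEAN }

    // One descriptor field: label/description resource keys, a data type,
    // a required flag, and an optional length limit (null = unlimited).
    public record Field(String labelKey, String descriptionKey, Type type,
                        boolean required, Integer maxLength) {}

    private final Map<String, Field> fields = new LinkedHashMap<>();

    public DriverDescriptor add(String name, Field f) {
        fields.put(name, f);
        return this;
    }

    // Returns validation errors; an empty list means the bag is acceptable.
    public List<String> validate(Map<String, String> propertyBag) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, Field> e : fields.entrySet()) {
            String value = propertyBag.get(e.getKey());
            Field f = e.getValue();
            if (value == null) {
                if (f.required()) errors.add(e.getKey() + ": required");
                continue;
            }
            if (f.maxLength() != null && value.length() > f.maxLength())
                errors.add(e.getKey() + ": exceeds length " + f.maxLength());
            if (f.type() == Type.NUMBER && !value.matches("-?\\d+(\\.\\d+)?"))
                errors.add(e.getKey() + ": not a number");
            if (f.type() == Type.BOOLEAN && !value.matches("true|false"))
                errors.add(e.getKey() + ": not a boolean");
            // DATE parsing omitted for brevity.
        }
        return errors;
    }
}
```

The same descriptor could drive UI rendering, since it carries the label and description resource keys the message describes.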
