Hi Phillip:

I wanted to mention again that we could probably make our schema evolution task much easier if we only migrate Items and Collections and not the UI Instances, and instead just delete them from the repository. The consequence is that you'd still have all your data but you'd get a new UI, which isn't so bad. Since there are a lot of UI classes, this could be a real time saver.

John

Phillip J. Eby wrote:

Overview
========

With the advent of usable calendaring in 0.6, we have a new and scary thing to think about: needing to support actual users. :) Or more specifically, being able to upgrade a Chandler installation without recreating all its data.

There are four kinds of things that we need to be able to upgrade:

1. Python code
2. Parcel-defined items, including UI items
3. Parcel-defined repository schema
4. User data that may need to be changed to reflect a schema change

Few - if any - of these items can currently be upgraded without recreating your repository. Many are largely unexplored problems.

Luckily, we don't need to solve all of these upgrade problems for 0.6, although now is a good time to start thinking about them, to make sure that we have at least some basis for doing them in the future.

In this proposal, I'll be focusing first on how we can make it possible to make code and UI changes without needing even to *restart* Chandler, so that developers can make and test changes more quickly. But I'll also be exploring what we can do to detect schema changes or parcel version changes, so that we're in a better position to support future upgrades.


Reloading Code
==============

In general, reloading Python code is a hard problem to completely solve. This is because a module that imports another module may use imported objects during its initialization - for example to subclass an imported class. This means that if the imported module is reloaded, the importing module can become out-of-date.

However, for most simple development use cases - which mainly involve changes to functions or to the methods of existing classes - it should be possible to work around this issue. I propose to add a metaclass to the schema API that will allow classes to be redefined during a ``reload()`` operation in such a way that the original class is modified in-place, instead of being replaced with a new class. This will allow a simple ``reload()`` operation on a module to update the methods of a class. And, by default, Item classes will have this ability; non-Item classes will need to make explicit use of the metaclass.

There are, however, some side effects. The metaclass will have no way to know whether a reload is taking place, except by whether there is already a symbol of the same name as the class in the module. When a module is reloaded, the existing version of the class will still be in the module's dictionary when the new version is being defined. So, the metaclass will check for the existing class, and then update that existing class instead of replacing it.
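To make this concrete, here is a minimal, self-contained sketch of how such a metaclass could perform an in-place update. The name ``ReloadableClass`` anticipates the proposal below, but the details are my assumption, not the final schema API; for brevity this sketch tracks prior definitions in a registry rather than inspecting the module's dictionary as described above:

```python
class ReloadableClass(type):
    """Sketch: redefining a same-named class updates the original in place."""

    _registry = {}  # (module, qualified name) -> previously created class

    def __new__(meta, name, bases, namespace):
        key = (namespace.get('__module__'), namespace.get('__qualname__', name))
        existing = meta._registry.get(key)
        if existing is not None:
            # A same-named class was already defined: assume a reload and
            # copy the new definitions onto the old class object, so that
            # references held by other modules remain valid.
            for attr, value in namespace.items():
                setattr(existing, attr, value)
            return existing
        cls = super().__new__(meta, name, bases, namespace)
        meta._registry[key] = cls
        return cls
```

Defining the class a second time then hands back the *same* class object with updated methods - which is also exactly what makes the name-collision problem described next possible.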


Name Collisions
---------------

The downside to this approach is that it can be fooled into thinking a reload is taking place, if an object of the same name already exists in the module at that point in time. For example, this is perfectly legal Python code, but will not work the same way once the metaclass is used::

    from somewhere import SomeItemClass

    class SomeItemClass(SomeItemClass):
        def foo(self):
            return self.bar

Without the metaclass, this does exactly what it looks like it does - it creates a ``SomeItemClass`` subclass of ``somewhere.SomeItemClass``. But *with* the metaclass, this will *overwrite* ``somewhere.SomeItemClass`` with the contents of the new class, because the metaclass will think you are reloading the module.

Actually, in this simple example, the metaclass could check the ``__module__`` of the class in question, and give you an error message. The error would occur even at initial import, and you'd quickly change your code to something like this::

    from somewhere import SomeItemClass as _SomeItemClass

    class SomeItemClass(_SomeItemClass):
        def foo(self):
            return self.bar

which would immediately fix the problem. However, if you do something like this::

    class SomeItemClass(schema.Item):
        pass

    class SomeItemClass(SomeItemClass):
        pass

there is no way to detect the problem, at least if we also allow changing a class' inheritance tree when code is reloaded. If we require a class' base classes to remain the same across reloads, then we could detect this error by virtue of the different inheritance, and we could again give you an error message so you'd change your code.

This is probably the best option, although it prevents you from changing a class' bases without restarting Chandler. I would expect base class changes to be rare, however, so this is probably an acceptable convenience vs. safety tradeoff. I propose that the error message for any of the above collisions read something like::

    NameError: SomeItemClass already defined in module blah.blah;
    please rename either the existing class or the new class

And it would occur as soon as the name collision exists, not just at reload time. However, if you only introduce the collision between reloads, then of course it will occur when you reload.

The metaclass would be called ``schema.ReloadableClass``, so if you need to use it in a non-Item class, you would do something like::

   class MyArbitraryNonItemClass(SomeBase):
       __metaclass__ = schema.ReloadableClass

And the same name collision rules would apply as for item classes.


Reloading Functions
-------------------

To support reloading of module-level functions, there will be a ``schema.reloadable`` decorator, used as follows::

    @schema.reloadable
    def some_function(some_arg, other_arg, ...):
        # whatever

The purpose of this decorator is to allow a function to be updated in-place, even if another module has already imported it. The only time you would use this is if you are changing the function and want to reload it. In other words, the function would normally look like this::

    def some_function(some_arg, other_arg, ...):
        # whatever

If you need to change the function while Chandler is running, then you would add the ``@schema.reloadable`` line, make the change, and reload the module. But, before you check your changes back in to Subversion, you should remove the decorator, just as you would remove debugging prints. It's strictly a development tool, needed only for top-level functions, and only for ones that you're editing while Chandler is running.

There are some rather strict limitations on what this decorator can do, by the way. It must be the "outermost" (first) decorator for a given function, and any nested decorators must preserve the function name in any transform. You won't be able to add new required arguments, or rename the previous arguments. However, these kinds of changes are unlikely to be the sort you could make without restarting Chandler anyway.
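One plausible way such a decorator could work - an assumption on my part, not a committed implementation - is to hand importers a stable wrapper whose target is swapped at reload time, which would also explain the name-preservation requirement:

```python
import functools

_reloadables = {}  # (module, name) -> stable wrapper; hypothetical registry


def reloadable(func):
    """Sketch of schema.reloadable: importers hold a wrapper object whose
    body can be replaced in place when the defining module is reloaded."""
    key = (func.__module__, func.__name__)
    existing = _reloadables.get(key)
    if existing is not None:
        # Reload: point the already-exported wrapper at the new code.
        existing.__wrapped__ = func
        return existing

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Always dispatch through the attribute, so a reload takes effect
        # even for callers holding an old reference to the wrapper.
        return wrapper.__wrapped__(*args, **kwargs)

    _reloadables[key] = wrapper
    return wrapper
```

Modules that imported the wrapper before the reload would transparently pick up the new body on their next call.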

The most likely place where you'd need to use this decorator right now is on ``installParcel()`` functions that are defined in one module, but *used* in another via importing. This would also apply to utility routines defined in one module, but imported in another module for use by an ``installParcel()`` function. For example, if you have a parcel that does this::

    from some.where import createMenus

    def installParcel(parcel, oldVersion=None):
        createMenus(parcel)

You would need to add the ``@schema.reloadable`` decorator to the ``createMenus()`` function definition in ``some.where`` if you wanted to change ``createMenus()`` without restarting Chandler. (Of course, you would then also need to reload the parcels that are using the ``createMenus()`` function, which is the subject of the next section of this proposal.)


Updating Parcel-Defined Items and UI
====================================

Merely reloading a Python module doesn't affect what items are in the repository, even if you've edited the ``installParcel()`` function or a utility function it calls. So, there needs to be a way to reload a parcel and update the items it contains.

Luckily, the mechanisms normally used in ``installParcel()`` should update existing items in-place, so really the only special thing that needs to be done to allow updating on-the-fly is providing a way to re-invoke ``installParcel()``.

My current thought is that the way to expose this API would be to add a ``reload()`` method to ``schema.ns()``, e.g.::

    pim = schema.ns('osaf.pim', view)
    pim.reload()  # reload the osaf.pim parcel (but not subparcels!)

This would perform a reload of the module (and the package, if the parcel is a package), and then reinvoke the ``installParcel()`` for the parcel, to reload the items. Since this would also take care of reloading code, this would probably be the thing to run to update a changed parcel. Someone could perhaps provide a test-menu option to do this, that would ask for the parcel name. Of course, it could also be done by just dropping into a PyShell. Users of the 'headless' utility, or those running Chandler under a debugger, could also invoke the operation directly.

This feature will *not*, however, handle general updates to the repository schema. In fact, only one kind of schema change will be supported: adding new classes. If you add a class to a parcel and reload it -- assuming you've done the import in ``__init__.py``, if needed -- then the new kind will become available. Changes to existing classes will be ignored, unless you recreate the repository. Which is why the next section will talk about...


Updating Chandler Schema
========================

"Do you, Programmer, take this Object to be part of the persistent state of your application, to have and to hold, through maintenance and iterations, for past and future versions, as long as the application shall live?"

    "Erm, can I get back to you on that?"

    -- from "Making a class serializable",
       http://www.erights.org/e/StateSerialization.html


In general, schema evolution is a hard problem. So what I'd like to do here is first lay out some background to show just *how* hard, and then backpedal a bit to what more specific goals I think are achievable with what we're doing in 0.6 and 0.7.


Schema Additions
----------------

But first, something simple. Additive changes to the schema are relatively easy compared to other kinds of change, since they can sometimes be done without changing existing items. In fact, adding new kinds can be done without even restarting Chandler, as we saw in the previous section. This is especially nice in that it means we'll be able to download and install new parcels while Chandler is running - but upgrading an already-installed parcel will require a restart for stability.

Adding new attributes to existing kinds is a little trickier, because right now the schema API doesn't scan a kind's attributes if the kind already exists in the repository. But we could add something that would check a parcel's version and do a thorough re-scan of every kind defined by the parcel, whenever the parcel version changed. This would be part of an at-startup check of parcel versions.
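The check itself could be quite simple. This sketch uses a plain dict to stand in for the parcel item in the repository, and a callback for the re-scan; all of the names here are assumptions, not existing API:

```python
def check_parcel_version(parcel_state, code_version, rescan_kinds):
    """Re-scan a parcel's kinds only when the version recorded in the
    repository differs from the version declared in the code."""
    if parcel_state.get('version') != code_version:
        rescan_kinds()  # thorough re-scan of every kind the parcel defines
        parcel_state['version'] = code_version
        return True     # a re-scan happened
    return False        # versions match; nothing to do
```

On an ordinary startup where nothing has changed, the check falls through immediately, so the cost is only paid after an upgrade.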

The major complication introduced by adding attributes is attributes that should have a value for existing items in the repository. In terms of repository stability, this is not a big deal, as the repository doesn't care that the attributes are missing unless they are marked ``required``, and you run ``check()``.

However, for application functionality, it means that new versions of parcels must either:

1. Never assume an attribute exists, unless it was supplied and initialized by the first public release of the parcel, and *every release since*. Or,

2. Use ``defaultValue``, so the attribute always appears to have a value

The downside of option 1 is that you have to keep track of what you released, and when you changed it, "through maintenance and iterations, for past and future versions, as long as the application shall live." The downside of option 2 is that the attribute can never *not* have a value, and there may be other limitations associated with ``defaultValue``, which we have mostly not been using for some time.

Note that ``defaultValue`` is different from ``initialValue``. An ``initialValue`` is set when an item is created. If you later delete the attribute, the ``initialValue`` does not come back. Similarly, if you add a new attribute with an ``initialValue``, or change the ``initialValue`` of an existing attribute definition, this does not affect already-created items, even if they don't have a value for that attribute.
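The distinction can be illustrated with plain Python stand-ins - this is an analogy, not the repository's actual implementation: an ``initialValue`` is copied onto the item once at creation, while a ``defaultValue`` is consulted on every access where the attribute has no stored value:

```python
class Attribute:
    def __init__(self, initialValue=None, defaultValue=None):
        self.initialValue = initialValue
        self.defaultValue = defaultValue


class Item:
    # One attribute defined both ways, to contrast the two behaviors
    schema = {'status': Attribute(initialValue='new', defaultValue='unknown')}

    def __init__(self):
        for name, attr in self.schema.items():
            if attr.initialValue is not None:
                setattr(self, name, attr.initialValue)  # set once, at creation

    def get(self, name):
        if name in self.__dict__:
            return self.__dict__[name]
        return self.schema[name].defaultValue  # consulted on every access
```

Deleting the stored value exposes the default, but the initial value never "comes back" - which is the behavior described above.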

Incidentally, this is somewhat related to the issue that we've sometimes had with ``initialValue``, in that we would often like an attribute's initial value to be computed, rather than a constant. For example, creation and modification dates want to default to the ``datetime.now()`` at the time the item is created. It may also be that we would like to be able to have some code run for existing items and set a computed initial value, when a new attribute is added.

One way we can accomplish this is to have the relevant ``installParcel()`` function include a block of code like::

    def installParcel(parcel, oldVersion=None):

        # ...

        for item in SomeChangedClass.iterItems(parcel.itsView):
            if not hasattr(item,"newattr"):
                item.newattr = some_calculation(item)

Since ``installParcel()`` is only invoked when a parcel is installed, upgraded, or explicitly reloaded, this operation would be reasonable in many cases, especially since it will not do any work when the parcel is first installed (because there will be no items of the changed class yet).

However, for upgrades and reloads, it could possibly be quite slow, and might need some way to display or update a progress meter. But the mechanism for this needs to somehow be decoupled from the schema API and the standard Chandler UI, because it also needs to work when run under ``headless``, and of course unit tests need to work too.

Oh, and don't forget - you can't ever remove that upgrade code from ``installParcel()``, unless of course you stop using the attribute.

Ah, if only all schema evolution issues were as simple as additions!  :)


Moves and Renames
-----------------

Additions, alas, are not the only kind of schema changes we're likely to have in future versions. It's extremely likely that in 0.7 we'll be doing a lot of moves and renames to finish our parcel/package flattening and the move to a standardized layout for API packages.

But, we currently use the names and locations of modules, classes, and attributes to synchronize our schema definition with the schema stored in the repository. This means that if we move a class around, or rename it, it no longer has an identifier matching that of the Kind in the repository. So, even if we made no *actual* change to the schema, we can completely trash someone's existing data just by moving or renaming things in the normal course of refactoring.

Indeed, the repository stores in each Kind a reference to the class that implements it, so even if we grabbed the existing Kind and tried to move or rename it in our ``installParcel()`` routines, we would get an error. Even though the old class doesn't exist any more, the repository would try to load it because it's still referred to by the existing Kind. Andi says this particular issue can probably be worked around in the repository, but it's only the tip of the iceberg here.

The real issue here is *identification*. How do we uniquely identify a class, attribute, or parcel once it has moved? One possibility is to give each such object a list of all the paths it lived at in previous versions, so that it could check all those locations and then move the relevant item to the new location. Of course, such a list could grow longer over time, and its entries can never be removed.

Another possibility is to assign fixed UUIDs to the items. If you moved a class, attribute, or parcel, you would first need to find out its current UUID, and add that information to the source code. Then, when you move or rename an item, its UUID would move with it, and remain in sync.

How would we initially assign UUIDs? Well, there is a kind of UUID called a "namespace UUID" which would be useful for this purpose. A namespace UUID is generated by hashing a name string with a base UUID, to create a new UUID. This would allow us to automatically generate fixed UUIDs for items based on their name, so that it won't be necessary to manually assign individual UUIDs for every parcel, class, and attribute. But when you move or rename something, you would need to find out its assigned UUID, and add it to the code, or else you'll be creating a new item and abandoning the old one.
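Python's standard library already supports this scheme via ``uuid.uuid5``. The base UUID below is made up for illustration; a real one would be chosen once and fixed forever:

```python
import uuid

# Hypothetical project-wide base UUID, derived here from a URL for convenience
BASE = uuid.uuid5(uuid.NAMESPACE_URL, 'http://osafoundation.org/schema')


def schema_uuid(dotted_name):
    """Derive a stable UUID from an item's dotted name."""
    return uuid.uuid5(BASE, dotted_name)
```

The same name always hashes to the same UUID, which is what makes automatic assignment possible - and also why a rename silently produces a *different* UUID unless the old one is recorded in the code.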

Error checking, of course, is a problem with this scheme, since just renaming or moving something isn't going to move the UUID. Also, if in one upgrade you rename class X to "Y", and then in a later upgrade you create a new "X", the new "X" will be assigned the same UUID that the original X was, and now you have to detect the collision.

Alternately, we could require that parcels and classes be manually assigned UUIDs, but we could allow attribute UUIDs to be automatically generated. It's easier to crosscheck a single class' attributes for naming collisions in the "rename X to Y and create a new X" case, and for everything else it guarantees that the UUID will move with the parcel or class, and ensure continuity.

At Chandler's current size, that's a few hundred UUIDs we would need to manually assign. The process would simply require running a tool like ``uuidgen`` to generate the UUIDs, and then adding them to the ``kindInfo()`` for each class, and a ``__parcel_id__ = "..."`` assignment in the parcel's main module. It also adds a step to creating a new kind or parcel, but not a particularly difficult one.


Changes and Deletions
---------------------

The next type of schema change that can occur is changes to metadata. For classes, clouds, and parcels, metadata changes are fairly harmless, as they don't usually affect the user's data in any way. For attributes, however, changes to metadata like type or cardinality could require changing all existing values of that attribute. Such changes would actually require creating a new attribute and copying the old values over to it, then deleting the old attribute. And it's not immediately obvious how we'd go about doing that.

Deletions from the schema are likely to be rare, at least when viewed in the "upgrade" direction. But we may also need to consider the "downgrade" case, where someone wants to revert to a previous version of a package, and therefore needs to undo schema changes. Implicitly this can involve removing some part of the schema that was added, although in practice there is no actual need to remove it, since it will be inaccessible. It does mean, however, that the upgrade mechanism will eventually need to be robust in the face of repeated upgrades.


Analysis
========

Providing robust support for schema changes is an "interesting" problem. Once a schema is officially released, it appears that constant vigilance will be required, to ensure that changes always provide a migration path for users' data. We do not currently have any ways to validate changes made between a particular pair of schema versions, nor to track the changes made (other than indirectly, via source code changes).

With sufficient care and infrastructure support, we can relatively easily support manual schema upgrades, in the sense of having ``installParcel()`` make the changes, if we entirely forbid certain classes of schema change that could not be implemented in this way. However, the amount of developer care required currently appears prohibitive, in the sense that it's going to seriously impede our flexibility to refactor.

In order to remove these impediments, we would have to have some way of automatically tracking changes to the schema, as a kind of revision log that would be kept alongside the code in Subversion. When changing between versions of a parcel, it should be possible for the system to automatically apply the relevant changes to both the schema, and the corresponding data. This logging system would then also be able to prohibit (via error messages) any changes that could not be supported without extending the revision system. The actual tracking mechanism will probably need to use some of the techniques described here:

http://citeseer.ist.psu.edu/staudtlerner96model.html

There would also need to be some tools to manipulate the log. For example, to mark a point in the log as corresponding to a particular release version, to simplify applying changes. Or to list the changes between releases, etc. I'm not going to attempt to fully specify the system at this time, since I don't think we can reasonably include something like it in 0.6. We may be forced to simply require that 0.6 -> 0.7 upgrades go through an export-and-import process.

My original intent with this proposal was to try to support some minimal schema versioning support in 0.6, but as the analysis progressed it has become apparent that, given the complexity and the lateness of the date, it's going to be simpler to just introduce parcel versioning in 0.7 alongside the introduction of Python Eggs, since eggs include version metadata already, and they provide a natural boundary for the schema revision tracking system described above.

In short, I think the only part of these proposals that can reasonably be implemented for 0.6 are the parts to support reloading code and parcel-defined items, which should be helpful to developers working on code changes that currently require a Chandler restart before testing.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Open Source Applications Foundation "Dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/dev
