Phillip,
I've got some concerns about this proposal: the ability to upgrade seems
centered around being able to upgrade running Python code on-the-fly, as
well as to upgrade Chandler's schema without quitting Chandler. These
abilities appear to complicate the solution you've chosen, and I don't
think we need either of them.
I think there are three use cases to focus on here:
- Regular users
- QA folks and serious dogfooders
- Developers
Regular users are the most important case: they'll be upgrading Chandler
relatively infrequently (say, weekly at most for the foreseeable future
if they're really avid trackers of our development, then less often),
and I'm guessing that it wouldn't be a hardship to require quitting
Chandler for a moment for such an upgrade. Thus, on-the-fly upgrading
doesn't buy them much.
QA people and serious dogfooders are facing more frequent upgrades -
they clearly need mechanisms that "automatically" evolve Chandler, but
they also need a stable platform on which to test, and I'm concerned
that trying to achieve full dynamic upgradability will actually
complicate the testing environment: we'll end up having to test the
dynamic upgrade mechanism for every schema evolution, and reporting,
investigating, and validating the resulting bugs will add substantial
QA workload (again, for little benefit to users).
Some developers working on Chandler would like a means to make code and
UI changes without needing to restart Chandler (I'm not one of them, but
I know Donn is); I think there may be ways to achieve this in certain
areas without having it affect our coding strategies in the pervasive
way that you propose. Developers adding to Chandler could also benefit
from the ability to dynamically reload their code, but your proposal
also means that all developers will face a steeper Chandler learning
curve, and I think our efforts would be better spent simplifying the
model rather than complicating it.
Instead, I'd suggest a much simpler strategy:
- Build a mechanism for schema upgrades that runs against a closed
repository; it could run as a separate tool, or at Chandler startup time
when Chandler detects an older repository.
- Take advantage of John's suggestion to have the upgrade mechanism be
cognizant of our ability to discard and reload UI blocks completely;
only worry about upgrading content item schema and instances, and
preference/account settings.
...Bryan
Phillip J. Eby wrote:
Overview
========
With the advent of usable calendaring in 0.6, we have a new and scary
thing to think about: needing to support actual users. :) Or more
specifically, being able to upgrade a Chandler installation without
recreating all its data.
There are four kinds of things that we need to be able to upgrade:
1. Python code
2. Parcel-defined items, including UI items
3. Parcel-defined repository schema
4. User data that may need to be changed to reflect a schema change
Few - if any - of these items can currently be upgraded without
recreating your repository. Many are largely unexplored problems.
Luckily, we don't need to solve all of these upgrade problems for 0.6,
although now is a good time to start thinking about them, to make sure
that we have at least some basis for doing them in the future.
In this proposal, I'll be focusing first on how we can make it
possible to make code and UI changes without needing even to *restart*
Chandler, so that developers can make and test changes more quickly.
But I'll also be exploring what we can do to detect schema changes or
parcel version changes, so that we're in a better position to support
future upgrades.
Reloading Code
==============
In general, reloading Python code is a hard problem to completely
solve. This is because a module that imports another module may use
imported objects during its initialization - for example to subclass
an imported class. This means that if the imported module is
reloaded, the importing module can become out-of-date.
However, for most simple development use cases - which mainly involve
changes to functions or to methods of existing classes - it should be
possible to work around this issue. I propose to add a metaclass to
the schema API that will allow classes to be redefined during a
reload() operation in such a way that the original class is modified
in-place, instead of being replaced with a new class. This will allow
a simple reload() operation on a module to update the methods of a
class. And, by default, Item classes will have this ability.
Non-Item classes will need to make explicit use of the metaclass.
There are, however, some side effects. The metaclass will have no way
to know whether a reload is taking place, except by whether there is
already a symbol of the same name as the class in the module. When a
module is reloaded, the existing version of the class will still be in
the module's dictionary when the new version is being defined. So,
the metaclass will check for the existing class, and then update that
existing class instead of replacing it.
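The in-place update described above can be sketched with a small
metaclass. This is only a hypothetical illustration - it uses a
registry in place of the module-dictionary check, and modern Python
``metaclass=`` syntax - not the actual ``schema`` API::

```python
class ReloadableClass(type):
    """Sketch of a reload-friendly metaclass: redefining a class of
    the same name updates the original class object in-place, so
    existing instances and importers pick up the new methods."""

    _registry = {}  # stands in for checking the module's dictionary

    def __new__(meta, name, bases, namespace):
        key = (namespace.get('__module__'), name)
        existing = meta._registry.get(key)
        if existing is not None:
            # A class by this name was already defined: treat this as
            # a reload, and copy the new definitions onto the old
            # class object instead of creating a new one.
            for attr, value in namespace.items():
                setattr(existing, attr, value)
            return existing
        cls = super().__new__(meta, name, bases, namespace)
        meta._registry[key] = cls
        return cls


class Counter(metaclass=ReloadableClass):
    def describe(self):
        return "version 1"

old_ref = Counter          # simulates another module's import

class Counter(metaclass=ReloadableClass):   # simulated reload()
    def describe(self):
        return "version 2"
```

Note that the second definition's bases are simply ignored here; a
real implementation would want to detect differing bases and raise an
error, as discussed under "Name Collisions" below.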
Name Collisions
---------------
The downside to this approach is that it can be fooled into thinking a
reload is taking place, if an object of the same name already exists
in the module at that point in time. For example, this is perfectly
legal Python code, but will not work the same way once the metaclass
is used::
    from somewhere import SomeItemClass

    class SomeItemClass(SomeItemClass):
        def foo(self):
            return self.bar
Without the metaclass, this does exactly what it looks like it does -
it creates a ``SomeItemClass`` subclass of
``somewhere.SomeItemClass``. But *with* the metaclass, this will
*overwrite* ``somewhere.SomeItemClass`` with the contents of the new
class, because the metaclass will think you are reloading the module.
Actually, in this simple example, the metaclass could check the
__module__ of the class in question, and give you an error message.
The error would occur even at initial import, and you'd quickly change
your code to something like this::

    from somewhere import SomeItemClass as _SomeItemClass

    class SomeItemClass(_SomeItemClass):
        def foo(self):
            return self.bar
which would immediately fix the problem. However, if you do something
like this::

    class SomeItemClass(schema.Item):
        pass

    class SomeItemClass(SomeItemClass):
        pass
there is no way to detect the problem, at least if we also allow
changing a class' inheritance tree when code is reloaded. If we
require a class' base classes to remain the same across reloads, then
we could detect this error by virtue of the different inheritance, and
we could again give you an error message so you'd change your code.
This is probably the best option, although it prevents you changing a
class' bases without restarting Chandler. I would expect base class
changes to be rare, however, so this is probably an acceptable
convenience vs. safety tradeoff. I propose the error message for any
of the above collisions to read something like::

    NameError: SomeItemClass already defined in module blah.blah;
    please rename either the existing class or the new class
And it would occur as soon as the name collision exists, not just at
reload time. However, if you only introduce the collision between
reloads, then of course it will occur when you reload.
The metaclass would be called ``schema.ReloadableClass``, so if you
need to use it in a non-Item class, you would do something like::
    class MyArbitraryNonItemClass(SomeBase):
        __metaclass__ = schema.ReloadableClass
And the same name collision rules would apply as for item classes.
Reloading Functions
-------------------
To support reloading of module-level functions, there will be a
``schema.reloadable`` decorator, used as follows::
    @schema.reloadable
    def some_function(some_arg, other_arg, ...):
        # whatever
The purpose of this decorator is to allow a function to be updated
in-place, even if another module has already imported it. The only
time you would use this is if you are changing the function and want
to reload it. In other words, the function would normally look like
this::
    def some_function(some_arg, other_arg, ...):
        # whatever
If you need to change the function while Chandler is running, then you
would add the ``@schema.reloadable`` line, make the change, and reload
the module. But, before you check your changes back in to Subversion,
you should remove the decorator, just as you would remove debugging
prints. It's strictly a development tool, needed only for top-level
functions, and only ones that you're editing while Chandler is running.
There are some rather strict limitations on what this decorator can
do, by the way. It must be the "outermost" (first) decorator for a
given function, and any nested decorators must preserve the function
name in any transform. You won't be able to add new required
arguments, or rename the previous arguments. However, these kinds of
changes are unlikely to be the sort you could make without restarting
Chandler anyway.
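One plausible way to implement such a decorator is to hand importers a
stable proxy function whose target is swapped when the function is
redefined. The following is only a sketch under assumed semantics -
the registry and the standalone ``reloadable`` name are hypothetical -
not the actual ``schema.reloadable`` implementation::

```python
import functools

_registry = {}  # (module name, function name) -> stable proxy

def reloadable(func):
    """Return a stable proxy for *func*. Re-decorating a function of
    the same name repoints the existing proxy at the new code, so any
    module that imported the proxy earlier calls the updated body."""
    key = (func.__module__, func.__name__)
    proxy = _registry.get(key)
    if proxy is None:
        @functools.wraps(func)
        def proxy(*args, **kwargs):
            # Dispatch through the attribute so the target can change.
            return proxy.__wrapped__(*args, **kwargs)
        _registry[key] = proxy
    proxy.__wrapped__ = func  # point the proxy at the newest definition
    return proxy


@reloadable
def make_menu():
    return "old menu"

imported_ref = make_menu   # as if another module did "from x import make_menu"

@reloadable                # simulated edit-and-reload of the module
def make_menu():
    return "new menu"
```

This also illustrates why the decorator must be outermost and why the
function name must be preserved: the (module, name) pair is the only
identity the proxy registry has to match old and new definitions.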
The most likely place where you'd need to use this decorator right now
is on ``installParcel()`` functions that are defined in one module,
but *used* in another via importing. This would also apply to utility
routines defined in one module, but imported in another module for use
by an ``installParcel()`` function. For example, if you have a parcel
that does this::

    from some.where import createMenus

    def installParcel(parcel, oldVersion=None):
        createMenus(parcel)
You would need to add the ``@schema.reloadable`` decorator to the
``createMenus()`` function definition in ``some.where`` if you wanted
to change ``createMenus()`` without restarting Chandler. (Of course,
you would then also need to reload the parcels that are using the
``createMenus()`` function, which is the subject of the next section
of this proposal.)
Updating Parcel-Defined Items and UI
====================================
Merely reloading a Python module doesn't affect what items are in the
repository, even if you've edited the ``installParcel()`` function or
a utility function it calls. So, there needs to be a way to reload a
parcel and update the items it contains.
Luckily, the mechanisms normally used in ``installParcel()`` should
update existing items in-place, so really the only special thing that
needs to be done to allow updating on-the-fly is providing a way to
re-invoke ``installParcel()``.
My current thought is that the way to expose this API would be to add
a ``reload()`` method to ``schema.ns()``, e.g.::
    pim = schema.ns('osaf.pim', view)
    pim.reload()  # reload the osaf.pim parcel (but not subparcels!)
This would perform a reload of the module (and the package, if the
parcel is a package), and then reinvoke the ``installParcel()`` for
the parcel, to reload the items. Since this would also take care of
reloading code, this would probably be the thing to run to update a
changed parcel. Someone could perhaps provide a test-menu option to
do this, that would ask for the parcel name. Of course, it could also
be done by just dropping into a PyShell. Users of the 'headless'
utility, or those running Chandler under a debugger, could also invoke
the operation directly.
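In outline, the mechanics might look roughly like this standalone
sketch - ``reload_parcel`` and the dict-as-parcel are hypothetical
stand-ins; the real ``schema.ns`` machinery is more involved::

```python
import importlib

def reload_parcel(module_name, parcel, old_version=None):
    """Re-import a parcel's module so code changes take effect, then
    re-run its installParcel() so the items it defines are updated
    in-place in the repository."""
    module = importlib.import_module(module_name)
    module = importlib.reload(module)            # refresh the code
    module.installParcel(parcel, old_version)    # refresh the items
    return module
```

Because ``installParcel()`` is written to be idempotent - it updates
existing items rather than recreating them - re-invoking it is enough
to bring the parcel's items in line with the edited code.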
This feature will *not*, however, handle general updates to the
repository schema. In fact, only one kind of schema change will be
supported: adding new classes. If you add a class to a parcel and
reload it -- assuming you've done the import in ``__init__.py``, if
needed -- then the new kind will become available. Changes to existing
classes will be ignored unless you recreate the repository - which is
why the next section will talk about...
Updating Chandler Schema
========================
"Do you, Programmer, take this Object to be part of the persistent
state of your application, to have and to hold, through maintenance
and iterations, for past and future versions, as long as the
application shall live?"
"Erm, can I get back to you on that?"
-- from "Making a class serializable",
http://www.erights.org/e/StateSerialization.html
In general, schema evolution is a hard problem. So what I'd like to
do here is first lay out some background to show just *how* hard, and
then backpedal a bit to what more specific goals I think are
achievable with what we're doing in 0.6 and 0.7.
Schema Additions
----------------
But first, something simple. Additive changes to the schema are
relatively easy compared to other kinds of change, since they can
sometimes be done without changing existing items. In fact, adding
new kinds can be done without even restarting Chandler, as we saw in
the previous section. This is especially nice in that it means we'll
be able to download and install new parcels while Chandler is running
- but upgrading an already-installed parcel will require a restart for
stability.
Adding new attributes to existing kinds is a little trickier, because
right now the schema API doesn't scan a kind's attributes if the kind
already exists in the repository. But we could add something that
would check a parcel's version and do a thorough re-scan of every kind
defined by the parcel, whenever the parcel version changed. This
would be part of an at-startup check of parcel versions.
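In outline, such a check might look like the following - all names
here are hypothetical, and the real parcel and kind objects would be
repository items rather than these stubs::

```python
def sync_parcel_schema(parcel, code_version, kinds):
    """If the stored parcel version differs from the version declared
    in the code, re-scan every kind the parcel defines so that newly
    added attributes get registered, then record the new version."""
    if getattr(parcel, "version", None) == code_version:
        return []                        # schema is already current
    rescanned = []
    for kind in kinds:
        kind.rescan_attributes()         # re-read attribute declarations
        rescanned.append(kind.name)
    parcel.version = code_version
    return rescanned
```

The point of keying the re-scan on the parcel version, rather than
scanning every kind at every startup, is to keep the common case (no
schema change) cheap.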
The major complication introduced by adding attributes is attributes
that should have a value for existing items in the repository. In
terms of repository stability, this is not a big deal, as the
repository doesn't care that the attributes are missing unless they
are marked ``required``, and you run ``check()``.
However, for application functionality, it means that new versions of
parcels must either:
1. Never assume an attribute exists, unless it was supplied and
initialized by the first public release of the parcel, and *every
release since*. Or,
2. Use ``defaultValue``, so the attribute always appears to have a value.
The downside of option 1 is that you have to keep track of what you
released, and when you changed it, "through maintenance and
iterations, for past and future versions, as long as the application
shall live." The downside of option 2 is that the attribute can never
*not* have a value, and there may be other limitations associated with
``defaultValue``, which we have mostly not been using for some time.
Note that ``defaultValue`` is different from ``initialValue``. An
``initialValue`` is set when an item is created. If you later delete
the attribute, the ``initialValue`` does not come back. Similarly, if
you add a new attribute with an ``initialValue``, or change the
``initialValue`` of an existing attribute definition, this does not
affect already-created items, even if they don't have a value for that
attribute.
Incidentally, this is somewhat related to the issue that we've
sometimes had with ``initialValue``, in that we would often like an
attribute's initial value to be computed, rather than a constant. For
example, creation and modification dates want to default to the
``datetime.now()`` at the time the item is created. It may also be
that we would like to be able to have some code run for existing items
and set a computed initial value, when a new attribute is added.
One way we can accomplish this is to have the relevant
``installParcel()`` function include a block of code like::
    def installParcel(parcel, oldVersion=None):
        # ...
        for item in SomeChangedClass.iterItems(parcel.itsView):
            if not hasattr(item, "newattr"):
                item.newattr = some_calculation(item)
Since ``installParcel()`` is only invoked when a parcel is installed,
upgraded, or explicitly reloaded, this operation would be reasonable
in many cases, especially since it will not do any work when the
parcel is first installed (because there will be no items of the
changed class yet).
However, for upgrades and reloads, it could possibly be quite slow,
and might need some way to display or update a progress meter. But
the mechanism for this needs to somehow be decoupled from the schema
API and the standard Chandler UI, because it also needs to work when
run under ``headless``, and of course unit tests need to work too.
Oh, and don't forget - you can't ever remove that upgrade code from
``installParcel()``, unless of course you stop using the attribute.
Ah, if only all schema evolution issues were as simple as additions! :)
Moves and Renames
-----------------
Additions, alas, are not the only kind of schema changes we're likely
to have in future versions. It's extremely likely that in 0.7 we'll
be doing a lot of moves and renames to finish our parcel/package
flattening and the move to a standardized layout for API packages.
But, we currently use the names and locations of modules, classes, and
attributes to synchronize our schema definition with the schema stored
in the repository. This means that if we move a class around, or
rename it, it no longer has an identifier matching that of the Kind in
the repository. So, even if we made no *actual* change to the schema,
we can completely trash someone's existing data just by moving or
renaming things in the normal course of refactoring.
Indeed, the repository stores in each Kind a reference to the class
that implements it, so even if we grabbed the existing Kind and tried
to move or rename it in our ``installParcel()`` routines, we would get
an error. Even though the old class doesn't exist any more, the
repository would try to load it because it's still referred to by the
existing Kind. Andi says this particular issue can probably be worked
around in the repository, but it's only the tip of the iceberg here.
The real issue here is *identification*. How do we uniquely identify
a class or attribute or parcel, once it has moved? One possibility is
to give each such object a list of all the paths where it previously
lived in other versions, so that it could check all those locations and
then move the relevant item to the new location. Of course, such a
list could grow longer over time, and could never be removed.
Another possibility is to assign fixed UUIDs to the items. If you
moved a class, attribute, or parcel, you would first need to find out
its current UUID, and add that information to the source code. Then,
when you move or rename an item, its UUID would move with it, and
remain in sync.
How would we initially assign UUIDs? Well, there is a kind of UUID
called a "namespace UUID" which would be useful for this purpose. A
namespace UUID is generated by hashing a name string with a base UUID,
to create a new UUID. This would allow us to automatically generate
fixed UUIDs for items based on their name, so that it won't be
necessary to manually assign individual UUIDs for every parcel, class,
and attribute. But when you move or rename something, you would need
to find out its assigned UUID, and add it to the code, or else you'll
be creating a new item and abandoning the old one.
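Python's standard library already supports name-based ("namespace")
UUIDs, so generation itself is cheap. A sketch, with a made-up base
UUID standing in for whatever fixed base the schema API would adopt::

```python
import uuid

# Hypothetical base UUID; the real one would be chosen once for the
# schema API and never changed afterward.
SCHEMA_BASE = uuid.UUID("3b0f7a2c-1e45-4d8a-9c31-0a5b2e7d9f10")

def schema_uuid(dotted_name):
    """Hash a dotted name against the base UUID to produce a stable,
    name-derived UUID (an RFC 4122 version-5 "namespace" UUID)."""
    return uuid.uuid5(SCHEMA_BASE, dotted_name)
```

The same name always hashes to the same UUID, but a moved or renamed
item hashes to a different one - which is exactly why the original
UUID must be recorded in the code after a move.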
Error checking, of course, is a problem with this scheme, since just
renaming or moving something isn't going to move the UUID. Also, if
in one upgrade you rename class X to "Y", and then in a later upgrade
you create a new "X", the new "X" will be assigned the same UUID that
the original X had, and now you have to detect the collision.
Alternately, we could require that parcels and classes be manually
assigned UUIDs, but we could allow attribute UUIDs to be automatically
generated. It's easier to crosscheck a single class' attributes for
naming collisions in the "rename X to Y and create a new X" case, and
for everything else it guarantees that the UUID will move with the
parcel or class, and ensures continuity.
At Chandler's current size, that's a few hundred UUIDs we would need
to manually assign. The process would simply require running a tool
like ``uuidgen`` to generate the UUIDs, and then adding them to the
``kindInfo()`` for each class, and a ``__parcel_id__ = "..."`` assignment
in the parcel's main module. It also adds a step to creating a new kind
or parcel, but not a particularly difficult one.
Changes and Deletions
---------------------
The next type of schema change that can occur is changes to metadata.
For classes, clouds, and parcels, metadata changes are fairly
harmless, as they don't usually affect the user's data in any way.
For attributes, however, changes to metadata like type or cardinality
could require changing all existing values of that attribute. Such
changes would actually require creating a new attribute and copying
the old values over to it, then deleting the old attribute. And it's
not immediately obvious how we'd go about doing that.
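The copy-then-delete migration could follow a pattern like this - a
hypothetical helper shown over plain objects; real repository items
would additionally need the new attribute added to the schema first::

```python
def migrate_attribute(items, old_name, new_name, convert):
    """Copy each item's old attribute value to a new attribute,
    converting it on the way (e.g. to a new type or cardinality),
    then delete the old attribute from the item."""
    for item in items:
        if hasattr(item, old_name):
            setattr(item, new_name, convert(getattr(item, old_name)))
            delattr(item, old_name)
```

For example, turning a string-valued ``priority`` attribute into an
integer-valued ``priorityLevel`` would pass ``int`` as the converter.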
Deletions from the schema are likely to be rare, at least when viewed
in the "upgrade" direction. But we may also need to consider the
"downgrade" case, where someone wants to revert to a previous version
of a package, and therefore needs to undo schema changes. Implicitly
this can involve removing some part of the schema that was added,
although in practice there is no actual need to remove it, since it
will be inaccessible. It does mean, however, that the upgrade
mechanism will eventually need to be robust in the face of repeated
upgrades.
Analysis
========
Providing robust support for schema changes is an "interesting"
problem. Once a schema is officially released, it appears that
constant vigilance will be required, to ensure that changes always
provide a migration path for users' data. We do not currently have
any ways to validate changes made between a particular pair of schema
versions, nor to track the changes made (other than indirectly, via
source code changes).
With sufficient care and infrastructure support, we can relatively
easily support manual schema upgrades, in the sense of having
installParcel() make the changes, if we entirely forbid certain
classes of schema change that could not be implemented in this way.
However, the amount of developer care required currently appears
prohibitive, in the sense that it's going to seriously impede our
flexibility to refactor.
In order to remove these impediments, we would have to have some way
of automatically tracking changes to the schema, as a kind of revision
log that would be kept alongside the code in Subversion. When
changing between versions of a parcel, it should be possible for the
system to automatically apply the relevant changes to both the schema,
and the corresponding data. This logging system would then also be
able to prohibit (via error messages) any changes that could not be
supported without extending the revision system. The actual tracking
mechanism will probably need to use some of the techniques described
here:
http://citeseer.ist.psu.edu/staudtlerner96model.html
There would also need to be some tools to manipulate the log. For
example, to mark a point in the log as corresponding to a particular
release version, to simplify applying changes. Or to list the changes
between releases, etc. I'm not going to attempt to fully specify the
system at this time, since I don't think we can reasonably include
something like it in 0.6. We may be forced to simply require that 0.6
-> 0.7 upgrades go through an export-and-import process.
My original intent with this proposal was to include some minimal
schema versioning support in 0.6, but as the analysis
progressed it has become apparent that, given the complexity and the
lateness of the date, it's going to be simpler to just introduce
parcel versioning in 0.7 alongside the introduction of Python Eggs,
since eggs include version metadata already, and they provide a
natural boundary for the schema revision tracking system described above.
In short, I think the only parts of these proposals that can reasonably
be implemented for 0.6 are the parts to support reloading code and
parcel-defined items, which should be helpful to developers working on
code changes that currently require a Chandler restart before testing.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Open Source Applications Foundation "Dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/dev