[Xerces-2]: schema parsing design; discussion starter [long]

neilg Tue, 31 Jul 2001 09:11:42 -0700
Hi folks,

You may have noticed that the number of commits from us Torontonians
has gone down, particularly over the past two weeks.  We
haven't gone away; we've just been thinking long and hard about
how to integrate schema support into Xerces2.  As those of you
who have looked at Xerces1's schema implementation will know, it's
neither pretty nor efficient.  It's basically a bunch of hacks
built on a jury-rigged foundation.  We're really hoping to do
things right in Xerces2, and that's why we've been slow to start
putting things down.

In this message I hope to provide a bird's-eye--or maybe spy
satellite-camera :-)--view of the kind of thing we're thinking of.

We've got the beginning of a skeleton outline based on these
ideas, but there are lots of details to fill in.  Hopefully this
post will stimulate discussion, and by the middle of next week
we're hoping we can integrate the outline into Xerces2 (which
should have had a beta release by then).  Once we've got a
sturdier skeleton in place, we should all be in a position to
volunteer to fill in portions of the outline.

I should note that this only covers the process of converting
schema documents into internal grammar representations.  We'll
post about the organization of the grammars, GrammarPool,
validation etc.  later.

To the design:

Schemas can of course be composed of several schema documents;
one schema can import, include or redefine others.  Therefore, a
schema document-centric way of parsing schemas really doesn't
seem to make sense; what seems to be needed at the heart of
things is a class whose function is to co-ordinate the
construction of one (or more) schema grammars from a set of
schema documents.  We propose to call this a SchemaHandler.  In a
nutshell, its job is to collect all the documents that we need to
parse, and farm out the parsing to objects which know how to do
it.

In more detail, there should be three phases to this process.
First, given a schema document, the SchemaHandler needs to find
all schema documents that it <include>s, <redefine>s or
<import>s.  Then it needs to do the same thing for all these
schema documents recursively.  The result of this
will be a set of DOM trees, one for each schema document.
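
A minimal sketch of this first phase, in present-day Java rather than
anything that exists in Xerces2.  The class name SchemaDocumentCollector,
the in-memory "sources" map, and the use of schemaLocation values as
plain keys are all assumptions made for illustration; a real
implementation would resolve locations against a base URI and go through
an entity resolver.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: gathers one DOM tree per reachable schema document.
class SchemaDocumentCollector {
    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    // Parses XML text into a namespace-aware DOM tree.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse schema document", e);
        }
    }

    // Returns the schemaLocation of every top-level include, import or redefine.
    static List<String> referencedLocations(Document doc) {
        List<String> locations = new ArrayList<>();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE || !XSD_NS.equals(n.getNamespaceURI())) {
                continue;
            }
            String name = n.getLocalName();
            if (name.equals("include") || name.equals("import") || name.equals("redefine")) {
                String location = ((Element) n).getAttribute("schemaLocation");
                if (!location.isEmpty()) {
                    locations.add(location);
                }
            }
        }
        return locations;
    }

    // Follows references transitively; documents already collected are
    // skipped, so circular includes terminate.
    // "sources" maps a location to schema text (an assumption of this sketch).
    static Map<String, Document> collect(String rootLocation, Map<String, String> sources) {
        Map<String, Document> trees = new LinkedHashMap<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(rootLocation);
        while (!pending.isEmpty()) {
            String location = pending.pop();
            String xml = sources.get(location);
            if (trees.containsKey(location) || xml == null) {
                continue; // already collected, or not resolvable in this sketch
            }
            Document doc = parse(xml);
            trees.put(location, doc);
            for (String referenced : referencedLocations(doc)) {
                pending.push(referenced);
            }
        }
        return trees;
    }
}
```

The worklist plus the containsKey check is what keeps two documents that
include each other from being parsed forever.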

As something of an aside, I should note that we've looked in a fair
bit of detail at Xalan's DTM (Document Table Model) in hopes that it
might be a lighter, more memory-efficient substitute for the DOM in
our schema implementation.  We're certainly very much open to the
idea that Xerces should, at some point, acquire the ability to
produce a DTM from an XML document.  But, since it seems critically
important that Xerces2 get schema-support as quickly as possible, we
had to conclude that adding DTM support now would be a bad idea.
This is largely a matter of time:  Not only would we have to
implement a document table etc.--as well as reintroducing the concept
of a StringPool into Xerces2, something that everyone's been trying
to avoid--but we would also have to implement direct support for
XPath as well, since this is at the core of the DTM as the Xalan
community has defined it.  So we thought that, once Xerces2 has
schema support, we could look at adding a DTM facility to the parser
generally and then think about switching the schema parsing component
to use it.

Now back to the main design:
Because each schema document has certain properties that hold
throughout it (namespace bindings, values for elementFormDefault,
blockDefault etc.), we're planning to wrap these DOM trees that
we produce in the first phase of schema parsing in an
object called an XMLSchemaDocument.  Our SchemaHandler will
maintain a list of available XMLSchemaDocuments and will also
keep a record of the relationships between them.
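
To make the wrapper idea concrete, here is a rough sketch of what such an
object might hold.  The class is called XMLSchemaDocumentSketch to stress
that its fields and methods are assumptions for illustration, not a
proposed API; it relies on the standard DOM Level 3 call
lookupNamespaceURI for prefix resolution.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical wrapper caching the document-wide properties of one schema document.
class XMLSchemaDocumentSketch {
    final Element root;
    final String targetNamespace;             // null when the attribute is absent
    final boolean elementsQualifiedByDefault; // elementFormDefault="qualified"
    final String blockDefault;                // "" when the attribute is absent

    XMLSchemaDocumentSketch(Document doc) {
        root = doc.getDocumentElement();
        targetNamespace = root.hasAttribute("targetNamespace")
                ? root.getAttribute("targetNamespace") : null;
        elementsQualifiedByDefault = "qualified".equals(root.getAttribute("elementFormDefault"));
        blockDefault = root.getAttribute("blockDefault");
    }

    // Convenience factory for this sketch: parse schema text and wrap it.
    static XMLSchemaDocumentSketch fromXml(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return new XMLSchemaDocumentSketch(doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse schema document", e);
        }
    }

    // Resolves a prefix using the bindings in scope on the <schema> element.
    String namespaceFor(String prefix) {
        return root.lookupNamespaceURI(prefix);
    }
}
```

Reading the defaults once, at wrap time, means a traverser never has to
walk back up to the root to find out what elementFormDefault was.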

In the second phase of processing, the SchemaHandler will go
through all the children of the roots of all these DOM trees.
The purpose of this operation is to identify all the named global
components we have access to.  The schema spec defines various
symbol spaces for components, and the SchemaHandler will maintain
a table for each symbol space.  Each entry of the table will be
identified with a QName (the local name of the global component
paired with the targetNamespace of the schema it came from); the
value of each entry will be a reference to the DOM node
corresponding to the declaration.  This should save us a great
deal of time in look-ups.  I should also note that we think
references to redefined components can be handled in this phase
as well.
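
The second phase could be sketched roughly as follows.  The names
GlobalComponentIndex and SymbolSpace are invented for this example, and
the handling of <redefine> mentioned above is deliberately left out; the
only thing the sketch relies on from the schema spec is that simple and
complex type definitions share one symbol space.

```java
import java.io.StringReader;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical index: one QName -> declaration-node table per symbol space.
class GlobalComponentIndex {
    enum SymbolSpace { TYPE, ELEMENT, ATTRIBUTE, ATTRIBUTE_GROUP, MODEL_GROUP, NOTATION }

    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";
    private final Map<SymbolSpace, Map<QName, Element>> tables = new EnumMap<>(SymbolSpace.class);

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse schema document", e);
        }
    }

    // Maps a top-level element name to its symbol space, or null if it declares nothing.
    static SymbolSpace spaceOf(String localName) {
        switch (localName) {
            case "simpleType":
            case "complexType":    return SymbolSpace.TYPE; // types share one space
            case "element":        return SymbolSpace.ELEMENT;
            case "attribute":      return SymbolSpace.ATTRIBUTE;
            case "attributeGroup": return SymbolSpace.ATTRIBUTE_GROUP;
            case "group":          return SymbolSpace.MODEL_GROUP;
            case "notation":       return SymbolSpace.NOTATION;
            default:               return null;
        }
    }

    // Records every named child of the root; nothing is traversed yet.
    void index(Document doc) {
        Element root = doc.getDocumentElement();
        String targetNamespace = root.getAttribute("targetNamespace"); // "" when absent
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE || !XSD_NS.equals(n.getNamespaceURI())) {
                continue;
            }
            Element declaration = (Element) n;
            SymbolSpace space = spaceOf(declaration.getLocalName());
            if (space == null || !declaration.hasAttribute("name")) {
                continue;
            }
            QName key = new QName(targetNamespace, declaration.getAttribute("name"));
            tables.computeIfAbsent(space, s -> new HashMap<>()).put(key, declaration);
        }
    }

    Element lookup(SymbolSpace space, QName name) {
        Map<QName, Element> table = tables.get(space);
        return table == null ? null : table.get(name);
    }
}
```

Because the tables hold references to DOM nodes rather than copies, a
look-up costs one hash probe and leads straight to the declaration that
still needs traversing.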

Once all our global declarations are identified, we'll begin
parsing (traversing) them, starting with the first declaration
from the first schema document we were asked to parse.  We
propose to define a set of Traverser classes, more or less
corresponding to each kind of schema component:  an
ElementTraverser class, a SimpleTypeTraverser, etc.
The SchemaHandler will call each of these traversers as
appropriate when encountering a given DOM node.  Once a node has
been parsed, we'll use one of the DOM node's flags to indicate
that it has been parsed, so that we can easily skip over it if we
encounter it later.
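
As an illustration of the dispatch-and-mark idea, here is a small sketch.
The Traverser interface and the register/traverseOnce methods are
hypothetical, and the "flag" is simulated with the standard DOM Level 3
setUserData/getUserData calls, which is an assumption of this example
rather than the mechanism we have in mind for Xerces2's own DOM.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical dispatcher: picks a traverser by element name, runs it at most once per node.
class TraverserDispatchSketch {
    // Stand-in for ElementTraverser, SimpleTypeTraverser, and so on.
    interface Traverser {
        void traverse(Element declaration);
    }

    private static final String PARSED_KEY = "sketch.parsed"; // user-data key used as the flag
    private final Map<String, Traverser> traversersByKind = new HashMap<>();

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse schema document", e);
        }
    }

    void register(String elementLocalName, Traverser traverser) {
        traversersByKind.put(elementLocalName, traverser);
    }

    // Returns true if the node was traversed by this call, false if it was
    // skipped (already parsed, or no traverser registered for its kind).
    boolean traverseOnce(Element declaration) {
        if (Boolean.TRUE.equals(declaration.getUserData(PARSED_KEY))) {
            return false;
        }
        Traverser traverser = traversersByKind.get(declaration.getLocalName());
        if (traverser == null) {
            return false;
        }
        traverser.traverse(declaration);
        declaration.setUserData(PARSED_KEY, Boolean.TRUE, null); // mark as parsed
        return true;
    }
}
```

A second call on the same node returns immediately, which is the "easily
skip over it" behaviour described above.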

When a traverser encounters a reference to a component, it will ask the
SchemaHandler to get the required information.
If the component has been parsed, the SchemaHandler will look up
the information in the grammar.  (Here I should point out that we
intend SchemaGrammar objects to have a one-to-one correspondence
with targetNamespaces.  That is, if a schema is encountered that
<import>s another schema, we'll end up producing two different
grammars.)  If the information is not in the grammar, the
SchemaHandler will locate the DOM node containing the relevant
declaration, determine if components of the schema currently
being parsed are allowed to access this component, and call the
relevant traverser for that component to provide the information.
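
The look-up order just described (grammar first, declaration table
second, traverse on demand) might be sketched like this.  Everything
here is a stand-in: ReferenceResolverSketch is an invented name, the
type parameter N plays the role of a DOM node, C plays the role of a
parsed component, and the access check is reduced to a comment.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import javax.xml.namespace.QName;

// Hypothetical resolver: N is the declaration node type, C the parsed component type.
class ReferenceResolverSketch<N, C> {
    // One "grammar" per target namespace, holding components parsed so far.
    private final Map<String, Map<QName, C>> grammars = new HashMap<>();
    // Phase-two table: QName -> declaration node, parsed or not.
    private final Map<QName, N> declarations;
    // Stand-in for "the relevant traverser for that component".
    private final Function<N, C> traverser;
    private int traversals = 0;

    ReferenceResolverSketch(Map<QName, N> declarations, Function<N, C> traverser) {
        this.declarations = declarations;
        this.traverser = traverser;
    }

    C resolve(QName reference) {
        // 1. Already parsed?  Then the grammar for that namespace has it.
        Map<QName, C> grammar = grammars.get(reference.getNamespaceURI());
        if (grammar != null && grammar.containsKey(reference)) {
            return grammar.get(reference);
        }
        // 2. Otherwise locate the declaration recorded in phase two.
        N declaration = declarations.get(reference);
        if (declaration == null) {
            throw new IllegalArgumentException("No declaration for " + reference);
        }
        // (A real handler would check here that the referring schema document
        // is allowed to see this component; omitted in this sketch.)
        // 3. Traverse on demand and store the result in the right grammar.
        C component = traverser.apply(declaration);
        traversals++;
        grammars.computeIfAbsent(reference.getNamespaceURI(), ns -> new HashMap<>())
                .put(reference, component);
        return component;
    }

    int traversalCount() { return traversals; }

    int grammarCount() { return grammars.size(); }
}
```

Note that this sketch stores a component only after its traverser
returns, so a declaration that refers to itself, directly or indirectly,
would recurse without end; handling that is one of the details still to
be worked out.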

This approach should allow us to localize knowledge about how to
parse a given kind of schema component in a specific object.  We
envision needing a series of helper classes to handle common
things like DOM traversal operations, and perhaps to hold
information that multiple traversers will need to access.  We're also
trying to structure these classes so that object creation is
minimized; e.g., we expect that a particular parser instance will
need only one SchemaHandler object, however many schema documents
it has to parse.

But, since the interaction between schema components can be very
complex, we certainly have many details yet to work out.
Nonetheless, I'm hoping that this outline won't confuse anyone,
and will get some discussion going on how to do things right, now
that we have all this experience in implementing this large and
complex specification.  At all events, we want to avoid saddling
Xerces2 with a schema implementation as inefficient and
unmaintainable as the one Xerces1 ended up with.

Cheers,
Neil


Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  416-448-3519, T/L 778-3519
E-mail:  [EMAIL PROTECTED]

