Hi everyone.

I have been researching details of how we'll use XML with Python for the 
Distro-Constructor.  I have looked into different schemas and different 
parser types.

Schemas
=======

I have been researching the viability of various schemas for 
defining/validating the (XML) Distro-Constructor manifest.  Contenders 
were DTDs, W3C schemas and RelaxNG schemas.  I wanted the validator to 
be something which is included with Solaris.  I used xmllint (which 
comes as part of libxml2) as a validator, to test whether validation 
against a particular kind of schema is supported on Solaris.

W3C
---

First off, it appears that W3C schema support isn't fully implemented.  
xmllint will validate the schema itself (showing grammatical errors), 
but won't call out errors in the XML document.  For example, when I 
specified in the schema that a particular element was required and then 
deleted that element from the XML doc, xmllint still said the XML 
validated.  Here is my simple example:

schema file:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:element name="treeroot"/>
        <xs:attribute name="numval" type="xs:integer"/>
        <xs:complexType name="compType1">
                <xs:sequence>
                        <xs:element name="reqd1" type="xs:string"
                            minOccurs="1" maxOccurs="1"/>
                        <xs:element name="opt2"
                            type="xs:string" minOccurs="0" maxOccurs="1"/>
                        <xs:element name="multireqd3" type="xs:string"
                                minOccurs="0" maxOccurs="unbounded"/>
                </xs:sequence>
        </xs:complexType>
</xs:schema>

XML doc:

<treeroot numval="123">
        <xopt2>"opt2"</xopt2>
</treeroot>

output:

$ xmllint --schema test2_schema test2.1.xml
<?xml version="1.0"?>
<treeroot numval="123">
        <xopt2>"opt2"</xopt2>
</treeroot>
test2.1.xml validates

I was expecting to see that reqd1 was missing...

DTDs
----

DTDs work, but they are not that flexible.  Most docs which I have read 
suggest using a schema instead of a DTD.  DTDs lack type and pattern 
checking of data inside of elements, and offer only limited checking of 
attributes.  While this checking could be done after reading in the 
data, it makes the most sense to have the data validated before it gets 
to the application.  It is less work for the application, since it can 
make assumptions about the data (what is an int vs a string, for 
example).  Another good reason to ditch DTDs is that they are harder to 
read than schemas.  DTDs use a kind of BNF grammar, whereas schemas have 
a nested, more XML-like syntax.
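
To make the "checking after reading in the data" case concrete, here is 
a minimal sketch of what the application would have to do itself if the 
validator cannot check types.  It uses Python's standard xml.dom.minidom 
module and the numval attribute from my W3C example; the function name 
read_numval is just something I made up for illustration.

```python
from xml.dom import minidom


def read_numval(xml_text):
    """Return the root element's numval attribute as an int.

    This is the kind of check the application has to do by hand when
    the schema language cannot express data types.
    """
    root = minidom.parseString(xml_text).documentElement
    raw = root.getAttribute("numval")  # "" if the attribute is absent
    try:
        return int(raw)
    except ValueError:
        raise ValueError("numval must be an integer, got %r" % raw)


print(read_numval('<treeroot numval="123"/>'))
```

Every typed field would need a similar hand-written check, which is the 
extra work a type-aware schema is meant to remove.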

RelaxNG
-------

Of the three, RelaxNG schemas seem to be the best option:
- They are reasonably simple.
- They are flexible.  In fact, one can implement a superset of W3C documents
using RelaxNG.
- They are reasonably easy to read (better than DTDs).
- They work on Solaris.

If W3C schemas worked on Solaris it would be a close call.  W3C schemas 
would do stricter checking (which is what this is all about anyway).  
I also believe they are slightly easier to read.  That said, validation 
using W3C schemas doesn't work on Solaris, so it's a moot point.
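
As a rough sketch of how RelaxNG validation could be driven from Python, 
the snippet below shells out to xmllint with its --relaxng option.  The 
RelaxNG schema is my own approximation of the W3C example above, not a 
proposed manifest schema, and the function names and the manifest.rng 
file name are made up for illustration.

```python
import os
import shutil
import subprocess
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical RelaxNG approximation of the W3C example: numval must be
# an integer, reqd1 is required, opt2 is optional, multireqd3 may repeat.
RNG_SCHEMA = """<element name="treeroot"
    xmlns="http://relaxng.org/ns/structure/1.0"
    datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <attribute name="numval"><data type="integer"/></attribute>
  <element name="reqd1"><text/></element>
  <optional><element name="opt2"><text/></element></optional>
  <zeroOrMore><element name="multireqd3"><text/></element></zeroOrMore>
</element>
"""

# Sanity check only: confirms the schema text is well-formed XML.
ET.fromstring(RNG_SCHEMA)


def xmllint_command(schema_path, doc_path):
    """Build the xmllint command line for RelaxNG validation."""
    return ["xmllint", "--noout", "--relaxng", schema_path, doc_path]


def validate(doc_path, schema_text=RNG_SCHEMA):
    """Return True if doc_path validates, False if xmllint reports a
    problem, or None if xmllint is not installed."""
    if shutil.which("xmllint") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        schema_path = os.path.join(tmp, "manifest.rng")
        with open(schema_path, "w", encoding="utf-8") as f:
            f.write(schema_text)
        result = subprocess.run(xmllint_command(schema_path, doc_path),
                                capture_output=True, text=True)
    # xmllint exits with a nonzero status when validation fails.
    return result.returncode == 0
```

Calling validate("test2.1.xml") on the document from the W3C example 
should then be the test of whether the missing reqd1 gets reported.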

Parsers
=======

There are two classes of XML parsers for Python: SAX and DOM.  SAX, or 
"Simple API for XML", is a simple parser which hands off the bits and 
pieces of an XML document as it reads them, to user-written callbacks 
which can process the data as they see fit.  It is up to these callbacks 
to build any internal data structures to store them.  DOM, or "Document 
Object Model", builds an internal tree containing the data and provides 
functions to extract the data.

SAX is very flexible in that it leaves it to the application to decide 
which XML information gets used.  The app can optimize memory usage and 
performance based on how it stores the data.  It is more complicated to 
use, however, for precisely the same reason:  the app has to do more to 
store the data.
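
To show what "the app has to do more" means, here is a minimal SAX 
sketch using Python's standard xml.sax module.  The ManifestHandler 
class and the sample document are made up for illustration; the point is 
that the callbacks must build the data structure themselves.

```python
import xml.sax


class ManifestHandler(xml.sax.ContentHandler):
    """Collect the text of each leaf element into a dict."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self._current = None   # name of the element being read
        self._text = []        # text chunks seen inside it

    def startElement(self, name, attrs):
        self._current = name
        self._text = []

    def characters(self, content):
        # The parser may deliver text in several chunks.
        if self._current is not None:
            self._text.append(content)

    def endElement(self, name):
        if self._current == name:
            text = "".join(self._text).strip()
            if text:
                self.values[name] = text
        self._current = None


SAMPLE = (b"<treeroot numval='123'>"
          b"<reqd1>first</reqd1><opt2>second</opt2>"
          b"</treeroot>")

handler = ManifestHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.values)
```

Anything the handler does not store is gone once parsing finishes, which 
is both the memory advantage and the extra burden.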

The flip side of this is DOM.  The main complaint about DOM is that it 
can use lots of memory for large XML files, because it creates its own 
data tree and stores all of the data it reads.
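
For comparison, here is the same kind of lookup done with DOM, using 
Python's standard xml.dom.minidom module.  The helper name element_text 
and the sample document are made up for illustration.

```python
from xml.dom import minidom

SAMPLE = ("<treeroot numval='123'>"
          "<reqd1>first</reqd1><opt2>second</opt2>"
          "</treeroot>")

# The whole document is parsed into an in-memory tree up front.
doc = minidom.parseString(SAMPLE)
root = doc.documentElement


def element_text(parent, tag):
    """Return the text of the first <tag> under parent, or None."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes:
        return None
    return "".join(child.data for child in nodes[0].childNodes
                   if child.nodeType == child.TEXT_NODE)


print(root.getAttribute("numval"))
print(element_text(root, "reqd1"))
```

No callbacks are needed; the cost is that the full tree stays in memory 
for as long as the document object is kept around.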

Of SAX and DOM, I suggest DOM better suits our purpose.  DC's manifest 
is bounded and fairly small, so memory won't be an issue.  DOM also 
provides routines to manipulate the data tree, allowing for layering.  
Several manifests can be read in, and their data updated/appended in a 
main data tree.  Finally, DOM allows for writing out a tree into a new 
XML file.  This allows for a tree to be customized, and a new XML file 
containing all customizations to be written and saved.
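
Here is a rough sketch of the layering idea with xml.dom.minidom, under 
the assumption that an overlay manifest simply replaces or appends 
top-level elements by tag name.  The real merge rules for DC are not 
decided; the element names (manifest, name, size, extra) and the 
functions layer and save are made up for illustration.

```python
from xml.dom import minidom


def layer(base_doc, overlay_doc):
    """Merge the overlay's top-level elements into base_doc.

    Assumed rule: an overlay element replaces the first base element
    with the same tag name, otherwise it is appended.
    """
    base_root = base_doc.documentElement
    for node in overlay_doc.documentElement.childNodes:
        if node.nodeType != node.ELEMENT_NODE:
            continue
        # Nodes belong to one document, so copy into the base document.
        imported = base_doc.importNode(node, True)
        existing = [c for c in base_root.childNodes
                    if c.nodeType == c.ELEMENT_NODE
                    and c.tagName == node.tagName]
        if existing:
            base_root.replaceChild(imported, existing[0])
        else:
            base_root.appendChild(imported)
    return base_doc


def save(doc, path):
    """Write the customized tree out as a new XML file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(doc.toxml())


base = minidom.parseString(
    "<manifest><name>default</name><size>10</size></manifest>")
overlay = minidom.parseString(
    "<manifest><size>20</size><extra>yes</extra></manifest>")

merged = layer(base, overlay)
print(merged.documentElement.toxml())
```

Doing the same with SAX would mean designing and maintaining our own 
tree structure plus a writer for it.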

These are my findings.  If anyone has anything to add, comment on or 
question, please let me know.

    Thanks,
    Jack
