I'll try to answer both your question and Jeff Greif's, who replied
regarding what is demanded by the InfoSet. I have my own opinion on
this that doesn't necessarily agree with that of the W3C. Being at
odds with the W3C may be seen as rather like being at odds with God,
though I'm happy to say that I'm in pretty decent company lately
[those who don't like to listen to heretics, please close your
viewing windows now...].
Sixten Otto wrote:
> At 04:07 PM 4/9/02 +0000, Murray Altheim wrote:
> > Now, the prolog is not considered part of the document instance,
> > which is what most people call a "document." It declares information
> > necessary to process the instance, but its content is not part of
> > the instance.
> Here's something I guess I should clarify for myself, so I know whether
> my expectations are unreasonable from the get-go. If the prolog (and
> thus the DOCTYPE declaration) is not considered part of the instance,
> does that make it valid behavior for the parse-and-save to strip it out
> when I add the document to the Xindice repository?
There are those who now believe in the InfoSet as the bible for how all
XML processes should be performed. What seems obvious is that there is
no single model for how all XML content should be processed, and how
XML content is managed in an XML database is a sort of "metacategory"
since the Xindice repository introduces issues that no single XML
document has to deal with. The InfoSet introduces as many problems
as it solves. Because Xindice is a somewhat unusual application in
that its database is one big document, those things that don't fit into
the InfoSet mode of processing don't have an easy fit into Xindice.
This is one reason why I developed the XNode API. I wanted to preserve
metadata such as the prolog information, because unless one stores the
content as a big CDATA section (which rather eliminates its usefulness
as XML while in the database), it's necessary to eliminate those parts
of a document that can't be stored *as a node within a larger document*.
There might (for example) be a need to store the markup declarations
occurring in an internal subset in order for a document to be correctly
processed -- how would this be accomplished? (Actually, making
functional "redeclarations" of those found in the subset is
rather interesting; i.e., a translation of the DTD grammar to RELAX
would suffice quite well in this instance.)
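To make the "preserve the prolog" idea concrete, here is a minimal sketch of
capturing the DOCTYPE at parse time so it can be kept alongside the stored
node. It uses the standard SAX2 LexicalHandler callback (via DefaultHandler2);
the class name DoctypeCapture and its fields are hypothetical and are not part
of Xindice or XNode. It does not capture the internal subset's declarations,
which would need a DeclHandler as well.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical helper (not Xindice or XNode API): records the DOCTYPE seen while parsing. */
final class DoctypeCapture extends DefaultHandler2 {
    String name;      // root element name given in the DOCTYPE, or null if there was none
    String publicId;  // public identifier, or null
    String systemId;  // system identifier, or null

    @Override
    public void startDTD(String name, String publicId, String systemId) {
        // Called by the parser when it reaches the DOCTYPE declaration.
        this.name = name;
        this.publicId = publicId;
        this.systemId = systemId;
    }

    @Override
    public InputSource resolveEntity(String name, String publicId,
                                     String baseURI, String systemId) {
        // Assumption for this sketch: hand back an empty entity for every
        // external reference, so the parser never fetches an external DTD.
        return new InputSource(new StringReader(""));
    }

    /** Parses the given XML text and returns the captured DOCTYPE details. */
    static DoctypeCapture parse(String xml) {
        try {
            DoctypeCapture capture = new DoctypeCapture();
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Standard SAX2 property id for registering a LexicalHandler.
            reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", capture);
            reader.setEntityResolver(capture);
            reader.setContentHandler(capture);
            reader.parse(new InputSource(new StringReader(xml)));
            return capture;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

Calling DoctypeCapture.parse(...) on a document with a DOCTYPE leaves the
declared root name and identifiers in the returned object, ready to be stored
separately from the instance.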
> That is, is this one big bug (DOCTYPE replaced by the comments), or a
> bug and a misunderstanding (incorrectly inserts comments, but correctly
> removes DOCTYPE)? If I were to rebuild Xindice with a custom subclass of
> XmlSerializer, as you suggest, would I still find that the prolog goes away?
> (Yeah, that's probably a pretty rank newbie question. :-( )
Well, it's a commonly misunderstood question even among those who've
spent a lot of time with XML, as there's been a lot of confusion from
the W3C on the issue. They've struggled since the beginning in
dealing with XML's history as SGML, wishing to ignore or kill it;
"DTD" is a dirty word there, as are ENTITY and NOTATION declarations,
etc. They've gone to a lot of trouble to reinvent many of the
existing XML features (such as their XInclude, which is really already
available using entities and catalogs and has been functional in
SGML systems for many years). Requests for features that would allow
DTDs to compete on a level playing field (such as allowing notation
declarations on attribute content, which is now allowed in SGML)
have been ignored. Not invented here, apparently.
Since Xindice can use processors that are namespace-aware, there
are a number of ways of preserving DOCTYPE (and any processing
instructions that happen to be present, such as those indicating
XML Schemas or stylesheets). My approach with XNode was to create
an extensible means of storing named properties, so that this type
of content had a place in the database. Now we likely need to
standardize the property names in order to interoperably share
databases or nodes, but maybe this can wait a bit since sharing
databases is not (at least insofar as I understand) a current
requirement.
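As a rough illustration of the named-properties idea (and emphatically not
the XNode API itself, which is documented at the URL below), a node could
carry a small property map next to its content and use it to rebuild the
DOCTYPE on the way out. The class PropertyNode and the property names
"doctype.name", "doctype.public" and "doctype.system" are invented for this
sketch; real names would be subject to the standardization mentioned above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: stored content plus named properties for prolog data. */
final class PropertyNode {
    // Property names made up for this example; not standardized anywhere.
    static final String DOCTYPE_NAME = "doctype.name";
    static final String DOCTYPE_PUBLIC = "doctype.public";
    static final String DOCTYPE_SYSTEM = "doctype.system";

    private final String content;  // the serialized document instance
    private final Map<String, String> properties = new LinkedHashMap<>();

    PropertyNode(String content) {
        this.content = content;
    }

    void setProperty(String name, String value) {
        properties.put(name, value);
    }

    String getProperty(String name) {
        return properties.get(name);
    }

    /** Read-only view of all stored properties, in insertion order. */
    Map<String, String> getProperties() {
        return Collections.unmodifiableMap(properties);
    }

    /** Rebuilds the DOCTYPE (if one was recorded) in front of the content. */
    String serialize() {
        String name = properties.get(DOCTYPE_NAME);
        if (name == null) {
            return content;  // no DOCTYPE was recorded for this node
        }
        StringBuilder sb = new StringBuilder("<!DOCTYPE ").append(name);
        String pub = properties.get(DOCTYPE_PUBLIC);
        String sys = properties.get(DOCTYPE_SYSTEM);
        // Assumes the identifiers contain no double quotes.
        if (pub != null && sys != null) {
            sb.append(" PUBLIC \"").append(pub)
              .append("\" \"").append(sys).append('"');
        } else if (sys != null) {
            sb.append(" SYSTEM \"").append(sys).append('"');
        }
        return sb.append(">\n").append(content).toString();
    }
}
```

Keeping the prolog data in properties rather than in the node content means
the node can still sit inside a larger document, while the original
declaration can be restored when the node is retrieved on its own.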
I will be releasing the code to the XNode API within the week, and
sometime this Spring an implementation within my application. Watch
the thread on XNode for further details. The API documentation is at
http://kmi.open.ac.uk/projects/ceryle/docs/api/org/apache/xnode/package-summary.html
Murray
......................................................................
Murray Altheim <mailto:m.altheim @ open.ac.uk>
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK
In the evening
The rice leaves in the garden
Rustle in the autumn wind
That blows through my reed hut. -- Minamoto no Tsunenobu