If I understand correctly, the XML Information Set [1] describes what a
generic processing component is supposed to preserve as the information
content of an XML document.  It is not completely prescriptive ("does not
attempt to be exhaustive... nor does it constitute a minimum set of
information that must be returned...").  The Infoset includes some of the
content of the document type declaration.  For many documents, the Infoset is
not sufficient to produce an exact duplicate of the original input when the
document is recreated from the parsed form, so exact roundtripping cannot be
expected.  Appendix D of [1] lists what is missing.
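
To make the roundtripping point concrete, here is a minimal sketch.  It does
not use Xindice at all; it assumes Python's standard xml.etree.ElementTree as
a stand-in for a generic parse-and-save component, and the sample document is
made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: a DOCTYPE with an internal subset,
# plus a comment inside the root element.
source = (
    "<!DOCTYPE note [\n"
    "  <!ELEMENT note (#PCDATA)>\n"
    "]>\n"
    "<note><!-- reminder for humans -->hello</note>"
)

# Parse into a tree, then write the tree back out as text.
root = ET.fromstring(source)
roundtripped = ET.tostring(root, encoding="unicode")

print(roundtripped)
# The element content survives, but this parser never stored the DOCTYPE
# (with its internal subset) or the comment, so neither can be written back.
print("<!DOCTYPE" in roundtripped)
print("<!--" in roundtripped)
```

Other parsers keep more or less than this one does, which is exactly why the
behavior needs to be documented per implementation.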

If the document is validated against an XML Schema, further information
may be retained (the PSVI, or Post-Schema-Validation Infoset).  Currently,
this is academic, since I believe Xindice has problems with documents
specified by XML Schemas.  It is a concern for the future.

<Opinion>I would regard it as a requirement of a database of XML documents
that at least a canonical form of every document could be inserted into and
extracted from the database idempotently.  I would further require that any
validation constraints (such as element content models in the internal
subset), processing instructions, comments acting as documentation for
humans, parsed entities, and more should be part of this canonical form.
This already goes beyond the Infoset specification.

At minimum, Xindice should adhere to the Infoset, and carefully document
what is not preserved.  This will limit the class of XML documents that can
be meaningfully roundtripped, but at least users will be able to know what
to expect.  If the Infoset were the standard for the canonical form of an
XML document, you would not be able to roundtrip a document whose DOCTYPE
declaration included an internal subset.
</Opinion>
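
As a rough illustration of the idempotency requirement, the sketch below
checks that canonicalizing a document twice gives the same text as
canonicalizing it once.  It assumes Python 3.8+ and its built-in
xml.etree.ElementTree.canonicalize() (an implementation of W3C Canonical XML,
which is not the richer canonical form I am asking for above); the sample
document is hypothetical.

```python
import xml.etree.ElementTree as ET  # canonicalize() needs Python 3.8+

# Hypothetical sample document: internal subset plus attributes that are
# deliberately not in alphabetical order.
source = (
    "<!DOCTYPE note [\n"
    "  <!ELEMENT note (#PCDATA)>\n"
    "]>\n"
    '<note b="2" a="1">hello</note>'
)

# First pass: original text -> canonical form.
canon_once = ET.canonicalize(xml_data=source)

# Second pass: canonical form -> canonical form again.
canon_twice = ET.canonicalize(xml_data=canon_once)

print(canon_once)
# Idempotent: storing and re-extracting the canonical form changes nothing.
print(canon_once == canon_twice)
# But the internal subset is not part of this particular canonical form.
print("<!DOCTYPE" in canon_once)
```

So the idempotency property is easy to get; the harder question is what the
canonical form is allowed to leave out.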

[1] http://www.w3.org/TR/xml-infoset/

Jeff
----- Original Message -----
From: "Sixten Otto" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, April 09, 2002 11:03 AM
Subject: Re: DTD expansion?


> At 04:07 PM 4/9/02 +0000, Murray Altheim wrote:
> >Now, the prolog is not considered part of the document instance,
> >which is what most people call a "document." It declares information
> >necessary to process the instance, but its content is not part of
> >the instance.
>
> Here's something I guess I should clarify for myself, so I know whether my
> expectations are unreasonable from the get-go. If the prolog (and thus the
> DOCTYPE declaration) is not considered part of the instance, does that make
> it valid behavior for the parse-and-save to strip it out when I add the
> document to the Xindice repository?

