Sixten Otto wrote:
On Tuesday, April 9, 2002, at 04:56 AM, Murray Altheim wrote:
Another option would be to preprocess the DocBook DTD using James
Clark's SP tools, which could strip the comments for you and reduce
the size of the DTD used during processing.
This part seems directed at me, since I'm the one dealing with DocBook.
Yes.
This is not a bad idea, but doesn't really solve the problem, which is
that these comments are appearing in my documents in place of the
DOCTYPE declaration that should be there. Granted, maybe I'm not
familiar enough with the formal XML specs, but this certainly doesn't
seem a valid transformation for the round-trip through Xindice to be
making (though, again, it seems like Xerces' fault). Even if I could
strip out or ignore all of the comments, I still wouldn't be getting
back the same document.
I'm not sure why this is happening, but let me explain one bit about
DTDs and such. In SGML and XML the DTD is considered part of a
document's *prolog*, which in markup terms is essentially
anything that occurs before the opening "<" of the document
element. The prolog contains the DOCTYPE declaration, which points
to the external declaration subset (the DTD) and/or holds an internal
subset containing other declarations (a square-bracket-bounded area
that optionally occurs just before the closing ">" of DOCTYPE, and
whose declarations are actually processed prior to the external subset).
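
To make the prolog/instance split concrete, here is a minimal sketch
using only the JAXP DOM parser that ships with the JDK (not Xindice).
The sample document, the class name PrologDemo and its helper methods
are all made up for illustration. It parses a document whose DOCTYPE
has an internal subset and then looks at what ends up under the
document element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PrologDemo {
    // Hypothetical sample: everything before <note> is the prolog.
    static final String SAMPLE =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE note [\n"
        + "  <!-- a comment inside the internal subset -->\n"
        + "  <!ELEMENT note (#PCDATA)>\n"
        + "  <!ENTITY who \"Sixten\">\n"
        + "]>\n"
        + "<note>Hello, &who;</note>\n";

    // Parse with the JDK's default (non-validating) DOM parser.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Name given in the DOCTYPE declaration (part of the prolog).
    static String doctypeName(String xml) {
        return parse(xml).getDoctype().getName();
    }

    // Tag name of the document element (start of the instance).
    static String documentElementName(String xml) {
        return parse(xml).getDocumentElement().getTagName();
    }

    // Text of the document element, with the entity from the
    // internal subset already expanded by the parser.
    static String documentElementText(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("doctype:          " + doctypeName(SAMPLE));
        System.out.println("internal subset:  "
            + doc.getDoctype().getInternalSubset());
        System.out.println("document element: " + documentElementName(SAMPLE));
        System.out.println("instance text:    " + documentElementText(SAMPLE));
    }
}
```

The declarations in the internal subset affect how the instance is
read (the entity gets expanded), but the comment inside the subset
should not turn up as content of the note element.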
Now, the prolog is not considered part of the document instance,
which is what most people call a "document." It declares information
necessary to process the instance, but its content is not part of
the instance. So the fact that comments, or any other part of the
external subset, show up in a document instance points to a bug in
Xerces or, maybe even more likely, in the serializer used to provide
the markup to Xindice.
The more I think about this, the more it seems the serializer is at
fault. For my own project I've written my own XML serializer because
I had so many problems with the Xerces one, which is probably why
I've not run into this problem (and I use DocBook). I'll bet the
Apache XML serializer is passing the comments it finds in the DTD on
to the writer it uses.
What you could do is subclass org.apache.xml.serialize.XMLSerializer
and kill its ability to write comments. Or report this bug to the
Xerces 2 team, who have been pretty responsive in the past when
told about these things.
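
As a rough sketch of the idea, and not the XMLSerializer subclass
itself (that needs the Xerces jar on the classpath), the following
uses only the SAX classes in the JDK. It relies on the standard
LexicalHandler callbacks startDTD/endDTD to tell comments inside the
DTD apart from comments elsewhere, which is the distinction a
serializer would need in order to drop the former. The class name
DtdCommentFilter and its fields are hypothetical, and I have not run
this against Xindice.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

class DtdCommentFilter extends DefaultHandler2 {
    // True between startDTD() and endDTD(), i.e. while the parser is
    // reporting the internal and external subsets.
    private boolean inDtd = false;

    // Comments seen inside the DTD: a writer should not emit these.
    final List<String> dtdComments = new ArrayList<>();
    // Comments seen anywhere else: safe to pass on to the writer.
    final List<String> otherComments = new ArrayList<>();

    @Override
    public void startDTD(String name, String publicId, String systemId) {
        inDtd = true;
    }

    @Override
    public void endDTD() {
        inDtd = false;
    }

    @Override
    public void comment(char[] ch, int start, int length) {
        String text = new String(ch, start, length);
        if (inDtd) {
            dtdComments.add(text);
        } else {
            otherComments.add(text);
        }
    }

    // Parse the given markup and return the handler with both lists filled.
    static DtdCommentFilter scan(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            DtdCommentFilter handler = new DtdCommentFilter();
            // Standard SAX2 property for receiving comment/DTD events.
            parser.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("scan failed", e);
        }
    }

    public static void main(String[] args) {
        DtdCommentFilter h = scan(
            "<!DOCTYPE a [<!-- from the DTD --><!ELEMENT a ANY>]>"
            + "<a><!-- from the instance --></a>");
        System.out.println("dropped: " + h.dtdComments);
        System.out.println("kept:    " + h.otherComments);
    }
}
```

A subclass of the Xerces serializer would apply the same flag in its
own comment handling, assuming it gets the DTD start/end events at all.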
Murray
......................................................................
Murray Altheim <mailto:m.altheim @ open.ac.uk>
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK
In the evening
The rice leaves in the garden
Rustle in the autumn wind
That blows through my reed hut. -- Minamoto no Tsunenobu