re: bug 14531, XML Serialization and standalone

Glenn Marcy Thu, 14 Nov 2002 07:12:51 -0800

By default, Xerces is a conforming "validating XML processor",
i.e. it always does what the XML specification requires of
validating processors.  This is true even if the validation
feature is set to false.  It is a common misconception that
this feature turns Xerces into a "non-validating XML processor"
but this is not the case.

The XML spec, in Section 5.1, states:

  [Definition: Validating processors must, at user option, report
  violations of the constraints expressed by the declarations in
  the DTD, and failures to fulfill the validity constraints given
  in this specification.] To accomplish this, validating XML
  processors must read and process the entire DTD and all external
  parsed entities referenced in the document.

The "at user option" is what the "validation" feature controls.
When validation is set to false, Xerces stops reporting validity
constraint failures.  It does not stop following any of the other
requirements of validating processors; it still is one.

That said, there are several other features that can be used to
change the behavior of Xerces to follow that of a non-validating
XML processor.  You can tell it to not read external entities,
either general, parameter or the external DTD subset.  When you
use these features you are explicitly directing the parser to
stop conforming to the requirements of a validating processor.
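
As a rough sketch of what that configuration might look like through
JAXP: the three feature URIs below are the SAX2 and Xerces names for the
general-entity, parameter-entity and external-DTD-subset switches. The
class and method names are made up for illustration, and the sketch
assumes the JAXP implementation in use is Xerces (or the Xerces-derived
parser bundled with the JDK), since other parsers may not recognise the
last feature.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class NonValidatingConfig {
    // SAX2 standard features controlling external entity processing.
    static final String EXTERNAL_GENERAL =
        "http://xml.org/sax/features/external-general-entities";
    static final String EXTERNAL_PARAMETER =
        "http://xml.org/sax/features/external-parameter-entities";
    // Xerces-specific feature controlling the external DTD subset.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Hypothetical helper: a factory whose parsers skip all external entities.
    static SAXParserFactory newFactory() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);                  // stop reporting validity errors
        factory.setFeature(EXTERNAL_GENERAL, false);   // skip external general entities
        factory.setFeature(EXTERNAL_PARAMETER, false); // skip external parameter entities
        factory.setFeature(LOAD_EXTERNAL_DTD, false);  // skip the external DTD subset
        return factory;
    }

    // Parses the given document text, discarding all events.
    static void parse(String xml) throws Exception {
        newFactory().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
    }
}
```

With all three features off, a document whose DOCTYPE names a DTD that
cannot be found should still parse, because the parser never tries to
open it.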

Why have this behavior?  The information that is reported to the
application is the "infoset" of the document.  When you start to
build more complex operations like schema validation, XSLT
transformations, etc., they rely on that infoset to perform
their operations.  The "most complete" infoset for a document is
obtained by reading all of the entities referenced by that
document, including all of the markup declarations.  When you do
not read these entities you produce a different infoset, which
could change the behavior of the other operations that depend
upon that infoset.  This seems undesirable and so we try to give
the application the most complete infoset we can unless the
application tells us to do otherwise.
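
A small sketch of how the infoset can differ: the same document is parsed
twice with validation off, once reading the external DTD subset and once
ignoring it. The document, the DTD text and the class name are invented
for the example, and the DTD is served from a string through an
EntityResolver so that nothing has to exist on disk. It assumes a Xerces
(or Xerces-derived) parser behind JAXP.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class InfosetDifference {
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Illustrative document and DTD; the DTD declares a default attribute value.
    static final String XML = "<!DOCTYPE doc SYSTEM \"doc.dtd\"><doc/>";
    static final String DTD = "<!ELEMENT doc EMPTY><!ATTLIST doc kind CDATA \"plain\">";

    // Returns the value of the "kind" attribute on the root element,
    // or the empty string if the attribute is absent from the tree.
    static String kindAttribute(boolean readExternalDtd) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        factory.setFeature(LOAD_EXTERNAL_DTD, readExternalDtd);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Serve the DTD text from memory whenever an external entity is requested.
        builder.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader(DTD)));
        Document doc = builder.parse(new InputSource(new StringReader(XML)));
        return doc.getDocumentElement().getAttribute("kind");
    }
}
```

When the DTD is read, the root element carries the defaulted attribute;
when it is ignored, the attribute is simply not there, so anything built
on top of the parse sees a different document.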

Getting back to the case of the standalone document, there is
a slight inconsistency in the description.  It says that when
the document is standalone there are no markup declarations
that affect the "information" passed to the application.  It
is probably the case that information is meant to cover things
like character data, attribute values, etc.  But what about the
declarations themselves?  When I have a SAX application that
registers a handler to receive markup declarations should I not
receive those callbacks when a standalone document references
an external DTD just because those declarations will not change
the other information passed to the application?
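
For reference, a minimal sketch of such a SAX application, using the
standard SAX2 declaration-handler property and the DeclHandler interface.
The class name and the strings it records are illustrative only. The
declarations sit in the internal subset here purely so the sketch is
self-contained; the question above concerns the same callbacks when the
declarations come from an external DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DeclHandler;
import org.xml.sax.helpers.DefaultHandler;

public class DeclarationListener {
    // Parses the document and returns a description of each declaration reported.
    static List<String> collectDeclarations(String xml) throws Exception {
        final List<String> seen = new ArrayList<>();
        DeclHandler handler = new DeclHandler() {
            public void elementDecl(String name, String model) {
                seen.add("element " + name);
            }
            public void attributeDecl(String eName, String aName, String type,
                                      String mode, String value) {
                seen.add("attribute " + eName + "/" + aName);
            }
            public void internalEntityDecl(String name, String value) {
                seen.add("entity " + name);
            }
            public void externalEntityDecl(String name, String publicId,
                                           String systemId) {
                seen.add("external entity " + name);
            }
        };
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Standard SAX2 property for receiving DTD declaration events.
        reader.setProperty("http://xml.org/sax/properties/declaration-handler", handler);
        reader.setContentHandler(new DefaultHandler());
        reader.parse(new InputSource(new StringReader(xml)));
        return seen;
    }
}
```

The handler sees the declarations themselves, which is information an
application can receive even when those declarations change nothing about
the character data or attribute values.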

Clearly there is no one answer that is right for everyone.  In
recognition of this we try to obtain all of the information that
we can report to the application and provide features to allow
the application to limit that information to meet its needs.
It sounds like you may have hit upon a new one, which is to not
read external entities in the DTD when a document says that it
is standalone.  This would produce a different infoset than you
would get if you read those declarations, but I would agree that
if this is an acceptable behavior for the application then it
would be possible for Xerces to support such a feature.

Regards,
Glenn

Simon writes:

Hi,

I recently raised a bug against Xerces' XMLSerializer class regarding
the "standalone" attribute:
 http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14531

Glenn Marcy's comment on this bug has left me rather confused.
I went back to the W3C XML 1.0 spec, and am now even more confused :-)

I am therefore continuing this on the user's list rather than the bug
comments or the dev list.

Possibly this discussion belongs to a general xml-users list rather than
the xerces-users list. If the general consensus here is that the
XMLSerializer class is indeed doing the right thing for the standalone
attribute, I'll take this email to a more appropriate discussion list.

Glenn, I have CC'd you directly on this in case you are not on the
user's list and are willing to help me out here. I will leave you off
any future emails on this topic unless you indicate otherwise.

--------------------------------------

The original bug raised by me [excerpt]:

> Currently, the XMLSerializer class outputs the "standalone"
> attribute of the <?xml ...?> prolog if-and-only-if the public
> and system identifiers being output in the DOCTYPE tag are null.
>
> It seems to me to be perfectly valid to have standalone="yes"
> AND public/system IDs. The relevant section of the xml spec is:
> http://www.w3.org/TR/REC-xml#sec-rmd
> No mention is made here of forbidding standalone=yes when a
> DTD ID is given in the DOCTYPE tag.

--------------------------------------

Glenn Marcy commented on the original bug:

>> standalone="yes" --> the DTD must be read if-and-only-if validation
>> is enabled. (ie DTD can be ignored if validation disabled, a good
>> optimisation!)

> This is incorrect.  The document might not actually be
> standalone, which would only be a failure of a validity
> constraint, which a non-validating processor would not check.
>  Therefore, the document can contain references to external markup
> declarations that change the infoset of the document, like default
> values for attributes that are not specified.  If a non-validating
> processor reads those declarations then it is obligated to act on
> them.  The fact that the standalone declaration is in error does not
> change this.
>
> Now obviously a non-validating processor is not obligated to read
> external markup declarations at all, but Xerces already has features
> defined to control this behavior.  There is nothing in the XML
> specification that says that the presence of standalone="yes"
> should cause non-validating processors to change
> their behavior with respect to reading external entities.

I think we have very different interpretations of what "standalone"
means - which probably means mine is wrong. But what I understand it to
mean, when embedded within a source xml document, is:

"Parser, I *promise* you that there is nothing in the DTD specified in
the DOCTYPE tag (or any other external entity) which will affect the
results of parsing this file. If you are a validating parser, then you
will need to process external entities anyway in order to check the
document syntax, but if you are not a validating parser, then there is
no need to read the DTD."

Section 2.9 of the XML spec says:

"In a standalone document declaration, the value "yes" indicates that
there are no external markup declarations which affect the information
passed from the XML processor to the application. "

Surely this means that by specifying standalone='yes' and validation=no,
xml parsing will be faster because the parser can completely ignore all
external markup declarations?

Of course, if the XML document containing the standalone='yes' statement
is lying (there are indeed things in the DTD which affect the created
document, like default attribute values) then the result of parsing will
be incorrect.

> This is incorrect.  The document might not actually be
> standalone

But that's not the parser's problem, is it? If I am wrong, I get what I
deserve. And if I write an xml document, and set standalone='yes'
because I know the DTD doesn't define any default attribute values etc,
and I want the performance benefits that come from allowing the parser
to skip the DTD processing, then why should the parser read the DTD
anyway in an attempt to prove me a liar?

Yes, setting "standalone=yes" is therefore a dangerous thing to do; if
the DTD does define something significant then the results of parsing
are incorrect. But that's life, no?

> Now obviously a non-validating processor is not obligated to read
> external markup declarations at all ...

I thought that a non-validating parser still had to read external markup
to determine default attributes, etc. It just doesn't need to report any
violation of the xml structure. That has certainly been my experience
with Xerces in the past; disabling validation then parsing a file with a
DOCTYPE containing a SYSTEM entry gives me errors about being unable to
find the file. In fact I have had to define an EntityResolver which
returns empty DTDs in order to mimic "standalone=yes" behaviour when
parsing XML from our customers where we don't have a copy of the DTD
locally (and don't need one because there are no default attributes etc
in the DTD).
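
For anyone wanting to try the same workaround, here is a rough sketch of
that kind of resolver. It is not the exact code described above: the
class name, the rule of matching system IDs ending in ".dtd", and the
example URL in the test are all assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class EmptyDtdResolver implements EntityResolver {
    // Hands the parser an empty entity in place of any DTD it asks for.
    // Returning null tells the parser to resolve the entity as usual.
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId != null && systemId.endsWith(".dtd")) {
            return new InputSource(new StringReader(""));
        }
        return null;
    }

    // Parses the document with the resolver installed and returns the
    // name of the root element.
    static String parseRootName(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(new EmptyDtdResolver());
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getNodeName();
    }
}
```

The parser still "reads" the external subset, but what it reads is empty,
so no file or network access takes place. Any default attribute values or
entity declarations in the real DTD are lost, which is only safe when the
DTD is known not to contain any.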

Am I misunderstanding something here? Any comments welcome!

Regards,

Simon
