By default, Xerces is a conforming "validating XML processor", i.e. it always does what the XML specification requires of validating processors. This is true even if the validation feature is set to false. It is a common misconception that this feature turns Xerces into a "non-validating XML processor" but this is not the case.
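
A minimal sketch of this point, assuming the JAXP parser bundled with the JDK (a Xerces derivative) behaves like Xerces here. With the validation feature set to false, the parser still processes the DTD declarations, so a default attribute value declared in the internal subset is still reported to the application. The class name, the sample document and the "color" attribute are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ValidationOffDemo {
    // Hypothetical document: <doc> declares a default value for "color"
    // in the internal DTD subset but does not specify the attribute itself.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE doc [\n"
      + "  <!ELEMENT doc EMPTY>\n"
      + "  <!ATTLIST doc color CDATA \"red\">\n"
      + "]>\n"
      + "<doc/>";

    // Parses XML with validation off and returns the "color" attribute
    // that the parser reports on <doc>.
    static String parseColor() throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Standard SAX2 feature ID; false only turns off the reporting
        // of validity constraint failures.
        reader.setFeature("http://xml.org/sax/features/validation", false);

        final String[] color = new String[1];
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("doc".equals(qName)) {
                    color[0] = atts.getValue("color");
                }
            }
        });
        reader.parse(new InputSource(new StringReader(XML)));
        return color[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("color = " + parseColor());
    }
}
```

The attribute value comes from the ATTLIST declaration, not from the document body, which is the kind of DTD-derived information that turning validation off does not suppress.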
The XML spec, in Section 5.1, states:

[Definition: Validating processors must, at user option, report violations of the constraints expressed by the declarations in the DTD, and failures to fulfill the validity constraints given in this specification.] To accomplish this, validating XML processors must read and process the entire DTD and all external parsed entities referenced in the document.

The "at user option" is what the "validation" feature controls. When you have validation set to false, Xerces stops reporting validity constraint failures, but it does not stop following any of the other requirements of validating processors; it is still one.

That said, there are several other features that can be used to change the behavior of Xerces to follow that of a non-validating XML processor. You can tell it not to read external entities, whether general entities, parameter entities, or the external DTD subset. When you use these features you are explicitly directing the parser to stop conforming to the requirements of a validating processor.

Why have this behavior? The information that is reported to the application is the "infoset" of the document. When you start to build more complex operations like schema validation, XSLT transformations, etc., they rely on that infoset to perform their operations. The "most complete" infoset for a document is obtained by reading all of the entities referenced by that document, including all of the markup declarations. When you do not read these entities you produce a different infoset, which could change the behavior of the other operations that depend upon that infoset. This seems undesirable, and so we try to give the application the most complete infoset we can unless the application tells us to do otherwise.

Getting back to the case of the standalone document, there is a slight inconsistency in the description. It says that when the document is standalone there are no markup declarations that affect the "information" passed to the application.
It is probably the case that information is meant to cover things like character data, attribute values, etc. But what about the declarations themselves? When I have a SAX application that registers a handler to receive markup declarations should I not receive those callbacks when a standalone document references an external DTD just because those declarations will not change the other information passed to the application?

Clearly there is no one answer that is right for everyone. In recognition of this we try to obtain all of the information that we can report to the application and provide features to allow the application to limit that information to meet its needs. It sounds like you may have hit upon a new one, which is to not read external entities in the DTD when a document says that it is standalone. This would produce a different infoset than you would get if you read those declarations, but I would agree that if this is an acceptable behavior for the application then it would be possible for Xerces to support such a feature.

Regards,

Glenn

Simon writes:

Hi,

I recently raised a bug against xerces' XMLSerializer class regarding the "standalone" attribute:

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14531

Glenn Marcy's comment on this bug has left me rather confused. I went back to the W3C XML 1.0 spec, and am now even more confused :-)

I am therefore continuing this on the user's list rather than the bug comments or the dev list. Possibly this discussion belongs to a general xml-users list rather than the xerces-users list. If the general consensus here is that the XMLSerializer class is indeed doing the right thing for the standalone attribute, I'll take this email to a more appropriate discussion list.

Glenn, I have CC'd you directly on this in case you are not on the user's list and are willing to help me out here. I will leave you off any future emails on this topic unless you indicate otherwise.
--------------------------------------

The original bug raised by me [excerpt]:

> Currently, the XMLSerializer class outputs the "standalone"
> attribute of the <?xml ...?> prolog if-and-only-if the public
> and system identifiers being output in the DOCTYPE tag are null.
>
> It seems to me to be perfectly valid to have standalone="yes"
> AND public/system IDs. The relevant section of the xml spec is:
>   http://www.w3.org/TR/REC-xml#sec-rmd
> No mention is made here of forbidding standalone=yes when a
> DTD ID is given in the DOCTYPE tag.

--------------------------------------

Glenn Marcy commented on the original bug:

>> standalone="yes" --> the DTD must be read if-and-only-if validation
>> is enabled. (ie DTD can be ignored if validation disabled, a good
>> optimisation!)

> This is incorrect. The document might not actually be
> standalone, which would only be a failure of a validity
> constraint, which a non-validating processor would not check.
> Therefore, the document can contain references to external markup
> declarations that change the infoset of the document, like default
> values for attributes that are not specified. If a non-validating
> processor reads those declarations then it is obligated to act on
> them. The fact that the standalone declaration is in error does not
> change this.
>
> Now obviously a non-validating processor is not obligated to read
> external markup declarations at all, but Xerces already has features
> defined to control this behavior. There is nothing in the XML
> specification that says that the presence of standalone="yes"
> should cause non-validating processors to change
> their behavior with respect to reading external entities.

I think we have very different interpretations of what "standalone" means - which probably means mine is wrong.
But what I understand it to mean, when embedded within a source xml document, is:

"Parser, I *promise* you that there is nothing in the DTD specified in the DOCTYPE tag (or any other external entity) which will affect the results of parsing this file. If you are a validating parser, then you will need to process external entities anyway in order to check the document syntax, but if you are not a validating parser, then there is no need to read the DTD."

Section 2.9 of the XML spec says:

"In a standalone document declaration, the value "yes" indicates that there are no external markup declarations which affect the information passed from the XML processor to the application."

Surely this means that by specifying standalone='yes' and validation=no, xml parsing will be faster because the parser can completely ignore all external markup declarations?

Of course, if the XML document containing the standalone='yes' statement is lying (there are indeed things in the DTD which affect the created document, like default attribute values) then the result of parsing will be incorrect.

> This is incorrect. The document might not actually be
> standalone

But that's not the parser's problem, is it? If I am wrong, I get what I deserve. And if I write an xml document, and set standalone='yes' because I know the DTD doesn't define any default attribute values etc, and I want the performance benefits that come from allowing the parser to skip the DTD processing, then why should the parser read the DTD anyway in an attempt to prove me a liar?

Yes, setting "standalone=yes" is therefore a dangerous thing to do; if the DTD does define something significant then the results of parsing are incorrect. But that's life, no?

> Now obviously a non-validating processor is not obligated to read
> external markup declarations at all ...

I thought that a non-validating parser still had to read external markup to determine default attributes, etc.
It just doesn't need to report any violation of the xml structure. That has certainly been my experience with Xerces in the past; disabling validation then parsing a file with a DOCTYPE containing a SYSTEM entry gives me errors about being unable to find the file. In fact I have had to define an EntityResolver which returns empty DTDs in order to mimic "standalone=yes" behaviour when parsing XML from our customers where we don't have a copy of the DTD locally (and don't need one because there are no default attributes etc in the DTD).

Am I misunderstanding something here? Any comments welcome!

Regards,

Simon
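
For reference, a minimal sketch of the EntityResolver workaround described in the message above: every external entity request is answered with an empty stream, so the parser never tries to fetch the DTD. It assumes the JAXP parser bundled with the JDK; the class name, the sample document and the unreachable DTD URL are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class EmptyDtdResolverDemo {
    // Hypothetical document whose external DTD is not available locally.
    static final String XML =
        "<?xml version=\"1.0\" standalone=\"yes\"?>\n"
      + "<!DOCTYPE doc SYSTEM \"http://example.invalid/missing.dtd\">\n"
      + "<doc>hello</doc>";

    // Parses XML without touching the external DTD and returns the
    // character data found in the document.
    static String parseText() throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setFeature("http://xml.org/sax/features/validation", false);

        // Resolve every external entity (including the external DTD subset)
        // to an empty input. This is only safe if the DTD really contributes
        // nothing, e.g. no default attribute values or entity declarations.
        reader.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));

        StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(XML)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseText());
    }
}
```

Xerces also documents a feature, http://apache.org/xml/features/nonvalidating/load-external-dtd, that can be set to false with setFeature when validation is off; that may be one of the features Glenn refers to, and could be tried in place of the resolver.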
