http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14378

Error parsing XML document with a leading white space character.

[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |INVALID

------- Additional Comments From [EMAIL PROTECTED] 2002-11-11 04:33 -------

This is part of the XML specification and all conforming processors must have this behavior. The spec says:

2.1 Well-Formed XML Documents

[Definition: A textual object is a well-formed XML document if:]

1. Taken as a whole, it matches the production labeled document.
2. ...

Productions:

[1]  document ::= prolog element Misc*
[22] prolog   ::= XMLDecl? Misc* (doctypedecl Misc*)?
[23] XMLDecl  ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'

As you can see, there is no whitespace permitted in these productions before the XML declaration. An XML declaration, if present, must be the very first thing in the document, and the very beginning of the XML declaration is the literal character sequence '<?xml'.

The rationale for this is well described in Appendix F:

"The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use--which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases."

Therefore, the restriction that the XML declaration appear first in the document is quite intentional.
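To make the Appendix F argument concrete, here is a minimal sketch of that kind of autodetection, written in Python purely for illustration. It is not the Xerces code. The function name `sniff_encoding_family` and the label strings are made up for this example, and the byte patterns are a simplified subset of the cases Appendix F discusses.

```python
# Illustrative sketch only: guess an encoding family from the first bytes of
# an entity, in the spirit of XML 1.0 Appendix F. Names and labels here are
# hypothetical; they are descriptive strings, not codec names.

FALLBACK = "no recognizable declaration; UTF-8 assumed"

# Byte order marks. Four-byte marks are listed before the two-byte marks
# that share their prefix, so the longer match wins.
_BOMS = [
    (b"\x00\x00\xfe\xff", "UCS-4 big-endian (BOM)"),
    (b"\xff\xfe\x00\x00", "UCS-4 little-endian (BOM)"),
    (b"\xef\xbb\xbf", "UTF-8 (BOM)"),
    (b"\xfe\xff", "UTF-16 big-endian (BOM)"),
    (b"\xff\xfe", "UTF-16 little-endian (BOM)"),
]

# Without a BOM, the only clue is how the literal '<?xm' (or '<' / '<?' in
# the wider encodings) is laid out in the first four bytes.
_DECL_STARTS = [
    (b"\x00\x00\x00\x3c", "UCS-4 big-endian"),
    (b"\x3c\x00\x00\x00", "UCS-4 little-endian"),
    (b"\x00\x3c\x00\x3f", "UTF-16 big-endian"),
    (b"\x3c\x00\x3f\x00", "UTF-16 little-endian"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible (UTF-8, ISO 8859, ...)"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]


def sniff_encoding_family(data: bytes) -> str:
    """Return a descriptive label for the encoding family of `data`."""
    for prefix, label in _BOMS + _DECL_STARTS:
        if data.startswith(prefix):
            return label
    return FALLBACK


if __name__ == "__main__":
    print(sniff_encoding_family(b'<?xml version="1.0"?><r/>'))
    print(sniff_encoding_family(b' <?xml version="1.0"?><r/>'))
```

In this sketch a single leading space shifts every byte by one position, so none of the declaration patterns match and the detection falls through to the default case. That is the practical reason the declaration has to sit at the very first byte.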
As to having a clearer error message, the confusion comes from the syntax of the XML declaration being very similar to the syntax of a processing instruction. There is no code to recognize an XML declaration anywhere other than at the very start of the document entity, because that is the only place where it is allowed to occur. What is legal at that point in the document is a processing instruction, so the parser sees the '<?' and dispatches to the code that parses the syntax of a processing instruction, which is:

[16] PI       ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
[17] PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

As you can see, the processing instruction target is specifically prohibited from being 'xml', and an appropriate error message is emitted that reflects that restriction.

In any case, continually reopening this same defect is not the best way to have a discussion of such questions about the XML spec or the behavior of Xerces as an implementation of that spec. There are mailing lists for such discussions, some of which are already getting copies of this exchange in Bugzilla, something that is generally considered to be quite inappropriate.
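For anyone who wants to see the dispatch around productions [16], [17] and [23] in action, here is a small Python sketch. It is an assumption-laden illustration, not Xerces source: `classify_markup`, `PITargetError`, and the ASCII-only stand-in for the Name production are all invented for this example.

```python
import re

# Simplified, ASCII-only stand-in for the Name production (hypothetical;
# the real production admits many more characters).
_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:\-]*")

# An XML declaration starts with the literal '<?xml' followed by whitespace.
_XMLDECL_START = re.compile(r"<\?xml[ \t\r\n]")


class PITargetError(ValueError):
    """Raised when a processing instruction uses the reserved target 'xml'."""


def classify_markup(doc: str, pos: int) -> str:
    """Classify the '<?' construct that starts at offset `pos` in `doc`."""
    if not doc.startswith("<?", pos):
        raise ValueError("expected '<?' at offset %d" % pos)

    # The XML declaration is only looked for at the very start of the document.
    if pos == 0 and _XMLDECL_START.match(doc):
        return "XMLDecl"

    # Anywhere else, '<?' can only open a processing instruction.
    match = _NAME.match(doc, pos + 2)
    if match is None:
        raise ValueError("expected a processing instruction target")
    target = match.group()

    # Production [17]: the target may not be 'xml' in any mix of case.
    if target.lower() == "xml":
        raise PITargetError("PI target %r is reserved" % target)
    return "PI:" + target


if __name__ == "__main__":
    print(classify_markup('<?xml version="1.0"?><r/>', 0))
    try:
        # One leading space: the '<?' now sits at offset 1.
        classify_markup(' <?xml version="1.0"?><r/>', 1)
    except PITargetError as exc:
        print("error:", exc)
```

With the leading space, the same text that would have been accepted as an XML declaration at offset 0 is handled as a processing instruction at offset 1, and the only rule it can break there is the reserved-target rule. That is why the error talks about the target rather than about misplaced whitespace.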
