Hi everyone,
I've spent some time looking at the XML 1.1 candidate rec and I've noticed
a number of areas within Xerces which seem to behave contrary to the spec.
The issues that I'm bringing up assume the following:
1) NEL (Unicode 0x85) and LSEP (Unicode 0x2028) are not white space
characters, because the S production (http://www.w3.org/TR/REC-xml#NT-S) is
unchanged from XML 1.0 since there's no modification in the 1.1 spec.
2) The amendment to 2.11 End-Of-Line Handling
(http://www.w3.org/TR/xml11/#sec2.11) means that it's possible for 0x85 and
0x2028 to occur in the XML declaration before the version of the document
is determined to be 1.1.
3) There's some way to force non-normalized 0x85 and 0x2028 into a document
using references to parameter/general entities, such that they can appear
in replacement text as part of markup in places where they're not allowed
(since they're not white space), for example between an element name and
its attribute list.
4) Section 3.3.3 Attribute-Value Normalization
(http://www.w3.org/TR/REC-xml#AVNormalize) from XML 1.0 is unchanged, so
even assuming that 0x85 and 0x2028 are white space, they shouldn't be
replaced with the 0x20 space character.
In general it looks like Xerces is treating 0x85 and 0x2028 as if they were
white space everywhere, not just in the case described by section 2.11,
meaning if my third and/or fourth assumptions are correct there are cases
where these characters are going to be handled incorrectly. I've attached
some patches to this e-mail for some specific problems I've located.
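To make the first assumption concrete, here is a minimal sketch of the distinction I have in mind. It is not Xerces code; the class and method names are made up for illustration. It treats only the four characters of the S production as white space and keeps the two XML 1.1 end-of-line characters in a separate check.

```java
final class XmlWhiteSpaceSketch {
    // S production from XML 1.0: (#x20 | #x9 | #xD | #xA)+
    static boolean isSpace(int c) {
        return c == 0x20 || c == 0x9 || c == 0xD || c == 0xA;
    }

    // Characters that section 2.11 of XML 1.1 adds to end-of-line handling.
    // Under assumption 1 these are NOT part of S.
    static boolean isXML11EndOfLine(int c) {
        return c == 0x85 || c == 0x2028;
    }

    // True if the text between two markup tokens (for example an element
    // name and its first attribute) is a non-empty run matching S.
    static boolean isValidMarkupSeparator(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!isSpace(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Under this reading, a non-normalized 0x85 that arrives through an entity reference would fail isValidMarkupSeparator even though it passes isXML11EndOfLine.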
Patch #1: version-detector
The parser allows 0x85 and 0x2028 to appear in the XML declaration before
it determines that the version of the document is 1.0. Since end-of-line
handling for XML 1.0 documents doesn't include these characters, such
documents cannot be well-formed. XMLVersionDetector consumes some of the input
stream, so it needs to do some clever fixup of the entity before the document
scanner gets hold of it. Unfortunately, once the scanner gets the document
entity, any trace of 0x85 and 0x2028 is gone because they were quietly
normalized away. This makes it impossible for the 1.0 document scanner to
detect 0x85 or 0x2028 in the XML declaration. It looks like this
detection must be done in XMLVersionDetector.
My fix first assumes 1.0 end-of-line handling, and switches to 1.1
end-of-line handling only when a match fails, in order to try to match the
segment of the XML declaration production '<?xml' S 'version' S? '=' S?.
This approach indirectly determines whether 0x85 or 0x2028 appears in this
part of the document, so that an error can be emitted if the document turns
out not to be version 1.1.
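As a rough illustration of the idea, here is a standalone sketch that works on a plain String rather than on the Xerces entity scanner. It is a simplification: it makes a single pass and records a flag instead of switching scanners, and every name in it (XmlDeclVersionSketch, Result, detect) is hypothetical. The outcome it models is the same: 0x85 or 0x2028 before the version value is only acceptable if the version turns out to be 1.1.

```java
final class XmlDeclVersionSketch {
    enum Result { XML_1_0, XML_1_1, XML11_SPACE_INVALID }

    private final String doc;
    private int pos;
    private boolean sawXml11Eol; // set once 0x85 or 0x2028 is skipped as white space

    private XmlDeclVersionSketch(String doc) {
        this.doc = doc;
    }

    // Skips S characters plus the XML 1.1 end-of-line characters,
    // remembering whether any of the latter were seen.
    private boolean skipSpaces() {
        int start = pos;
        while (pos < doc.length()) {
            char c = doc.charAt(pos);
            if (c == 0x85 || c == 0x2028) {
                sawXml11Eol = true;
            } else if (c != 0x20 && c != 0x9 && c != 0xD && c != 0xA) {
                break;
            }
            pos++;
        }
        return pos > start;
    }

    private boolean skipString(String s) {
        if (doc.startsWith(s, pos)) {
            pos += s.length();
            return true;
        }
        return false;
    }

    // The document is not 1.1; that is an error if 1.1-only characters were used.
    private Result notXml11() {
        return sawXml11Eol ? Result.XML11_SPACE_INVALID : Result.XML_1_0;
    }

    static Result detect(String doc) {
        XmlDeclVersionSketch s = new XmlDeclVersionSketch(doc);
        if (!s.skipString("<?xml")) return Result.XML_1_0;
        if (!s.skipSpaces()) return Result.XML_1_0;
        if (!s.skipString("version")) return s.notXml11();
        s.skipSpaces();
        if (!s.skipString("=")) return s.notXml11();
        s.skipSpaces();
        if (s.pos >= doc.length()) return s.notXml11();
        char quote = doc.charAt(s.pos);
        if (quote != '"' && quote != '\'') return s.notXml11();
        s.pos++;
        if (s.skipString("1.1") && s.skipString(String.valueOf(quote))) {
            return Result.XML_1_1;
        }
        return s.notXml11();
    }
}
```

For example, detect("<?xml\u0085version=\"1.0\"?>") yields XML11_SPACE_INVALID, while the same declaration with version "1.1" yields XML_1_1.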
Patch #2: attribute-normalization
The parser will replace 0x85 and 0x2028 with 0x20 when normalizing
attributes. As per my fourth assumption this isn't legal. My patch reverts
white space replacement to the behaviour in the 1.0 scanner (just
replacement of 0x20, 0xD, 0xA, 0x9 with 0x20).
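The intended behaviour can be sketched as follows. This is a standalone illustration that operates on a char array, not the actual method in the 1.0 scanner; the class name and signature are assumptions.

```java
final class AttrWhitespaceSketch {
    // Replaces only the XML 1.0 white space characters
    // (0x20, 0xD, 0xA, 0x9) with 0x20, leaving 0x85 and 0x2028 untouched.
    static void normalizeWhitespace(char[] ch, int offset, int length) {
        int end = offset + length;
        for (int i = offset; i < end; i++) {
            char c = ch[i];
            if (c == 0x20 || c == 0xD || c == 0xA || c == 0x9) {
                ch[i] = ' ';
            }
        }
    }
}
```

With this version, a 0x85 or 0x2028 that reaches attribute-value normalization (for instance via a character reference) survives in the normalized value instead of becoming a space.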
Patch #3: D-85-newline-normalization
It seems like the character sequence #x85 #xA is being normalized to #xA
instead of normalizing #xD #x85 to #xA. My patch is just an example for
scanChar (also fixes normalization of 0x2028). Something similar can be
done in other places in the scanner.
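For reference, this is how I read the end-of-line rules in section 2.11 of the 1.1 spec, written as a small standalone sketch over a String (not the scanner's buffer-based code; the names are hypothetical). The sequences #xD #xA and #xD #x85, the single characters #x85 and #x2028, and any other #xD are each translated to a single #xA.

```java
final class Xml11EolSketch {
    static String normalize(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == '\r') {
                // #xD #xA and #xD #x85 collapse to one #xA;
                // a lone #xD also becomes #xA.
                if (i + 1 < in.length()) {
                    char next = in.charAt(i + 1);
                    if (next == '\n' || next == 0x85) {
                        i++;
                    }
                }
                out.append('\n');
            } else if (c == 0x85 || c == 0x2028) {
                // Single #x85 or #x2028 becomes #xA.
                out.append('\n');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Note that #x85 followed by #xA is two line ends under these rules, which is the case the current scanner collapses into one.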
Index: XML11EntityScanner.java
===================================================================
RCS file: /home/cvspublic/xml-xerces/java/src/org/apache/xerces/impl/XML11EntityScanner.java,v
retrieving revision 1.2
diff -u -r1.2 XML11EntityScanner.java
--- XML11EntityScanner.java 12 Feb 2003 17:10:33 -0000 1.2
+++ XML11EntityScanner.java 1 Apr 2003 03:25:25 -0000
@@ -142,12 +142,13 @@
fCurrentEntity.ch[0] = (char)c;
load(1, false);
}
- if ((c == '\r' || c == 0x85) && external) {
- if (fCurrentEntity.ch[fCurrentEntity.position++] != '\n') {
+ if (c == '\r' && external) {
+ int cc = fCurrentEntity.ch[fCurrentEntity.position++];
+ if (cc != '\n' && cc != 0x85) {
fCurrentEntity.position--;
}
- c = '\n';
}
+ c = '\n';
}
// return character that was scanned
Index: XMLMessages.properties
===================================================================
RCS file: /home/cvspublic/xml-xerces/java/src/org/apache/xerces/impl/msg/XMLMessages.properties,v
retrieving revision 1.17
diff -u -r1.17 XMLMessages.properties
--- XMLMessages.properties 12 Feb 2003 17:10:34 -0000 1.17
+++ XMLMessages.properties 30 Mar 2003 20:38:06 -0000
@@ -27,6 +27,7 @@
QuoteRequiredInXMLDecl = The value following \"{0}\" in the XML declaration must be a quoted string.
XMLDeclUnterminated = The XML declaration must end with \"?>\".
VersionInfoRequired = The version is required in the XML declaration.
+ XML11SpaceInvalidInXMLDecl = An XML 1.1 end-of-line character (Unicode: 0x85 or 0x2028) was found in the XML declaration. The XML declaration must be well-formed, and the version value must be "1.1" in order to process these characters as white space.
SpaceRequiredBeforeVersionInXMLDecl = White space is required before the version pseudo attribute in the XML declaration.
SpaceRequiredBeforeEncodingInXMLDecl = White space is required before the encoding pseudo attribute in the XML declaration.
SpaceRequiredBeforeStandalone = White space is required before the encoding pseudo attribute in the XML declaration.
Index: XMLVersionDetector.java
===================================================================
RCS file: /home/cvspublic/xml-xerces/java/src/org/apache/xerces/impl/XMLVersionDetector.java,v
retrieving revision 1.7
diff -u -r1.7 XMLVersionDetector.java
--- XMLVersionDetector.java 27 Mar 2003 18:23:17 -0000 1.7
+++ XMLVersionDetector.java 30 Mar 2003 20:14:48 -0000
@@ -120,8 +120,10 @@
protected XMLEntityManager fEntityManager;
protected String fEncoding = null;
-
- private XMLString fVersionNum = new XMLString();
+
+ private int fCachedLineNumber;
+
+ private int fCachedColumnNumber;
private final char [] fExpectedVersionString = {'<', '?', 'x', 'm', 'l', ' ',
'v', 'e', 'r', 's',
'i', 'o', 'n', '=', ' ', ' ', ' ', ' ', ' '};
@@ -180,45 +182,178 @@
public short determineDocVersion(XMLInputSource inputSource) throws IOException {
fEncoding = fEntityManager.setupCurrentEntity(fXMLSymbol, inputSource, false, true);
- // must assume 1.1 at this stage so that whitespace
- // handling is correct in the XML decl...
- fEntityManager.setScannerVersion(Constants.XML_VERSION_1_1);
- XMLEntityScanner scanner = fEntityManager.getEntityScanner();
+ // Must assume 1.0 at this stage so that we
+ // can detect XML 1.1 whitespace characters in xmldecl.
+ fEntityManager.setScannerVersion(Constants.XML_VERSION_1_0);
+ XMLEntityScanner scanner = fEntityManager.getEntityScanner();
+ boolean hasSeenXML11WS = false;
+
try {
if (!scanner.skipString("<?xml")) {
- // definitely not a well-formed 1.1 doc!
- return Constants.XML_VERSION_1_0;
- }
- if (!scanner.skipSpaces()) {
- fixupCurrentEntity(fEntityManager, fExpectedVersionString, 5);
- return Constants.XML_VERSION_1_0;
- }
- if (!scanner.skipString("version")) {
- fixupCurrentEntity(fEntityManager, fExpectedVersionString, 6);
- return Constants.XML_VERSION_1_0;
- }
- scanner.skipSpaces();
- if (scanner.scanChar() != '=') {
- fixupCurrentEntity(fEntityManager, fExpectedVersionString, 13);
- return Constants.XML_VERSION_1_0;
+ // definitely not a well-formed 1.1 doc!
+ return Constants.XML_VERSION_1_0;
+ }
+
+ if (!scanner.skipSpaces()) {
+ // Switch to 1.1 scanner.
+ fEntityManager.setScannerVersion(Constants.XML_VERSION_1_1);
+ scanner = fEntityManager.getEntityScanner();
+ cacheCurrentEntityLocation(fEntityManager);
+ hasSeenXML11WS = scanner.skipSpaces();
+ if (!hasSeenXML11WS) {
+ fixupCurrentEntity(fEntityManager, fExpectedVersionString, 5);
+ return Constants.XML_VERSION_1_0;
+ }
+ }
+
+ if (!scanner.skipString("version")) {
+ if (hasSeenXML11WS) {
+ restoreCurrentEntityLocation(fEntityManager);
+ fErrorReporter.reportError(
+ XMLMessageFormatter.XML_DOMAIN,
+ "XML11SpaceInvalidInXMLDecl",
+ null,
+ XMLErrorReporter.SEVERITY_FATAL_ERROR);
+ fixupCurrentEntity(fEntityManager, fExpectedVersionString, 6);
+ return Constants.XML_VERSION_1_0;
+ }
+
+ // Switch to 1.1 scanner.
+ fEntityManager.setScannerVersion(Constants.XML_VERSION_1_1);
+ scanner = fEntityManager.getEntityScanner();
+ cacheCurrentEntityLocation(fEntityManager);
+ hasSeenXML11WS = scanner.skipSpaces();
+ if (hasSeenXML11WS) {
+ if (!scanner.skipString("version")) {
+ restoreCurrentEntityLocation(fEntityManager);
+ fErrorReporter.reportError(
+ XMLMessageFormatter.XML_DOMAIN,
+ "XML11SpaceInvalidInXMLDecl",
+ null,
+ XMLErrorReporter.SEVERITY_FATAL_ERROR);
+ fixupCurrentEntity(fEntityManager, fExpectedVersionString, 6);
+ return Constants.XML_VERSION_1_0;
+ }
+ }
+ else {
+ fixupCurrentEntity(fEntityManager, fExpectedVersionString, 6);
+ return Constants.XML_VERSION_1_0;
+ }
+ }
+
+ scanner.skipSpaces();
+
+ if (!scanner.skipChar('=')) {
+ if (hasSeenXML11WS) {
+ restoreCurrentEntityLocation(fEntityManager);
+ fErrorReporter.reportError(
+ XMLMessageFormatter.XML_DOMAIN,
+ "XML11SpaceInvalidInXMLDecl",
+ null,
+ XMLErrorReporter.SEVERITY_FATAL_ERROR);
+ fixupCurrentEntity(fEntityManager, fExpectedVersionString, 13);
+ return Constants.XML_VERSION_1_0;
+ }
+
+ // Switch to 1.1 scanner.
+ fEntityManager.setScannerVersion(Constants.XML_VERSION_1_1);
+ scanner = fEntityManager.getEntityScanner();
+ cacheCurrentEntityLocation(fEntityManager);
+ hasSeenXML11WS = scanner.skipSpaces();
+ if (hasSeenXML11WS) {
+ if (!scanner.skipChar('=')) {
+ restoreCurrentEntityLocation(fEntityManager);
+ fErrorReporter.reportError(
+ XMLMessageFormatter.XML_DOMAIN,
+ "XML11SpaceInvalidInXMLDecl",
+ null,
+ XMLErrorReporter.SEVERITY_FATAL_ERROR);
+ fixupCurrentEntity(fEntityManager, fExpectedVersionString, 13);
+ return Constants.XML_VERSION_1_0;
+ }
+ }
+ else {
+ fixupCurrentEntity(fEntityManager, fExpectedVersionString, 13);
+ return Constants.XML_VERSION_1_0;
+ }
}
- scanner.skipSpaces();
- int quoteChar = scanner.scanChar();
- fExpectedVersionString[14] = (char) quoteChar;
- for (int versionPos = 0; versionPos < XML11_VERSION.length; versionPos++) {
- fExpectedVersionString[15 + versionPos] = (char) scanner.scanChar();
- }
- // REVISIT: should we check whether this equals quoteChar?
- fExpectedVersionString[18] = (char) scanner.scanChar();
- fixupCurrentEntity(fEntityManager, fExpectedVersionString, 19);
- int matched = 0;
- for (; matched < XML11_VERSION.length; matched++) {
- if (fExpectedVersionString[15 + matched] != XML11_VERSION[matched])
- break;
- }
- if (matched == XML11_VERSION.length)
- return Constants.XML_VERSION_1_1;
- return Constants.XML_VERSION_1_0;
+
+ scanner.skipSpaces();
+
+ if (scanner.skipChar('"')) {
+ fExpectedVersionString[14] = '"';
+ }
+ else if (scanner.skipChar('\'')) {
+ fExpectedVersionString[14] = '\'';
+ }
+ else {
+ if (hasSeenXML11WS) {
+ restoreCurrentEntityLocation(fEntityManager);
+ fErrorReporter.reportError(
+ XMLMessageFormatter.XML_DOMAIN,
+ "XML11SpaceInvalidInXMLDecl",
+ null,
+ XMLErrorReporter.SEVERITY_FATAL_ERROR);
+ fixupCurrentEntity(fEntityManager, fExpectedVersionString, 14);
+ return Constants.XML_VERSION_1_0;
+ }
+
+ // Switch to 1.1 scanner.
+ fEntityManager.setScannerVersion(Constants.XML_VERSION_1_1);
+ scanner = fEntityManager.getEntityScanner();
+ cacheCurrentEntityLocation(fEntityManager);
+ hasSeenXML11WS = scanner.skipSpaces();
+ if (hasSeenXML11WS) {
+ if (scanner.skipChar('"')) {
+ fExpectedVersionString[14] = '"';
+ }
+ else if (scanner.skipChar('\'')) {
+ fExpectedVersionString[14] = '\'';
+ }
+ else {
+ fErrorReporter.reportError(
+ XMLMessageFormatter.XML_DOMAIN,
+ "XML11SpaceInvalidInXMLDecl",
+ null,
+ XMLErrorReporter.SEVERITY_FATAL_ERROR);
+ fixupCurrentEntity(fEntityManager, fExpectedVersionString, 14);
+ return Constants.XML_VERSION_1_0;
+ }
+ }
+ else {
+ fixupCurrentEntity(fEntityManager, fExpectedVersionString, 14);
+ return Constants.XML_VERSION_1_0;
+ }
+ }
+
+ for (int versionPos = 0; versionPos < XML11_VERSION.length; versionPos++) {
+ fExpectedVersionString[15 + versionPos] = (char) scanner.scanChar();
+ }
+
+ fExpectedVersionString[18] = (char) scanner.scanChar();
+
+ int matched = 0;
+ for (; matched < XML11_VERSION.length; matched++) {
+ if (fExpectedVersionString[15 + matched] != XML11_VERSION[matched])
+ break;
+ }
+
+ if (matched == XML11_VERSION.length && fExpectedVersionString[14] == fExpectedVersionString[18]) {
+ fixupCurrentEntity(fEntityManager, fExpectedVersionString, 19);
+ return Constants.XML_VERSION_1_1;
+ }
+ else {
+ if (hasSeenXML11WS) {
+ restoreCurrentEntityLocation(fEntityManager);
+ fErrorReporter.reportError(
+ XMLMessageFormatter.XML_DOMAIN,
+ "XML11SpaceInvalidInXMLDecl",
+ null,
+ XMLErrorReporter.SEVERITY_FATAL_ERROR);
+ }
+ fixupCurrentEntity(fEntityManager, fExpectedVersionString, 19);
+ return Constants.XML_VERSION_1_0;
+ }
// premature end of file
}
catch (EOFException e) {
@@ -257,6 +392,22 @@
System.arraycopy(scannedChars, 0, currentEntity.ch, 0, length);
currentEntity.position = 0;
currentEntity.columnNumber = currentEntity.lineNumber = 1;
+ }
+
+ // Cache column number, and line number of current entity.
+ // For caching position of XML 1.1 end-of-line characters in the xmldecl.
+ private void cacheCurrentEntityLocation(XMLEntityManager manager) {
+ XMLEntityManager.ScannedEntity currentEntity = manager.getCurrentEntity();
+ fCachedColumnNumber = currentEntity.columnNumber;
+ fCachedLineNumber = currentEntity.lineNumber;
+ }
+
+ // Overwrites column number, and line number of current entity.
+ // For error reporting of XML 1.1 end-of-line characters discovered before version determined to be 1.0.
+ private void restoreCurrentEntityLocation(XMLEntityManager manager) {
+ XMLEntityManager.ScannedEntity currentEntity = manager.getCurrentEntity();
+ currentEntity.columnNumber = fCachedColumnNumber;
+ currentEntity.lineNumber = fCachedLineNumber;
}
} // class XMLVersionDetector
Index: XML11DocumentScannerImpl.java
===================================================================
RCS file: /home/cvspublic/xml-xerces/java/src/org/apache/xerces/impl/XML11DocumentScannerImpl.java,v
retrieving revision 1.4
diff -u -r1.4 XML11DocumentScannerImpl.java
--- XML11DocumentScannerImpl.java 9 Dec 2002 18:51:29 -0000 1.4
+++ XML11DocumentScannerImpl.java 31 Mar 2003 03:37:58 -0000
@@ -665,20 +665,6 @@
return dataok;
}
- /**
- * Normalize whitespace in an XMLString converting all whitespace
- * characters to space characters.
- */
- protected void normalizeWhitespace(XMLString value) {
- int end = value.offset + value.length;
- for (int i = value.offset; i < end; i++) {
- int c = value.ch[i];
- if (XML11Char.isXML11Space(c)) {
- value.ch[i] = ' ';
- }
- }
- }
-
// returns true if the given character is not
// valid with respect to the version of
// XML understood by this scanner.
Index: XML11DTDScannerImpl.java
===================================================================
RCS file: /home/cvspublic/xml-xerces/java/src/org/apache/xerces/impl/XML11DTDScannerImpl.java,v
retrieving revision 1.4
diff -u -r1.4 XML11DTDScannerImpl.java
--- XML11DTDScannerImpl.java 9 Dec 2002 18:51:29 -0000 1.4
+++ XML11DTDScannerImpl.java 31 Mar 2003 03:42:36 -0000
@@ -212,20 +212,6 @@
return dataok;
}
- /**
- * Normalize whitespace in an XMLString converting all whitespace
- * characters to space characters.
- */
- protected void normalizeWhitespace(XMLString value) {
- int end = value.offset + value.length;
- for (int i = value.offset; i < end; i++) {
- int c = value.ch[i];
- if (XML11Char.isXML11Space(c)) {
- value.ch[i] = ' ';
- }
- }
- }
-
// returns true if the given character is not
// valid with respect to the version of
// XML understood by this scanner.
-----------------------------
Michael Glavassevich
[EMAIL PROTECTED]
4B Computer Engineering
University of Waterloo