On Thu, 2005-05-05 at 15:24 +0100, John Wilson wrote:
>I'm not sure I follow this either :)
>
>Currently we emit an XML declaration which says we are using
>ISO8859-1 encoding.

The declaration generated depends upon the encoding in use by
XmlWriter, no?

    write(PROLOG_START);
    write(canonicalizeEncoding(enc));
    write(PROLOG_END);

>Unicode code points in the range 0X00 to 0XFF
>have the same value as the ISO8859-1 character values. If we wish to
>send Unicode code points with values > 0XFF then we have to emit
>character references (e.g. &#x1FF;)
>
>If we were to change the encoding to UTF-8 or UTF-16 then we would
>never have to emit character references (though we still could if we
>wanted to).

Like you say below, we'd still have to emit character references for
Unicode code points not allowed in XML documents, yes?

>The XML 1.0 spec forbids some Unicode code points from appearing in a
>well formed XML document (only these code points are allowed: #x9 |
>#xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] -
>see section 2.2 of the spec.). Note that USASCII control characters
>other than HT, CR and NL are not allowed. Using a character reference
>doesn't make any difference: <a>&#0;</a> is not a well formed XML
>document and should be rejected by an XML parser (MinML used not to
>complain about this - later versions do).

Which range do these control characters fall in (everything below
0x20, other than HT, CR and NL)?

>There is another little wrinkle with CR and LF. An XML parser is
>required to "normalise" line endings (see section 2.11 of the spec).
>This normalisation involves replacing CR NL or CR with NL. This
>normalisation does not occur if the CR is represented using a
>character reference.
>
>So a correct XML writer should do the following:
>
>1/ refuse to write characters with Unicode code points which are not
>allowed in an XML document.

Do you suggest throwing an exception here, or writing a '?' character?

>2/ replace characters with a Unicode code point which is not allowed
>in the encoding being used with the appropriate character reference.

For an arbitrary encoding, does anyone know a good way of determining
whether such a character is representable in that encoding?

>3/ replace <,& and > with either the pre defined entities (&lt; etc)
>or with a character reference.

We're already replacing them with predefined entities, so we're in
good shape here.

>4/ replace all CR characters with a character reference.

We do this to keep them from getting normalized by the XML parser, I
take it? Previously, we'd write them literally.

>If we wanted to have the greatest possible chance of interoperating
>we should emit no XML encoding declaration and replace code points
>with values > 0X7F with character references.

I agree with the part about replacing code points with values > 0x7f
with character references (see exchange with Jochen). Can non-ASCII
encodings be determined by the parser using the BOM, or some such
heuristic? Would we write all non-ASCII encodings as UTF-8?

I'm attaching a patch as a discussion piece which implements some of
the ideas discussed in this thread.
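
On the question of whether a character is representable in a given encoding: one option that looks workable is java.nio.charset.CharsetEncoder.canEncode(), available since JDK 1.4. The following is only a rough sketch of John's steps 2/ to 4/ built on it, not what the attached patch does. The class and method names are hypothetical, it uses StringBuilder and String.codePointAt() (so it assumes Java 5), and it assumes the text has already been checked against the set of code points XML allows (step 1/).

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, for illustration only; not part of XmlWriter.
final class EncodingEscapeSketch {

    /**
     * Escapes text for use as XML element content, given the name of the
     * output encoding. Assumes the input contains only code points that
     * XML 1.0 allows; it does no validity checking of its own.
     */
    static String escape(String text, String encodingName) {
        CharsetEncoder encoder = Charset.forName(encodingName).newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // One code point may occupy two chars (a surrogate pair).
            String s = new String(Character.toChars(cp));
            i += s.length();
            if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '\r') {
                // Keep the parser from normalizing CR to LF.
                out.append("&#13;");
            } else if (encoder.canEncode(s)) {
                // Representable in the target encoding: write literally.
                out.append(s);
            } else {
                // Not representable: fall back to a character reference.
                out.append("&#").append(cp).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("x < y \u01ff", "ISO-8859-1"));
        System.out.println(escape("x < y \u01ff", "UTF-8"));
    }
}
```

Passing "US-ASCII" as the encoding name would give the "replace everything above 0x7f" behaviour without hard-coding the 0x7f test, at the cost of creating an encoder per call; a real writer would presumably create it once alongside the stream.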

Index: XmlWriter.java
===================================================================
RCS file: /home/cvs/ws-xmlrpc/src/java/org/apache/xmlrpc/XmlWriter.java,v
retrieving revision 1.13
diff -u -u -r1.13 XmlWriter.java
--- XmlWriter.java  2 May 2005 04:22:21 -0000  1.13
+++ XmlWriter.java  5 May 2005 21:36:38 -0000
@@ -40,6 +40,8 @@
  *
  * @author <a href="mailto:[EMAIL PROTECTED]">Hannes Wallnoefer</a>
  * @author Daniel L. Rall
+ * @see <a href="http://www.xml.com/axml/testaxml.htm">Tim Bray's
+ * Annotated XML Spec</a>
  */
 class XmlWriter extends OutputStreamWriter
 {
@@ -246,6 +248,17 @@
     }
 
     /**
+     * Writes characters like '\r' (0xd) as "&#13;".
+     */
+    private void writeCharacterReference(char c)
+        throws IOException
+    {
+        write("&#");
+        write(String.valueOf((int) c));
+        write(';');
+    }
+
+    /**
      *
      * @param elem
      * @throws IOException
@@ -303,10 +316,13 @@
             switch (c)
             {
             case '\t':
-            case '\r':
             case '\n':
                 write(c);
                 break;
+            case '\r':
+                // Avoid normalization of CR to LF.
+                writeCharacterReference(c);
+                break;
             case '<':
                 write(LESS_THAN_ENTITY);
                 break;
@@ -317,38 +333,18 @@
                 write(AMPERSAND_ENTITY);
                 break;
             default:
-                if (c < 0x20 || c > 0x7f)
+                // Though the XML spec requires XML parsers to support
+                // Unicode, not all such code points are valid in XML
+                // documents.  Additionally, previous to 2003-06-30
+                // the XML-RPC spec only allowed ASCII data (in
+                // <string> elements).  For interoperability with
+                // clients rigidly conforming to the pre-2003 version
+                // of the XML-RPC spec, we entity encode characters
+                // outside of the valid range for ASCII, too.
+                if (c > 0x7f || !isValidXMLChar(c))
                 {
-                    // Though the XML-RPC spec allows any ASCII
-                    // characters except '<' and '&', the XML spec
-                    // does not allow this range of characters,
-                    // resulting in a parse error from most XML
-                    // parsers.  However, the XML spec does require
-                    // XML parsers to support UTF-8 and UTF-16.
-                    if (isUnicode)
-                    {
-                        if (c < 0x20)
-                        {
-                            // Entity escape the character.
-                            write("&#");
-                            // ### Do we really need the String conversion?
-                            write(String.valueOf((int) c));
-                            write(';');
-                        }
-                        else // c > 0x7f
-                        {
-                            // Write the character in our encoding.
-                            write(new String(String.valueOf(c).getBytes(enc)));
-                        }
-                    }
-                    else
-                    {
-                        throw new XmlRpcException(0, "Invalid character data "
-                                                  + "corresponding to XML "
-                                                  + "entity &#"
-                                                  + String.valueOf((int) c)
-                                                  + ';');
-                    }
+                    // Replace the code point with a character reference.
+                    writeCharacterReference(c);
                 }
                 else
                 {
@@ -358,6 +354,35 @@
         }
     }
 
+    /**
+     * Section 2.2 of the XML spec describes which Unicode code points
+     * are valid in XML:
+     *
+     * <blockquote><code>#x9 | #xA | #xD | [#x20-#xD7FF] |
+     * [#xE000-#xFFFD] | [#x10000-#x10FFFF]</code></blockquote>
+     *
+     * Code points outside this set must be entity encoded to be
+     * represented in XML.
+     *
+     * @param c The character to inspect.
+     * @return Whether the specified character is valid in XML.
+     */
+    private static final boolean isValidXMLChar(char c)
+    {
+        switch (c)
+        {
+        case 0x9:
+        case 0xa:  // line feed, '\n'
+        case 0xd:  // carriage return, '\r'
+            return true;
+
+        default:
+            return ( (0x20 <= c && c <= 0xd7ff) ||
+                     (0xe000 <= c && c <= 0xfffd) ||
+                     (0x10000 <= c && c <= 0x10ffff) );
+        }
+    }
+
     protected static void setTypeDecoder(TypeDecoder newTypeDecoder)
     {
         typeDecoder = newTypeDecoder;
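
One caveat for discussion: a Java char holds a single UTF-16 code unit, so a char-based check can never match the [#x10000-#x10FFFF] range directly; those code points arrive as surrogate pairs. A minimal sketch of the same section 2.2 test over int code points follows. The class and method names are hypothetical, and it assumes Java 5 for String.codePointAt() and Character.charCount().

```java
// Hypothetical helper, for illustration only; not part of the patch.
final class XmlCharsSketch {

    /** Tests a code point against the Char production of XML 1.0, section 2.2. */
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the index of the first char that begins a code point not
     * allowed in XML, or -1 if the whole string is acceptable. An unpaired
     * surrogate is reported as invalid, because codePointAt() returns it
     * as a value in the excluded 0xD800-0xDFFF range.
     */
    static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isValidXmlCodePoint(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidIndex("plain text"));
        System.out.println(firstInvalidIndex("bell\u0007"));
    }
}
```

A writer could call firstInvalidIndex() up front and either throw or substitute '?', whichever we settle on for step 1/.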